oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Using a different attribute name than "locator" in CDXJ #41

Open machawk1 opened 7 years ago

machawk1 commented 7 years ago

The value for this field is a URN, not a "locator" per se. @ibnesayeed Do you have a suggestion for a better name? @phonedude noted this at one point.

ibnesayeed commented 7 years ago

It's a difficult situation. It could be a name, an identifier, or a location. The role of this field in this context might change in a broader perspective, or even when this system evolves to other models. Finding a term that is generic enough while being accurate is challenging. I will give it more thought.

machawk1 commented 7 years ago

Any further thoughts since October about a better name, @ibnesayeed ?

ibnesayeed commented 7 years ago

We can perhaps call it urn or uri.

machawk1 commented 7 years ago

@ibnesayeed Those seem fitting albeit not nearly as "user-friendly" as "locator", which might be a moot point if the intention of ipwb CDXJs is to be machine readable. Any other recommendations beyond urn or uri before we switch over, @phonedude ?

phonedude commented 7 years ago

I've lost the thread -- where is "locator" used as an attribute?

ibnesayeed commented 7 years ago

@phonedude in the CDXJ (index) files we store references to the hashes of the headers and payload blocks of responses in the following manner.

- - {"..": "..", "locator": "urn:ipfs/{header_digest}/{payload_digest}", "..": ".."}

The term locator was something that @weiglemc questioned: does it really tell us the location of the resources? That's why we were looking for better alternatives.
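To make the format above concrete, here is a minimal sketch of reading such a line and splitting the value into its two digests. The function names, the sample SURT key, the timestamp, the digests, and the extra `mime_type` attribute are all made up for illustration. The sketch assumes the JSON block starts at the first `{` on the line. It is not taken from the ipwb code base.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ line into its leading fields and its JSON attribute block."""
    # Assumption: everything before the first '{' is the space-separated key part.
    prefix, brace, rest = line.partition("{")
    if not brace:
        raise ValueError("no JSON block found in CDXJ line")
    return prefix.split(), json.loads(brace + rest)


def split_ipfs_locator(value):
    """Return (header_digest, payload_digest) from 'urn:ipfs/<header>/<payload>'."""
    scheme, _, path = value.partition("/")
    parts = path.split("/")
    if scheme != "urn:ipfs" or len(parts) != 2 or not all(parts):
        raise ValueError(f"unexpected locator value: {value!r}")
    return parts[0], parts[1]


# Hypothetical sample line; the digests are placeholders, not real IPFS hashes.
line = (
    'com,example)/ 20170101000000 '
    '{"locator": "urn:ipfs/QmHeader/QmPayload", "mime_type": "text/html"}'
)
fields, attrs = parse_cdxj_line(line)
header_digest, payload_digest = split_ipfs_locator(attrs["locator"])
```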

phonedude commented 7 years ago

Definitely should not be called a "locator", since that would suggest URL, which it clearly is not. URI or URN would be more accurate, but repetitive and not nearly as descriptive as something like "header-payload-digests".

ibnesayeed commented 7 years ago

I would stay away from something like header-payload-digests because we are thinking about it in more general terms, so that the same field can be used in other replay systems such as OWB or PyWB, where the field would hold a reference to the corresponding WARC file with offset and length, like urn:warcs/{offset}/{length}/{warc_file_name_or_path}/. In fact, the upcoming model of IPWB is planned to not have references to the header and payload, but a single standard ipfs: URI reference to a memento node that will internally point to all the related pieces using IPLD.

machawk1 commented 6 years ago

Any further thoughts on this naming, @ibnesayeed? Could the field value ever be a uri but not a urn?

Once we change this name, should we have some adaptation considerations for older versions of ipwb that used locator?

ibnesayeed commented 6 years ago

> Any further thoughts on this naming

I don't have a good name right now.

> Could the field value ever be a uri but not a urn?

Yes! The reason we used this style in the first place, rather than keeping the header and payload hashes under separate attributes, was so that we could generalize it. If a record is stored at an HTTP URL we can use that directly, or if the content is to be fetched from a WARC file we can have something like urn:warcs/{offset}/{length}/{warc_file_name_or_path}/. So, it was a generalization effort.
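As a rough illustration of that generalization, a replay tool could branch on the prefix of the value. The sketch below is hypothetical: `describe_reference` and the returned dictionary layout are invented here, and the `urn:warcs/` form is only the shape proposed in this thread, not an established scheme.

```python
from urllib.parse import urlsplit


def describe_reference(value):
    """Classify a reference value by its prefix and pull out its parts."""
    if value.startswith("urn:ipfs/"):
        # Current ipwb style: header digest, then payload digest.
        header_digest, payload_digest = value[len("urn:ipfs/"):].split("/", 1)
        return {
            "kind": "ipfs",
            "header_digest": header_digest,
            "payload_digest": payload_digest,
        }
    if value.startswith("urn:warcs/"):
        # Shape proposed in this thread: offset, length, then a file name or path.
        offset, length, path = value[len("urn:warcs/"):].split("/", 2)
        return {
            "kind": "warc",
            "offset": int(offset),
            "length": int(length),
            "path": path.rstrip("/"),
        }
    if urlsplit(value).scheme in ("http", "https"):
        # A record stored at an HTTP URL can be used directly.
        return {"kind": "http", "url": value}
    raise ValueError(f"unrecognized reference: {value!r}")


# Hypothetical values, for illustration only.
warc_ref = describe_reference("urn:warcs/1024/2048/crawl/a.warc.gz/")
ipfs_ref = describe_reference("urn:ipfs/QmHeader/QmPayload")
http_ref = describe_reference("https://example.com/records/1")
```

Keeping everything in one attribute means that only this kind of dispatch needs to change when a new storage backend is added. The CDXJ attribute set stays the same.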

> Once we change this name, should we have some adaptation considerations for older versions of ipwb that used locator?

Changing this name is about standardizing the terminology used in CDXJ files for archival indexing purposes, irrespective of the tool they are used in. Once such a change is made, we will have a few choices: 1) have a fallback keyword in the replay that looks for the old name for a while, 2) provide a migration script/command that converts old CDXJ files to the new style, or 3) if the user base of the tool is small, just introduce this breaking change and announce it in the release notes and the README file.
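Options 1 and 2 could look roughly like the sketch below. The new attribute name `uri` is only a placeholder, since no name has been settled in this thread. The helper names are hypothetical, and treating lines that start with `!` as metadata to pass through unchanged is an assumption about the index files.

```python
import json

OLD_KEY = "locator"
NEW_KEY = "uri"  # placeholder only; the thread has not settled on a name


def get_reference(attrs):
    """Option 1: prefer the new attribute, fall back to the old one."""
    if NEW_KEY in attrs:
        return attrs[NEW_KEY]
    return attrs[OLD_KEY]  # raises KeyError if neither attribute is present


def migrate_line(line):
    """Option 2: rewrite one CDXJ line so the old attribute uses the new name."""
    line = line.rstrip("\n")
    prefix, brace, rest = line.partition("{")
    # Assumption: '!' lines are metadata; lines without JSON are left alone too.
    if line.startswith("!") or not brace:
        return line
    attrs = json.loads(brace + rest)
    if OLD_KEY in attrs and NEW_KEY not in attrs:
        # Rebuild the dict so the renamed key keeps its original position.
        attrs = {(NEW_KEY if k == OLD_KEY else k): v for k, v in attrs.items()}
    return prefix + json.dumps(attrs)


# Hypothetical sample line, for illustration only.
old_line = 'com,example)/ 20170101000000 {"locator": "urn:ipfs/a/b", "mime_type": "text/html"}\n'
new_line = migrate_line(old_line)
```

Running `migrate_line` on an already migrated line leaves the attribute names unchanged, so the migration can be repeated safely. Note that `json.dumps` may normalize whitespace inside the JSON block.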