oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Sanitize URIs in TimeMaps #295

Closed ibnesayeed closed 6 years ago

ibnesayeed commented 6 years ago

It is a good idea to accept requests with or without the protocol part in the URI-R, but when generating TimeMaps we should add the protocol in every place where it makes sense.

machawk1 commented 6 years ago

@ibnesayeed I am aware of Archive-It using the schemeless (e.g., //odu.edu) approach if a scheme is not present. It seems wrong to artificially add a scheme when the site may have been served under a different one and the scheme is simply missing because of the archiving tool. Thoughts?

ibnesayeed commented 6 years ago

We do have that information in the CDXJ under the original URI field. So, in URI-Ms, report based on what was captured, or if your replay system does not distinguish between URIs with or without a scheme, then you can stick with one approach and be consistent about it. URI-Ms are the promise made by the replay system that whatever URI-M is reported there will produce the intended result and ideally won't change in the future. Apart from URI-Ms, there are other places, such as the relations self and original, where sanitized URIs can go.

machawk1 commented 6 years ago

To verify, the task you request is to change <memento.us/>; rel="original", in Link TimeMaps and !meta {"original_uri": "memento.us/"} in CDXJ TimeMaps to be <http://memento.us/>; rel="original", and !meta {"original_uri": "http://memento.us/"}, respectively?

I can foresee this breaking tools (e.g., Mink) that look to the URI-R to "jump back" to the live Web only to find that the URI-R was never served under that scheme (be it http or https).

I also don't think "//memento.us" is proper here, as it banks on the relativity of the contextual representation (i.e., the TimeMap) per RFC3986. Then again, per that same RFC, a URI in the context of the Link-formatted TimeMaps (whose fundamental syntax is defined in RFC8288) must include a scheme. Thus, the URI-R in the Link-formatted TimeMap must have a scheme for the TimeMap to be Memento compliant.
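To make the point about relative resolution concrete, here is a minimal sketch using Python's standard `urllib.parse`. The TimeMap URI `https://archive.example/timemap/link/memento.us/` is a made-up base used only for illustration; it is not an actual ipwb endpoint.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical TimeMap URI, used only as the base for reference resolution.
timemap_uri = "https://archive.example/timemap/link/memento.us/"

# A network-path reference ("//host/") inherits the scheme of the base
# document, so its meaning depends on how the TimeMap itself was served.
network_path = urljoin(timemap_uri, "//memento.us/")

# A bare "memento.us/" has neither scheme nor authority, so a strict
# resolver treats it as a relative path under the TimeMap URI.
bare = urljoin(timemap_uri, "memento.us/")

print(network_path)                  # https://memento.us/
print(bare)                          # https://archive.example/timemap/link/memento.us/memento.us/
print(urlsplit("memento.us/").scheme == "")  # True
```

Under these assumptions, neither schemeless form tells a client which scheme the original resource was served under.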

What do you think is the right approach to artificially tacking a scheme onto a URI-R in a Link-formatted TimeMap, @phonedude?

ibnesayeed commented 6 years ago

To verify, the task you request is to change <memento.us/>; rel="original", in Link TimeMaps and !meta {"original_uri": "memento.us/"} in CDXJ TimeMaps to be <http://memento.us/>; rel="original", and !meta {"original_uri": "http://memento.us/"}, respectively?

Yes!

I can foresee this breaking tools (e.g., Mink) that look to the URI-R to "jump back" to the live Web only to find that the URI-R was never served under that scheme (be it http or https).

If you read the raw value of rel=original (without inferring the protocol from the context of the TimeMap URI), tools like Mink will behave no differently, because they will default to http:. Not adding a protocol does not magically tell the client what the original protocol was.
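The sanitization discussed in this thread could be sketched roughly as follows. This is not ipwb's actual code: `sanitize_urir` and its `captured_scheme` and `default_scheme` parameters are hypothetical names, and the sketch assumes the scheme recorded at capture time (if any) is available from the CDXJ record.

```python
import re

# Matches a leading scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def sanitize_urir(urir, captured_scheme=None, default_scheme="http"):
    """Return the URI-R with a scheme, preferring the captured one.

    Hypothetical helper: falls back to default_scheme when no scheme
    was recorded, mirroring what clients would assume anyway.
    """
    if _SCHEME_RE.match(urir):
        return urir  # already absolute; leave untouched
    scheme = captured_scheme or default_scheme
    if urir.startswith("//"):
        return f"{scheme}:{urir}"  # schemeless network-path form
    return f"{scheme}://{urir}"


# Link-formatted TimeMap entry for the original resource.
link_entry = f'<{sanitize_urir("memento.us/")}>; rel="original",'
print(link_entry)  # <http://memento.us/>; rel="original",
```

Passing `captured_scheme="https"` would yield `https://memento.us/` instead, so the fallback only applies when the capture carries no scheme information.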