oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Archived non-absolute HTTP redirects are not replayed correctly #456

Open machawk1 opened 6 years ago

machawk1 commented 6 years ago

According to RFC7231, the HTTP Location response header may be a relative reference rather than an absolute URI, e.g., Location: /foo.html.

When replaying a redirect from http://localhost:5000/memento/20180727203127/example.com/ to http://localhost:5000/memento/20180727203127/example.com/anotherURI as in https://github.com/oduwsdl/ipwb/blob/master/ipwb/samples/warcs/redirectRelative.warc, the user is instead redirected to http://localhost:5000/memento/20180727203127//anotherURI.

This has to do with the logic of URI resolution in the replay system. Ideally, the ServiceWorker would handle this, but it does not appear to do so.
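To illustrate the difference, here is a minimal Python sketch of the faulty and the expected behavior. It is not ipwb's actual code: the function names, the `replay_root` default, and the assumption that a scheme-less URI-R means `http` are all hypothetical and chosen only to mirror the URLs in this report.

```python
from urllib.parse import urljoin


def naive_replay_url(datetime14, location, replay_root="http://localhost:5000"):
    # Treats the raw Location value as if it were the full URI-R (the bug).
    return f"{replay_root}/memento/{datetime14}/{location}"


def resolve_archived_location(urir, location):
    # Resolve the archived Location value against the URI-R that produced it.
    # Assumption: a scheme-less URI-R such as "example.com/" is treated as http.
    has_scheme = "://" in urir
    base = urir if has_scheme else "http://" + urir
    resolved = urljoin(base, location)
    # Return the same (scheme-less or not) form that was passed in.
    return resolved if has_scheme else resolved.split("://", 1)[1]


def replay_url(datetime14, urir, location, replay_root="http://localhost:5000"):
    # Build the replay URL from the resolved target instead of the raw value.
    target = resolve_archived_location(urir, location)
    return f"{replay_root}/memento/{datetime14}/{target}"
```

With the values from this report, `naive_replay_url("20180727203127", "/anotherURI")` yields the broken `//anotherURI` form, while `replay_url("20180727203127", "example.com/", "/anotherURI")` yields the expected redirect target.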

ibnesayeed commented 6 years ago

I have been making some changes in Reconstructive, after which server-side rewriting of the Location header will no longer be needed.

machawk1 commented 6 years ago

Ok, are you planning on merging those changes into ipwb soon or is the work needed in Reconstructive extensive?

ibnesayeed commented 6 years ago

I will perhaps merge it tomorrow. I have made the necessary changes, but did not get a chance to test them. I left those uncommitted changes on my office machine and came home.

ibnesayeed commented 6 years ago

It looks like this is not currently possible, as the necessary information (for example, the Location header) is not exposed when redirects are handled in manual mode (which is necessary to maintain the SW scope boundary). So, for now, we will have to rewrite it on the server side, but only when the location is an absolute URL (i.e., it matches /^https?:\/\//i).
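A rough Python sketch of that server-side rule follows. The function name `rewrite_location`, the `replay_root` default, and the replay URL layout are assumptions for illustration, not ipwb's actual implementation.

```python
import re

# Same test as the /^https?:\/\//i pattern mentioned above.
ABSOLUTE_HTTP = re.compile(r"^https?://", re.IGNORECASE)


def rewrite_location(location, datetime14, replay_root="http://localhost:5000"):
    # Absolute URLs would leave the SW scope, so point them back at the replay.
    if ABSOLUTE_HTTP.match(location):
        return f"{replay_root}/memento/{datetime14}/{location}"
    # Non-absolute values are passed through unchanged for the browser to follow.
    return location
```

For example, `rewrite_location("/anotherURI", "20180727203127")` returns the value untouched, while an `http://` or `https://` value is prefixed with the replay root.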

Related discussions:

machawk1 commented 6 years ago

@ibnesayeed Is the Location header accessible to the SW when the URL is not absolute?

ibnesayeed commented 6 years ago

No. For the response type opaqueredirect, status is 0, status message is the empty byte sequence, header list is empty, body is null, and trailer is empty. The information is present, but not exposed to JS/SW. The system works for non-absolute URIs because the SW lets each redirect response go to the browser; when the browser then follows the redirect, the SW catches the request and makes the necessary changes to it. In the case of absolute URIs, the follow-up request goes out of the scope of the SW.

machawk1 commented 6 years ago

@ibnesayeed I think this ticket is still an issue. This may be due to our current implementation of server-side rewriting of the Location header to scheme://host/memento/date/urir. ipwb replay is interpreting /somefragment in an archive's Location: /somefragment as "/somefragment" being the urir instead of resolving it to archivedhost/somefragment.

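To fix this, we need to resolve a relative Location value against the effective request URI, as RFC7231 (Section 7.1.2) specifies. The same section also says that when the Location value has no fragment, a user agent "MUST process the redirection as if the value inherits the fragment component of the URI reference used to generate the request target".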
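As a minimal sketch of the RFC7231 redirect processing referred to here, the Python function below resolves a Location value against the request URI and inherits the request's fragment when the Location value has none. The function name is hypothetical. Fragment inheritance is a user-agent behavior (fragments are not sent to the server), so only the resolution step would apply on the server side.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def resolve_redirect(request_uri, location):
    # Resolve a possibly relative Location value against the request URI.
    target = urlsplit(urljoin(request_uri, location))
    # If the Location value carries no fragment, inherit the request's fragment.
    if not target.fragment:
        target = target._replace(fragment=urlsplit(request_uri).fragment)
    return urlunsplit(target)
```

Under these assumptions, a request for `http://example.com/page#sec` that is redirected with `Location: /anotherURI` resolves to `http://example.com/anotherURI#sec`.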

ibnesayeed commented 6 years ago

Changes in #461 might fix it, but they need to be tested in different scenarios.

machawk1 commented 6 years ago

We need to handle non-absolute (path-only) redirects on the server side, as the client is unable to access the Location header.