oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Some URLs not loaded from localhost though present in cdxj #335

Closed akavel closed 6 years ago

akavel commented 6 years ago

With the attached .warc.gz and the attached .cdxj (zipped), when opening http://localhost:5000/20171207224241/serv.peterme.net/cross-platform-guis-and-nim-macros.html, not all resources are loaded from localhost. Some of them are still pulled from the Web, though they seem to be present both in the .warc and in the .cdxj. From the Firefox console, those seem to be:

which seems to roughly match the "memento/null/..." ones in the log below:

C:\dnload\ipfs-etc>\Python27\python.exe ipwb/ipwb replay gui-library-for-nim.cdxj
IPWB replay started on http://localhost:5000
CDXJ Line: net,peterme,serv)/cross-platform-guis-and-nim-macros.html 20171207224241 {"locator": "urn:ipfs/QmU9SkG1gVK7tAAocVjyMDXjZRyByXbCWRBFC8Ednr4MDc/QmNmkkFGiPyxsU1vXT3rLbRN4HJA3Bi8FeRXojAwtVCeFH", "mime_type": "text/html", "status_code": "200"}
Getting CDXJ Lines with serv.peterme.net/cross-platform-guis-and-nim-macros.html in gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R https://fonts.googleapis.com/css?family=Open+Sans from gui-library-for-nim.cdxj
Getting CDXJ Lines with https://fonts.googleapis.com/css?family=Open+Sans in gui-library-for-nim.cdxj
Could not find com,googleapis,fonts)/css?family=open+sans?family=open+sans 20171207224241 in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with fonts.googleapis.com/css?family=open+sans?family=Open+Sans in gui-library-for-nim.cdxj
Could not find com,googleapis,fonts)/css?family=open+sans?family=open+sans in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at fonts.googleapis.com/css?family=open+sans?family=Open+Sans
[]
Getting CDXJ Lines with fonts.googleapis.com/css?family=open+sans?family=Open+Sans in gui-library-for-nim.cdxj
Could not find com,googleapis,fonts)/css?family=open+sans?family=open+sans in CDXJ at gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R https://fonts.googleapis.com/css?family=Arvo:700 from gui-library-for-nim.cdxj
Getting CDXJ Lines with https://fonts.googleapis.com/css?family=Arvo:700 in gui-library-for-nim.cdxj
Could not find com,googleapis,fonts)/css?family=arvo:700?family=arvo:700 20171207224242 in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with fonts.googleapis.com/css?family=arvo:700?family=Arvo:700 in gui-library-for-nim.cdxj
Could not find com,googleapis,fonts)/css?family=arvo:700?family=arvo:700 in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at fonts.googleapis.com/css?family=arvo:700?family=Arvo:700
[]
Getting CDXJ Lines with fonts.googleapis.com/css?family=arvo:700?family=Arvo:700 in gui-library-for-nim.cdxj
Could not find com,googleapis,fonts)/css?family=arvo:700?family=arvo:700 in CDXJ at gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/styles.css from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/styles.css in gui-library-for-nim.cdxj
CDXJ Line: net,peterme,serv)/styles.css 20171207224242 {"locator": "urn:ipfs/QmTB2Pwh7yWTj9Ae9HgncUVh1B9ThmjGyRyxxM4JvUvXAn/QmPcF5ojR9SrPuFzyPxhQcV7QiGcnAtYEYJQyRngwkCVsP", "mime_type": "text/css", "status_code": "200"}
Getting CDXJ Lines with serv.peterme.net/styles.css in gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/cmun-serif.css from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/cmun-serif.css in gui-library-for-nim.cdxj
CDXJ Line: net,peterme,serv)/cmun-serif.css 20171207224242 {"locator": "urn:ipfs/QmPbpqTGYg5yXEghYPVCgUwWRa1vmFWbZytG82JUxXJhEB/QmSt8oLGBo38ZoVBPxm4zYkp3s6C9ndHTw1krDetvYy9Xt", "mime_type": "text/css", "status_code": "200"}
Getting CDXJ Lines with serv.peterme.net/cmun-serif.css in gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/github.css from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/github.css in gui-library-for-nim.cdxj
CDXJ Line: net,peterme,serv)/github.css 20171207224242 {"locator": "urn:ipfs/QmWTyLd2WC68N5zpE1HbVcdUc9AQJ8jSdLV6FoDuUyFJY6/QmbqrjPfRrBeq3Ve1r4pTQ5FWugzJ9vRAFD9QRoMjVZwjo", "mime_type": "text/css", "status_code": "200"}
Getting CDXJ Lines with serv.peterme.net/github.css in gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/highlight.pack.js from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/highlight.pack.js in gui-library-for-nim.cdxj
Could not find net,peterme,serv)/highlight.pack.js in CDXJ at gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/img/back.svg from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/img/back.svg in gui-library-for-nim.cdxj
CDXJ Line: net,peterme,serv)/img/back.svg 20171207224242 {"locator": "urn:ipfs/QmSmNGq7Fbq9bKFEJEMNXDXAMDYCMgEeb6pGYMQR8WtF5o/QmdS4GdtSt8EUbQ2GZSaBVNRTRFAVAD6exB3811KM2EMkb", "mime_type": "image/svg+xml", "status_code": "200"}
Getting CDXJ Lines with serv.peterme.net/img/back.svg in gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/img/note.svg from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/img/note.svg in gui-library-for-nim.cdxj
CDXJ Line: net,peterme,serv)/img/note.svg 20171207224242 {"locator": "urn:ipfs/QmSDZUoDHQ264X5ccjLbX2dqRPByHpZHkFti51MDko8WgN/QmRNfesY2DaBuqNMR8mHGS6qGuwNdsvFN4xJiE3sByKEQ8", "mime_type": "image/svg+xml", "status_code": "200"}
Getting CDXJ Lines with serv.peterme.net/img/note.svg in gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/img/rss.svg from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/img/rss.svg in gui-library-for-nim.cdxj
CDXJ Line: net,peterme,serv)/img/rss.svg 20171207224242 {"locator": "urn:ipfs/Qmbv4eh4RANiQP3GDoAaYxpwYDAdnATUUXTDxBFogUi2pB/QmcaQaxF5XQGfpc32msacVuh5vEdx2m3xfy4L2ztkEA91k", "mime_type": "image/svg+xml", "status_code": "200"}
Getting CDXJ Lines with serv.peterme.net/img/rss.svg in gui-library-for-nim.cdxj
Getting CDXJ Lines with the URI-R http://serv.peterme.net/functions.js from gui-library-for-nim.cdxj
Getting CDXJ Lines with http://serv.peterme.net/functions.js in gui-library-for-nim.cdxj
Could not find net,peterme,serv)/functions.js in CDXJ at gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/img/link.svg in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with memento/null/http://serv.peterme.net/img/link.svg in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/img/link.svg in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at memento/null/http://serv.peterme.net/img/link.svg
[]
Getting CDXJ Lines with memento/null/http://serv.peterme.net/img/link.svg in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/img/link.svg in CDXJ at gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunbx.woff in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunbx.woff in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunbx.woff in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at memento/null/http://serv.peterme.net/cmunbx.woff
[]
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunbx.woff in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunbx.woff in CDXJ at gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunrm.woff in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunrm.woff in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunrm.woff in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at memento/null/http://serv.peterme.net/cmunrm.woff
[]
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunrm.woff in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunrm.woff in CDXJ at gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunti.woff in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunti.woff in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunti.woff in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at memento/null/http://serv.peterme.net/cmunti.woff
[]
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunti.woff in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunti.woff in CDXJ at gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunbx.ttf in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunbx.ttf in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunbx.ttf in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at memento/null/http://serv.peterme.net/cmunbx.ttf
[]
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunbx.ttf in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunbx.ttf in CDXJ at gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunrm.ttf in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunrm.ttf in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunrm.ttf in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at memento/null/http://serv.peterme.net/cmunrm.ttf
[]
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunrm.ttf in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunrm.ttf in CDXJ at gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunti.ttf in CDXJ at gui-library-for-nim.cdxj
CDXJ Line: None
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunti.ttf in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunti.ttf in CDXJ at gui-library-for-nim.cdxj
CDXJ lines with URI-R at memento/null/http://serv.peterme.net/cmunti.ttf
[]
Getting CDXJ Lines with memento/null/http://serv.peterme.net/cmunti.ttf in gui-library-for-nim.cdxj
Could not find memento)/null/http:/serv.peterme.net/cmunti.ttf in CDXJ at gui-library-for-nim.cdxj

I believe this may be coming from serviceWorker.js, though I'm not 100% sure.

ibnesayeed commented 6 years ago

I have a feeling that this problem is happening in the ServiceWorker implementation, where it is failing to reroute requests properly. This could be due to either wrong or incomplete rerouting logic, or to the ServiceWorker not being registered in the first place. Are you using Firefox in private browsing mode? If so, that disables ServiceWorker. It was also disabled in the FF 45 and FF 52 ESR versions.

machawk1 commented 6 years ago

I can replicate the issue in Chrome. One theory is that the SW is not parsing the datetime out of the request URL correctly (hence null) and is then forwarding the request for /memento/null/{URI-R} back to the replay system, which fails.
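A minimal sketch of how a null datetime could end up in the path, assuming the rerouted URL is built by plain string concatenation. The `buildRerouteUrl` helper and its parameters are hypothetical and are not taken from serviceWorker.js:

```javascript
// Hypothetical helper: builds the replay URL for an embedded resource.
// Assumption: the datetime is concatenated into the path without validation.
function buildRerouteUrl (origin, datetime, uriR) {
  return origin + '/memento/' + datetime + '/' + uriR
}

// With a datetime, the URL points at a specific capture.
const ok = buildRerouteUrl('http://localhost:5000', '20171207224241', 'http://serv.peterme.net/styles.css')

// With null, JavaScript coerces the value to the string "null".
const broken = buildRerouteUrl('http://localhost:5000', null, 'http://serv.peterme.net/img/link.svg')

console.log(ok)
console.log(broken)
```

The second URL has the same shape as the "memento/null/..." entries in the replay log.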

akavel commented 6 years ago

I modified the SW slightly to display more info:

    request = reroute(event.request, referrerDatetime) // Only embedded resources
    //console.log('REROUTING request for ' + event.request.url + ' to ' + request.url)
    console.log('REROUTING request for ' + event.request.url + ' to ' + request.url + ' REF ' + event.request.referrer + ' ***')
    //console.log('REROUTING request for ' + event.request.url + ' to ' + request.url + ' for ' + JSON.stringify(event.request, null, 4))
    let m = event.request.referrer.match(/\/([0-9]{14})\//)
    console.log(m)
    if (m !== null) {
      m = m[1]
    }
    console.log(m)

and at some point I'm starting to get log lines like below:

REROUTING request for http://serv.peterme.net/img/link.svg to http://localhost:5000/memento/null/http://serv.peterme.net/img/link.svg REF http://serv.peterme.net/styles.css
null
null
REROUTING request for http://serv.peterme.net/cmunbx.woff to http://localhost:5000/memento/null/http://serv.peterme.net/cmunbx.woff REF http://serv.peterme.net/cmun-serif.css
null
null

though in the same console earlier logs were ok:

REROUTING request for http://serv.peterme.net/styles.css to http://localhost:5000/memento/20171207224241/http://serv.peterme.net/styles.css REF http://localhost:5000/20171207224241/serv.peterme.net/cross-platform-guis-and-nim-macros.html
Array [ "/20171207224241/", "20171207224241" ]
20171207224241

or

REROUTING request for http://serv.peterme.net/styles.css to http://localhost:5000/memento/20171207224241/http://serv.peterme.net/styles.css REF http://localhost:5000/20171207224241/serv.peterme.net/cross-platform-guis-and-nim-macros.html
Array [ "/20171207224241/", "20171207224241" ]
20171207224241

I have no idea what to debug further at this point.

machawk1 commented 6 years ago

The event argument in self.addEventListener('fetch', function (event) { within serviceWorker.js normally carries a referrer that is a URI-M with an embedded datetime. For example, http://localhost:5000/20171207224241/serv.peterme.net/cross-platform-guis-and-nim-macros.html appears in the referrer attribute of the event passed in.

For some URI-Ms, like the ones @akavel listed, the referrer property of the event passed in is null.

akavel commented 6 years ago

@machawk1 Actually, the referrer does not seem to be null; it just lacks the datestamp (so referrerDatetime is null). See the examples I posted just above (the string after REF).

ibnesayeed commented 6 years ago

I was looking at how referrerDatetime is extracted, and I can see there is no fallback for situations where the referrer property is missing or does not match the RegEx pattern.
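To illustrate, here is a small sketch that applies the same pattern as the debug snippet earlier in the thread to the two kinds of referrer seen in the console logs. The `extractDatetime` name is an assumption made for illustration, not the function used in serviceWorker.js:

```javascript
// Hypothetical helper mirroring the regex from the debug snippet.
// Returns the 14-digit datetime found in a URI-M, or null when there is none.
function extractDatetime (referrer) {
  const match = (referrer || '').match(/\/([0-9]{14})\//)
  return match === null ? null : match[1]
}

// Referrer is the replayed HTML page (a URI-M): a datetime is found.
const fromPage = extractDatetime('http://localhost:5000/20171207224241/serv.peterme.net/cross-platform-guis-and-nim-macros.html')

// Referrer is a stylesheet identified by its live-web URI: nothing to match.
const fromCss = extractDatetime('http://serv.peterme.net/styles.css')

// Referrer is missing entirely.
const fromEmpty = extractDatetime('')

console.log(fromPage, fromCss, fromEmpty)
```

Any caller that uses the result without checking for null inherits the problem, which is why a fallback is needed.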

machawk1 commented 6 years ago

ServiceWorker referrer is a URI-M: good

ServiceWorker referrer is a live web URI: bad

The latter does not have a datetime to scrape out with the regex.

Is it possible that the referrer is stored in the header, being propagated to the replay system, then being used as the basis for replay?

ibnesayeed commented 6 years ago

I think I know the reason, but the solution is more involved. I hinted at this issue and a potential solution in the SW paper we published last year at JCDL. The context of cascaded requests is set based on their parent resource. This can be fixed by issuing a fabricated client-side redirect to the resolved URI-M, so that all successive requests are made in the right context.

ibnesayeed commented 6 years ago

The current SW implementation is very rudimentary and does not account for many situations.

ibnesayeed commented 6 years ago

This problem will usually occur in requests that originate not from the main HTML file but from a secondary source, such as an image or font file requested from within a CSS file that is included in the HTML page.

A quick and dirty solution for now would be to store the last known referrerDatetime in localStorage and use that when referrerDatetime is null.

ibnesayeed commented 6 years ago

But the localStorage-based solution (or a global-variable-based solution, for that matter) may cease to work as expected when multiple composite mementos are requested in a non-sequential manner.
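A rough sketch of that fallback and of the failure mode, assuming a module-level variable as the store. The `resolveDatetime` function and the example.com URLs are hypothetical and only serve to show the ordering problem:

```javascript
// Assumption: a module-level variable stands in for localStorage or a global.
let lastKnownDatetime = null

// Hypothetical resolver: prefer the datetime in the referrer, else fall back
// to the last one seen.
function resolveDatetime (referrer) {
  const match = referrer.match(/\/([0-9]{14})\//)
  if (match !== null) {
    lastKnownDatetime = match[1] // remember the most recent root memento
    return match[1]
  }
  return lastKnownDatetime
}

// Page A requests one of its embedded resources.
resolveDatetime('http://localhost:5000/20171207224241/example.com/a.html')

// Page B, a different capture, is requested before A has finished loading.
resolveDatetime('http://localhost:5000/20160101000000/example.com/b.html')

// A font referenced from A's stylesheet now arrives with a live-web referrer,
// so the fallback is used and it returns B's datetime rather than A's.
const resolvedForA = resolveDatetime('http://example.com/styles.css')
console.log(resolvedForA)
```

When pages are loaded strictly one after another, the fallback returns the expected datetime; the interleaved case above is where it goes wrong.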

machawk1 commented 6 years ago

We could check whether the referrer is a URI-M and, if not (as in the case you described, @ibnesayeed), redirect to the /memento/*/URI-R endpoint.

ibnesayeed commented 6 years ago

redirect to the /memento/*/URI-R endpoint.

This will not help, because it will return a list of mementos (if more than one capture is available) without a clue as to which one to pick.

machawk1 commented 6 years ago

If we had the TimeGate endpoint (#105) functioning and we passed the Accept-Datetime of the root memento to the endpoint, we could use that as the basis for date resolution (and thus, URI-M).
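As a sketch of the header side of that idea only: the 14-digit datetime from a URI-M can be converted into the HTTP-date form that Accept-Datetime uses in the Memento protocol (RFC 7089). The `toAcceptDatetime` helper is a hypothetical name, and nothing here assumes how the TimeGate endpoint itself would behave:

```javascript
// Converts a 14-digit datetime (YYYYMMDDhhmmss, UTC) into an HTTP-date string
// suitable for an Accept-Datetime request header.
function toAcceptDatetime (datetime14) {
  const m = datetime14.match(/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/)
  if (m === null) {
    throw new Error('Expected a 14-digit datetime, got: ' + datetime14)
  }
  const [year, month, day, hour, minute, second] = m.slice(1).map(Number)
  // Date.UTC takes a zero-based month.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second)).toUTCString()
}

// Headers that a request to a TimeGate could carry for the root memento.
const headers = { 'Accept-Datetime': toAcceptDatetime('20171207224241') }
console.log(headers)
```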

ibnesayeed commented 6 years ago

If we had the TimeGate endpoint (#105) functioning and we passed the Accept-Datetime of the root memento to the endpoint, we could use that as the basis for date resolution (and thus, URI-M).

If you have the datetime of the root memento, then you don't really need any other endpoint. The problem here is that you don't have access to the datetime of the root memento by the time you make secondary-level requests. As I said, a quick and dirty solution would be to store the root memento's datetime in localStorage or in a global variable, update it when another root memento is requested, and use it when the referrer does not have that info. The problem with this approach is that when a new root memento is requested before all resources of the previous composite memento are loaded, you will end up overwriting the datetime.

machawk1 commented 6 years ago

Does the possibility exist for intercepting the requests from embedded resources using the service worker?

Based on using a very similar method utilizing localStorage in the past, I think it would not be good to go that route.

ibnesayeed commented 6 years ago

For a more robust approach to maintain the same-origin boundary context, read the Methodology section of the Client-side Reconstruction of Composite Mementos Using ServiceWorker paper.

ibnesayeed commented 6 years ago

Does the possibility exist for intercepting the requests from embedded resources using the service worker?

Yes, that's what I am referring to in the paper and have mentioned here a few times in comments already. The idea is to temporarily cache the response of the final URI-M and return a fabricated redirect response to the client. Then, on the subsequent request, return the response from the cache, which will have the right origin context.
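The flow can be sketched roughly as follows. This is a simplified simulation, not the Reconstructive implementation: a Map stands in for the Cache API, plain objects stand in for Response, and `handleRequest` and `fetchMemento` are hypothetical names introduced for illustration:

```javascript
// Stand-in for the Cache API: maps a resolved URI-M to its stored response.
const pendingResponses = new Map()

// Hypothetical handler. `fetchMemento` is an assumed callback that returns
// the archived response for a URI-M; it is not part of any real library.
function handleRequest (requestUrl, resolvedUriM, fetchMemento) {
  if (pendingResponses.has(requestUrl)) {
    // Second pass: the client followed the redirect, so serve from the cache.
    const cached = pendingResponses.get(requestUrl)
    pendingResponses.delete(requestUrl)
    return cached
  }
  // First pass: store the real response and answer with a fabricated redirect.
  pendingResponses.set(resolvedUriM, fetchMemento(resolvedUriM))
  return { status: 302, headers: { Location: resolvedUriM } }
}

const uriR = 'http://serv.peterme.net/styles.css'
const uriM = 'http://localhost:5000/memento/20171207224242/http://serv.peterme.net/styles.css'
const archive = (url) => ({ status: 200, body: 'archived copy of ' + url })

const first = handleRequest(uriR, uriM, archive) // fabricated redirect
const second = handleRequest(first.headers.Location, uriM, archive) // cached response
console.log(first.status, second.status)
```

Because the client ends up at the URI-M, requests triggered by that resource should carry a referrer that contains the datetime.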

machawk1 commented 6 years ago

No code in https://github.com/oduwsdl/reconstructive still. :|

The localStorage solution is not robust. Could you put together an example per the Methodology section that does what we need in actuality and not in theory?

ibnesayeed commented 6 years ago

Perhaps I should populate the Reconstructive repo, which will take care of this issue. Putting together an example of this approach is almost half the work of writing the whole Reconstructive logic. In the interim, you might just want to use a global variable (or localStorage) to mitigate this immediate issue. This approach is far from good, but it will do the trick until I push something more thoughtful to the other repo. I will try to spare some cycles for that tomorrow or over the weekend.

ibnesayeed commented 6 years ago

I think the reconstructive repo is in a usable state now. I have not tested it in complex situations yet, but those can be discovered and fixed later as we encounter them. Some documentation is certainly needed, though. Since ipwb is not Docker-friendly yet, I don't have the environment set up to test it.

ibnesayeed commented 6 years ago

@akavel and @machawk1, you might want to test this again with the latest release. Hopefully it is fixed now that PR #339 has been merged; if not, please report any new findings.

akavel commented 6 years ago

Looks good to me now, thanks! 😄

On first look, there are still some requests in the Firefox console reported as external (non-localhost), but when I take a look at the response details, they show "InterPlanetary Wayback Replay/...", so this makes me feel good and safe now :)

Thanks a lot!!! :) :) :)

machawk1 commented 6 years ago

@akavel Thank you for circling back to this and confirming the fix. 😄