oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
MIT License

Capture of Acid Test page has images missing #523

Open machawk1 opened 6 years ago

machawk1 commented 6 years ago

Images are missing from the page itself, and the Reconstructive logo is missing as well. The WARC was created with a locally built webrecorder, run and recorded using Docker and the webrecorder web interface: temp-20180822005001.warc.gz

ipwb v. 0.2018.08.20.0040 / 45a46dfcac10dcb06cb172eb76bbb00f071b92ca

Using Docker on macOS with a new Chrome instance, hitting the replay home page first to ensure the freshest Service Worker.

> ipfs daemon &
> ipwb index temp-20180822005001.warc.gz | ipwb replay --proxy=localhost:5002

[Screenshot (2018-08-21, 9:22 PM): replay of the Acid Test page with images missing]

machawk1 commented 6 years ago

Interestingly and oddly, opening the browser console (Chrome 68) and reloading the URI-M causes many of the memento's images to be displayed. I am guessing this prevents the SW from rerouting (which I also find to be odd behavior when devtools are open) and the images are pulled from the live web (graaaagh!).

[Screenshot (2018-08-21, 9:31 PM): replay with images appearing after reloading with devtools open]
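To check whether the SW is actually in control when devtools are open, something like the following in the console may help (a diagnostic sketch, not part of ipwb):

// If this prints null, the Reconstructive Service Worker is not controlling
// the page and subresource requests fall through to the live web.
console.log(navigator.serviceWorker.controller);
navigator.serviceWorker.getRegistrations().then(regs => console.log(regs));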

ibnesayeed commented 6 years ago

I was able to reproduce it, but I have yet to investigate why it is happening.

anatoly-scherbakov commented 4 years ago

I can reproduce this too. After my last two PRs,

pip install -U .
ipwb replay QmReQCtRpmEhdWZVLhoE3e8bqreD8G3avGpVfcLD7r4K6W

shows a webpage http://localhost:5000/memento/20200418050735/xn--80aesfpebagmfblc0a.xn--p1ai/ with no images.

The direct link to an image, http://localhost:5000/memento/20200418050742/https://dalee.cdnvideo.ru/stopcoronavirus.rf/img/logo.svg, works perfectly.

Chrome developer tools report:

logo.svg:1 Failed to load resource: net::ERR_FAILED
map.svg:1 Failed to load resource: net::ERR_FAILED
:5000/memento/20200418050735/http://fonts.googleapis.com/css?family=Roboto:400,500,700,900&display=swap:1 Failed to load resource: net::ERR_FAILED
...

Each of these resources loads successfully when requested from a separate browser tab.
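One way to narrow this down (a hypothetical diagnostic from the page's devtools console, not part of ipwb) is to issue the same request from within the page, so it passes through the Service Worker, and compare it with the direct tab load:

// Same-origin request routed through the SW; compare with opening the URL in a new tab.
fetch('/memento/20200418050742/https://dalee.cdnvideo.ru/stopcoronavirus.rf/img/logo.svg')
  .then(r => console.log(r.status, r.headers.get('content-type')))
  .catch(e => console.log('failed via SW:', e));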

After Ctrl + F5, everything on the page is displayed. But the reason is that no links are actually rewritten: the resources are being fetched from the live Web rather than from the archived data on IPFS.

Have you considered actually rewriting the links in the page's code? That would allow replaying the archive in any browser with an IPFS extension, with no need for a special script.

ibnesayeed commented 4 years ago

Have you considered actually rewriting the links in the page's code? That would allow replaying the archive in any browser with an IPFS extension, with no need for a special script.

We rely on rerouting when possible, but in some situations we do perform rewrites. Rewriting is not easy, especially for resources dynamically injected by JavaScript. To see how this whole thing works, you may want to check the underlying Reconstructive system, which has well-documented code and research publications available.
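For illustration, here is a minimal sketch of the rerouting idea in a Service Worker; this is not the actual Reconstructive code, which handles many more cases, and the URL shapes and datetime extraction are simplified assumptions:

// Sketch only: reroute live-web subresource requests back into the archive.
self.addEventListener('fetch', (event) => {
  const url = new URL(event.request.url);
  if (!url.pathname.startsWith('/memento/')) {
    // Derive the memento datetime from the referring page (simplified).
    const match = (event.request.referrer || '').match(/\/memento\/(\d{14})\//);
    if (match) {
      const rerouted = `${self.location.origin}/memento/${match[1]}/${url.href}`;
      event.respondWith(fetch(rerouted));
    }
  }
});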

anatoly-scherbakov commented 4 years ago

Agreed about the difficulty of rewrites, especially when JS is constructing the page, but maybe there could be a command-line flag to control whether or not to rewrite when indexing?
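For example (a purely hypothetical invocation; no such flag exists in ipwb today):

> ipwb index --rewrite-links temp-20180822005001.warc.gz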

ibnesayeed commented 4 years ago

but maybe there could be a command-line flag to control whether or not to rewrite when indexing?

Rewriting at indexing time would be a lossy action, and no better options would be available then than the ones we can apply at replay/reconstruction time. In fact, resources shared across various pages would raise further issues. Let me try to enumerate just a few of them:
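One concrete instance of the shared-resource problem (the URLs and datetimes below are made up for illustration): the same stylesheet may be referenced by pages archived at different times, and the correct memento datetime is only known at replay time.

Page A (archived 2018-08-22) needs http://example.com/style.css
  rewritten to /memento/20180822005001/http://example.com/style.css
Page B (archived 2019-01-15) needs the same http://example.com/style.css
  rewritten to /memento/20190115120000/http://example.com/style.css

Baking either rewrite into the stored markup at index time would break the other page.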

There are archives that rewrite in advance, such as archive.today, but they are transactional in nature: they archive pages on demand rather than crawling and maintaining a frontier queue, and they completely flatten the page markup into its post-rendered state, removing any JS. We do not want to go that route.

anatoly-scherbakov commented 4 years ago

This is logical. Thank you for the explanation; it seems that using a replay tool, instead of just opening an IPFS hash in a browser, is not really avoidable.

ibnesayeed commented 4 years ago

This is logical. Thank you for the explanation; it seems that using a replay tool, instead of just opening an IPFS hash in a browser, is not really avoidable.

It is possible to move a lot of the server logic to the client and have the server only provide index records to the Reconstructive Service Worker, which can have the logic to fetch data from IPFS directly and serve synthetic HTTP responses to the main window. The replay system could then work by talking to a generic archive index API and act as an in-browser server for the rest of the logic. We discussed this idea before, but never pursued it.
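A sketch of that idea follows; the endpoint name and response shape are assumptions for illustration, not an existing ipwb API. The Service Worker asks an index API for the CID of the archived response, fetches the payload from an IPFS gateway, and synthesizes the HTTP response itself:

// Sketch only: client-side replay backed by a generic index API.
self.addEventListener('fetch', (event) => {
  event.respondWith((async () => {
    // Hypothetical index API mapping a requested URI to payload CID and MIME type.
    const lookup = await fetch('/api/cdxj?url=' + encodeURIComponent(event.request.url));
    const { payloadCid, mime } = await lookup.json();
    // Fetch the payload directly from a local IPFS gateway (default port 8080).
    const payload = await fetch('http://localhost:8080/ipfs/' + payloadCid);
    return new Response(await payload.blob(), {
      status: 200,
      headers: { 'Content-Type': mime },
    });
  })());
});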

anatoly-scherbakov commented 4 years ago

That's interesting. For starters, I believe production usage could be facilitated by IPWB gateways that let you provide the QmHash of an archive and play it back. That's mainly why I started talking about backends: having a Redis DB backing this server sounds reasonable.

My interest here is as follows. Imagine you're parsing a publicly available website, say a government page, with the intent of extracting a machine-readable dataset and putting it onto IPFS. This is a fairly simple task, and I've already done it: I am putting a JSON object (or a DAG) of JSON-encoded information on the net.

However, people who use that dataset make a decision to trust you. They have to trust that you did not tamper with the data while publishing it.

What can you do to prove your innocence? A very simple thing: you archive the webpage you were parsing, and it is important to do as much as you can to preserve it in its original state, with JS, CSS, and images, because the people who are going to question you are not necessarily tech-savvy enough to read naked HTML.

You archive the page and it gets pinned to multiple nodes. You open-source your code. Whatever. And you tag every version of the dataset with an IPFS link to the archive of the original page it was derived from. That's the use case I am trying to pursue.
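For instance, each dataset version could carry a provenance record along these lines (the field names and placeholder CIDs are made up; the archive CID is the one from my earlier comment), stored as an IPLD node with ipfs dag put:

// Hypothetical provenance record linking a dataset version to its source archive.
const provenance = {
  dataset: { '/': 'QmDatasetCid...' },   // CID of the extracted machine-readable data
  sourceArchive: { '/': 'QmReQCtRpmEhdWZVLhoE3e8bqreD8G3avGpVfcLD7r4K6W' }, // ipwb index of the page
  extractedAt: '2020-04-18T05:07:35Z',
  extractor: { '/': 'QmCodeCid...' },    // CID of the open-sourced extraction code
};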

ibnesayeed commented 4 years ago

You are after archival fixity, which is an as-yet unsolved problem, though there have been a few recent attempts at solving it. I co-authored such a paper last year (preprint copy available at https://arxiv.org/pdf/1905.12565.pdf). Additionally, there is some work on this topic using blockchain (e.g., https://www.gipp.com/wp-content/papercite-data/pdf/wortner2019.pdf). Our colleague @maturban's research focus is archival fixity; you may want to read one of his blog posts describing this challenge (https://ws-dl.blogspot.com/2017/12/2017-12-11-difficulties-in-timestamping.html) along with his other publications.

What might more elegantly solve the issues you are after is something we proposed in a position paper (https://arxiv.org/pdf/1906.07104.pdf) that revolves around the new Web Packaging specification (also known as Web Bundles), but it currently has missing pieces that keep it from being fully suitable for archival use cases.

anatoly-scherbakov commented 4 years ago

@ibnesayeed thanks! These materials seem very relevant. I will read through them.