ipfs / boxo

A set of reference libraries for building IPFS applications and implementations in Go.
https://github.com/ipfs/boxo#readme

Gateway: WARC and WACZ archive replay #524

Closed hacdias closed 8 months ago

hacdias commented 8 months ago

This is mostly food for thought, and perhaps Boxo is not the right place for this.

What

When opening a WARC or a WACZ archive through the gateway, we should be able to directly replay its contents, instead of just downloading it. Therefore, it would be part of the "trusted" gateway. Some interesting links:

How

I see two main ways:

  1. Either implement our own WARC/WACZ replaying website
  2. Or somehow integrate ReplayWeb.page into our gateway for this kind of file: https://replayweb.page/docs/embedding

Jorropo commented 8 months ago

Is this similar to MP4 or WebM, where we need to update MIME type detection and send the right header, or does this need other server-side logic?
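For illustration, the header-only approach raised in this question could look roughly like the sketch below, which registers extension-to-media-type mappings with Go's standard `mime` package. The helper names (`registerArchiveTypes`, `contentTypeFor`) are hypothetical, and the media type strings are assumptions that should be checked against the WARC and WACZ specifications. This is not how Boxo is known to handle these files.

```go
package main

import (
	"fmt"
	"mime"
	"path"
)

// archiveTypes maps archive file extensions to media types.
// Assumption: these strings are illustrative and should be verified
// against the WARC and WACZ specifications before use.
var archiveTypes = map[string]string{
	".warc": "application/warc",
	".wacz": "application/wacz+zip",
}

// registerArchiveTypes (hypothetical helper) adds the mappings above to
// the process-wide extension table used by mime.TypeByExtension.
func registerArchiveTypes() error {
	for ext, typ := range archiveTypes {
		if err := mime.AddExtensionType(ext, typ); err != nil {
			return err
		}
	}
	return nil
}

// contentTypeFor (hypothetical helper) returns the media type for a file
// name based on its extension, or "" when the extension is unknown.
func contentTypeFor(name string) string {
	return mime.TypeByExtension(path.Ext(name))
}

func main() {
	if err := registerArchiveTypes(); err != nil {
		panic(err)
	}
	fmt.Println(contentTypeFor("webarchive.wacz"))
}
```

A correct Content-Type header alone only tells the browser what the file is. Replay still needs a client-side reader.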

lidel commented 8 months ago

My understanding is that this is about an HTML+JS reader being returned for a specific content type.

The way I see it, this is already possible if you publish the HTML replayer along with the WACZ.

WebRecorder calls this "self-hosting" in the docs you linked. This is what https://webrecorder.github.io/save-tweet-now/ does :-)

Demo: https://bafybeifeg4nr2gdxw4ees7wta2qdg6ngycuzms3iudtjvrwbmf7nrdpu7y.ipfs.w3s.link/

(It takes a while to announce a new CID, but you can get the data out of web3storage instantly via https://w3s.link/ipfs/bafybeifeg4nr2gdxw4ees7wta2qdg6ngycuzms3iudtjvrwbmf7nrdpu7y?format=car, import it to a local node, and then open it via the localhost subdomain gateway. That will work fine.)

$ ipfs ls bafybeifeg4nr2gdxw4ees7wta2qdg6ngycuzms3iudtjvrwbmf7nrdpu7y
bafybeihg4khv4s4hjt6kko2c4gxi5n3mjqexj3jblllo2fsf4kvv366gmq 31076   favicon.ico
bafybeiekl2fjylgfa5zaadjl6gev63kybtv6eei22te6fecp3nitx6wbtq 793     index.html
bafybeicyiq4443ymeqhoqinwsjg6r4crhemhvflpyozyyo5nnvmfr6uxoe -       replay/
bafybeiepzlhosc52ehpubmmmcsg6edsypvmozvfegjcni5ivceljmmzdpa 474493  ui.js
bafybeifgvpnjrr5zpeyx46qhqn7u35fhs4srto46xvn3py35iw2jui3nay -       webarchive/
bafybeiho5npz7tetl24pefenmuviggyucrj6qerry3whmqu2w6zruemas4 1540637 webarchive.wacz

So it works already. It is just a matter of putting the .wacz in a directory with an index.html replayer.

And I think that is enough.

We don't want to be responsible for deciding which replayer version should run when the WACZ publisher did not specify one.

Self-hosting (shipping an index.html replayer along with the .wacz archive) makes more sense:

@hacdias I think this means we can close this (bundling a replayer with the WACZ already works, and bundling one with Boxo is out of scope)?

The only potential UX improvement that comes to mind is to include some help text next to .wacz files in generated HTML directory listings when there is no index.html, hinting at embedding#self-hosting.
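The check behind such a hint could be as small as the sketch below: given the entry names of a directory, return the .wacz entries that deserve a hint, but only when no index.html is present. The function name `waczEntriesNeedingHint` is hypothetical and this is not existing Boxo code.

```go
package main

import (
	"fmt"
	"strings"
)

// waczEntriesNeedingHint (hypothetical helper) returns the names of .wacz
// entries in a directory listing, or nil when the directory already ships
// an index.html and would therefore not be rendered as a listing.
func waczEntriesNeedingHint(names []string) []string {
	var waczs []string
	for _, name := range names {
		if name == "index.html" {
			return nil
		}
		if strings.HasSuffix(strings.ToLower(name), ".wacz") {
			waczs = append(waczs, name)
		}
	}
	return waczs
}

func main() {
	listing := []string{"favicon.ico", "webarchive.wacz", "notes.txt"}
	for _, name := range waczEntriesNeedingHint(listing) {
		fmt.Printf("%s: add an index.html replayer to view this archive\n", name)
	}
}
```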

hacdias commented 8 months ago

@lidel you're right, I got carried away by my emotions on this one 😆 and immediately opened an issue. If you use their browser extension, you can configure it to connect to your IPFS node. From there, it is able to "share" to IPFS and get a CID back. The content behind that CID will already include the reader itself, as well as the WARC file.

Therefore, I think we can close this.