Replay arbitrary WARCs through subdomain/subpath CID inclusion

ShadowJonathan commented 1 year ago

When looking at this project, I saw that dynamically linking to an "archive" of a website via URLs, if/after I set up IPWB on a subdomain or a website, is not really possible, as IPWB sets itself up to serve only a single archive.

However, I want to use IPWB in combination with some scraping/archiving tools, and simply be able to point IPWB to a CID (IPFS hash) of the WARC file, and have it figure it out.

I'd like to be able to do something like the following:

https://ipwb.example.net/QmXXX
https://baxxx.ipwb.example.net/

Then, it'd figure out the page it's fetching via the CID (timeouts are expected if the file isn't readily accessible), and serve that to the end user.
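A minimal sketch of how the two URL shapes above could be mapped to a CID, assuming a hypothetical `extract_cid` helper and `ipwb.example.net` as the base domain (neither is part of ipwb today). Hostnames are case-insensitive, so the subdomain form is assumed to carry a base32 CIDv1, while the subpath form can also carry a case-sensitive CIDv0. The patterns are loose shape checks, not full CID validation.

```python
import re
from typing import Optional

# CIDv0: base58btc, 46 characters, always starts with "Qm" (case-sensitive).
CIDV0 = re.compile(r"^Qm[1-9A-HJ-NP-Za-km-z]{44}$")
# CIDv1 in base32: lowercase, starts with "b". Length varies with the hash,
# so this is only a loose shape check (an assumption, not full validation).
CIDV1_B32 = re.compile(r"^b[a-z2-7]{20,}$")


def extract_cid(host: str, path: str,
                base_domain: str = "ipwb.example.net") -> Optional[str]:
    """Return the CID named by the subdomain or the first path segment."""
    hostname = host.split(":")[0].lower()  # drop any port, normalize case
    suffix = "." + base_domain
    if hostname.endswith(suffix):
        label = hostname[: -len(suffix)]
        if CIDV1_B32.match(label):
            return label
    first_segment = path.lstrip("/").split("/", 1)[0]
    if CIDV0.match(first_segment) or CIDV1_B32.match(first_segment):
        return first_segment
    return None
```

A request for `https://ipwb.example.net/QmXXX` would take the subpath branch, and one for `https://baxxx.ipwb.example.net/` the subdomain branch; anything that matches neither shape returns `None` and could fall through to the regular replay routes.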

My primary use case, as stated above, is simply to link from a "hash list" to my IPWB site to show my archive of a particular site, or any archive anyone else found on the web. This would be an "IPWB archive reader" mode that automatically fetches, unpacks, and displays those files to anyone who requests them.

machawk1 commented 1 year ago

@ShadowJonathan Interesting use case. A concern might be allowing just anyone to hit the https://ipwb.example.net/QmXXX endpoint to add data for your system. Some authentication procedure for your own instance would mitigate this.

The functionality is somewhat hidden behind the /ipwbadmin endpoint but we do support adding WARCs at runtime from ipwb's web interface. Being able to specify a CID in lieu of a local WARC file, especially if sent with the auth headers to ensure that the requestor is allowed to add data to the system like this, would satisfy your use case.

Any comments here, @ibnesayeed?

Some tasks:

[ ] Allow a CID of a WARC file to be specified in the webadmin interface to index WARC files on IPFS at runtime.
[ ] Set up an authentication process and an API to allow a WARC payload, WARC path, or CID to be used for adding new data at runtime.
[ ] Allow an option for the API to be used without authenticating, as described in @ShadowJonathan's use case below.

ShadowJonathan commented 1 year ago

For my specific use case, I'd wanna waive the requirement for authentication, as I'm explicitly planning for such an instance to also be an "open terminal".

If anything, I'd then only want a whitelist, or a coupling with the option on the IPFS node to not fetch new data (I vaguely remember that being an option somewhere). Authentication can then be enforced elsewhere, such as on the IPFS node itself, for adding data via the private API.
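A rough sketch of how the whitelist and the optional authentication could sit side by side, assuming a hypothetical `may_replay` check with illustrative parameter names (this is not an existing ipwb function). Leaving both `allowlist` and `required_token` unset gives the "open terminal" behaviour.

```python
import hmac
from typing import Optional, Set


def may_replay(cid: str,
               allowlist: Optional[Set[str]] = None,
               required_token: Optional[str] = None,
               presented_token: Optional[str] = None) -> bool:
    """Decide whether a request for `cid` should be served."""
    # Auth is only enforced when the operator configured a token.
    if required_token is not None:
        if presented_token is None:
            return False
        # Constant-time comparison to avoid leaking the token via timing.
        if not hmac.compare_digest(required_token, presented_token):
            return False
    # The allowlist is only enforced when the operator configured one.
    if allowlist is not None and cid not in allowlist:
        return False
    return True
```

With only an allowlist set, anyone can replay the listed CIDs and nothing else; with neither set, every CID is passed on to the IPFS node, which can still refuse to fetch data it does not already hold.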

machawk1 commented 1 year ago

@ShadowJonathan I understand, and using no auth should be an option as this gets implemented. I could foresee the scenario also being useful on an ipwb hosted on a LAN or an otherwise private instance (e.g., your laptop).

Can you clarify in your use case how you envision specifying a CID and not fetching new data? Would the assumption be that the data is already available in your local IPFS node?

ShadowJonathan commented 1 year ago

Yes, and that other requests would just fail with 404 or a timeout, depending on how the local IPFS node is being hailed by IPWB.
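To illustrate the failure modes, here is a minimal sketch that turns the outcome of a local lookup into an HTTP status. The `fetch` callable is a stand-in for however IPWB ends up asking the local IPFS node for a CID, and the convention that it raises `KeyError` for data the node does not hold is an assumption made only for this example, as is the choice of 504 for the timeout case.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Tuple


def status_for_fetch(fetch: Callable[[str], bytes], cid: str,
                     timeout_s: float = 5.0) -> Tuple[int, bytes]:
    """Run `fetch(cid)` with a deadline and map the outcome to a status."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch, cid)
    try:
        data = future.result(timeout=timeout_s)
    except FutureTimeout:
        return 504, b""  # the node did not answer in time
    except KeyError:
        return 404, b""  # assumed signal: node does not hold this CID
    finally:
        # Do not block the response on a lookup that is still running.
        pool.shutdown(wait=False)
    return 200, data
```

The worker thread is not cancelled on timeout, so a production version would also want a deadline on the underlying IPFS request itself.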

oduwsdl / ipwb

Replay arbitrary WARCs through subdomain/subpath CID inclusion #790