openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

Publish ZIM files to IPFS #606

Open kelson42 opened 3 years ago

kelson42 commented 3 years ago

We are moving forward with using IPFS as (part of) our distribution strategy.

Therefore we need to publish (part of or all of) our ZIM files on IPFS.

Uploading ZIM files is usually the duty of the Zimfarm, with the "CMS" then being in charge of deciding whether or not to put them in the official published library.

The point here is that there is a difficulty because:

Therefore I believe the zimfarm-receiver Docker container, which runs on download.kiwix.org, should somehow ask the IPFS-node Docker container to read/share the new ZIM file and then return the IPFS URL.

Maybe we could arrange for the ZIM file to be hashed on the zimfarm-receiver's demand, and then put online (shared within the DHT network) once the CMS has decided to publish the file.
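
A minimal sketch of that two-phase flow, assuming go-ipfs on the host (the file path and gateway host are placeholders):

```sh
# Phase 1: hash only; nothing is stored or announced yet.
CID=$(ipfs add --only-hash --quieter --cid-version=1 /path/to/new.zim)

# Phase 2: once the CMS decides to publish, reference the file in place
# (no data duplication, requires the filestore experiment) and start
# providing it on the DHT.
ipfs add --nocopy --quieter --cid-version=1 /path/to/new.zim

# URL the zimfarm could record (gateway host is only an example):
echo "https://ipfs.io/ipfs/$CID"
```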

kelson42 commented 3 years ago

@lidel Could you please explain briefly how the zimfarm-receiver Docker container could interact with the IPFS node (to be installed on download.kiwix.org) to get the file published on IPFS?

rgaudin commented 3 years ago

Just to be sure: there is no uploading to IPFS per se. We'd run an IPFS node on download.kiwix.org, which is currently our storage repository.

On that server, we'd need a piece of software to decide whether to expose/publish ZIM files on IPFS. This decision would probably be based on the zimcheck output, I think.

So what we need is this decision tool, and we'd need it to report back to the Zimfarm the hash/URL for files. The main difficulty is retrieving the TaskID of a file, as this is a concept we don't have on the Zimfarm.
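
A hypothetical shape for that decision tool, as a sketch only: zimcheck and ipfs are real tools, but the report-back endpoint, the payload, and the assumption that zimcheck's exit status signals pass/fail are all invented here:

```sh
#!/bin/sh
# Sketch of a publish-decision script; endpoint and payload are made up.
zim="$1"

if zimcheck "$zim"; then   # assumes a non-zero exit status on failed checks
  cid=$(ipfs add --nocopy --quieter --cid-version=1 "$zim")
  # Report the CID back to the zimfarm (hypothetical API).
  curl -X POST "https://api.farm.openzim.org/v1/files/cid" \
    -H "Content-Type: application/json" \
    -d "{\"name\":\"$(basename "$zim")\",\"cid\":\"$cid\"}"
fi
```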

kelson42 commented 3 years ago

@rgaudin The Zimfarm should not have the duty of deciding whether to publish, but I would like the Zimfarm to know and share all the URLs for accessing the content. I therefore believe that everything coming out of the Zimfarm should be published on the local IPFS node.

rgaudin commented 3 years ago

OK, makes sense. We do have URLs for stuff that is not published.

lidel commented 3 years ago

So I believe you want to run go-ipfs somewhere. It works fine in Docker (docs) if that makes things easier, but it's just a single binary, so whatever works best for you.
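
For reference, a typical invocation of the official Docker image (the volume paths here are examples, not the actual layout on download.kiwix.org):

```sh
# Swarm port 4001 is public; the API port 5001 is kept local-only.
docker run -d --name ipfs-node \
  -v /srv/ipfs:/data/ipfs \
  -v /data/download:/export \
  -p 4001:4001 \
  -p 127.0.0.1:5001:5001 \
  ipfs/go-ipfs:latest
```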

Given the size of ZIMs, importing data into the IPFS datastore (flatfs/badger) is not an option, as we want to avoid duplication. Luckily, there are two ways of avoiding data duplication while still publishing data over the IPFS network:

@rgaudin In both cases ipfs add should be easy to wire up with any ZIM detection scripts you come up with; see the sketch below.
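
A minimal sketch of such wiring, assuming the filestore (--nocopy) approach and a placeholder directory:

```sh
# --nocopy relies on the filestore experiment; enable it once:
ipfs config --json Experimental.FilestoreEnabled true

# Example wiring: publish every ZIM found under the download tree.
find /data/download/zim -name '*.zim' | while read -r zim; do
  ipfs add --nocopy --quieter --cid-version=1 "$zim"
done
```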

Let me know if you have any questions/issues.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

lidel commented 3 years ago

Anything I could do to help with this?

Support for remote pinning is built-in since go-ipfs 0.8.0, so I think the farm could more or less do:

  1. ipfs add --nocopy --cid-version=1 /path/to/foo.zim → will produce {cid}
  2. ipfs pin remote service add someService <endpoint-url> secret
  3. ipfs pin remote add --service=someService --name=foo.zim {cid}

Step (3) will wait until the entire ZIM is pinned remotely, so once it's done we no longer need to run IPFS on the farm.
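
If blocking on step (3) is not desirable, a small variant (same service/name as above) queues the pin and polls its status instead:

```sh
# Queue the pin without waiting, then check on it later.
ipfs pin remote add --service=someService --name=foo.zim --background {cid}
ipfs pin remote ls --service=someService --name=foo.zim \
  --status=queued,pinning,pinned,failed
```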

Figuring out the service for (2) is probably the first step here (we could use some sponsorship to pay for https://pinata.cloud/ or use something else – TBD).

kelson42 commented 3 years ago

@lidel We are busy with other urgent topics, sorry for the lack of reaction on our side. I just read from Jonathan that https://estuary.tech/ might help us with the pinning problem.

kelson42 commented 3 years ago

Update regarding estuary.tech by @lidel: this is a new option which does not require us to run an IPFS node at all. We just do an HTTP POST to https://docs.estuary.tech/api-content-add and the Estuary service takes care of everything (IPFS for a hot CDN cache + Filecoin for long-term storage) and returns a CID (content identifier), which can then be used for fetching the content from IPFS (either natively, or via any of the public gateways).

That would solve half of our problem, the other half being a native IPFS client lib... and this should be pretty easy to implement.
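
As I understand the Estuary docs linked above, the upload boils down to a single authenticated multipart POST (the token is a placeholder):

```sh
# Upload a ZIM to Estuary; the response JSON contains a "cid" field that
# can be fetched from any IPFS gateway, e.g. https://ipfs.io/ipfs/<cid>
curl -X POST https://api.estuary.tech/content/add \
  -H "Authorization: Bearer $ESTUARY_API_TOKEN" \
  -F "data=@/path/to/foo.zim"
```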

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

lidel commented 2 years ago

@kelson42 I was not able to block time for this during the hackweek, but I did make some progress this weekend. I started fleshing out details for the self-hosted strategy ((A) from https://github.com/lidel/zim2ipfs/issues/1), mostly reading about the zimfarm and figuring out the necessary building blocks.

My initial notes and questions are in https://github.com/lidel/zim2ipfs/pull/2 (here is a readable version).

The gist is that I need some guidance / a sanity check on the best way to plug IPFS publishing into the existing infrastructure, and on how to track which ZIMs have a CID and which ones do not – see the questions at the end; I would appreciate feedback in your spare time.

rgaudin commented 2 years ago

@lidel Thank you for this documentation; it's very helpful and allowed me to get up to where you are on the subject. I don't have any answers ATM, but we'll discuss it with @kelson42, as this receiver piece is supposed to change as well. Thank you for laying out the options.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.