openzim / zimfarm

Farm operated by bots to grow and harvest new zim files
https://farm.openzim.org
GNU General Public License v3.0

Publish ZIM files to IPFS #606

Open kelson42 opened 3 years ago

kelson42 commented 3 years ago

We are moving forward with using IPFS as (part of) our distribution strategy.

Therefore we need to publish (part of or all of) our ZIM files on IPFS.

Uploading ZIM files is usually the duty of the Zimfarm, with the "CMS" then being in charge of deciding whether or not to put them in the official published library.

The point here is that there is a difficulty because:

Therefore I believe the zimfarm-receiver Docker container, which runs on download.kiwix.org, should somehow ask the IPFS-node Docker container to read/share the new ZIM file and then return the IPFS URL.

Maybe we could arrange for the ZIM file to be hashed on the zimfarm-receiver's demand, and then put online (shared within the DHT network) once the CMS has decided to publish the file.
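
A minimal sketch of that two-phase flow, assuming go-ipfs on the host (the file path and gateway host are placeholders):

```sh
# Phase 1: hash only; nothing is stored or announced yet.
CID=$(ipfs add --only-hash --quieter --cid-version=1 /path/to/new.zim)

# Phase 2: once the CMS decides to publish, reference the file in place
# (no data duplication, requires the filestore experiment) and start
# providing it on the DHT.
ipfs add --nocopy --quieter --cid-version=1 /path/to/new.zim

# URL the zimfarm could record (gateway host is only an example):
echo "https://ipfs.io/ipfs/$CID"
```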

kelson42 commented 3 years ago

@lidel Could you please explain briefly how the zimfarm-receiver Docker container could interact with the IPFS node (to be installed on download.kiwix.org) to get the file published on IPFS?

rgaudin commented 3 years ago

Just to be sure: there is no uploading to IPFS per se. We'd run an IPFS node on download.kiwix.org, which is currently our storage repository.

On that server, we'd need a piece of software to decide whether to expose/publish ZIM files on IPFS. This decision would probably be based on the zimcheck output, I think.

So what we need is this decision tool, and we'd need it to report back to the Zimfarm the hash/URL for files. The main difficulty is retrieving the TaskID of a file, as this is a concept we don't have on the Zimfarm.
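
A hypothetical shape for that decision tool, as a sketch only: zimcheck and ipfs are real tools, but the report-back endpoint, the payload, and the assumption that zimcheck's exit status signals pass/fail are all invented here:

```sh
#!/bin/sh
# Sketch of a publish-decision script; endpoint and payload are made up.
zim="$1"

if zimcheck "$zim"; then   # assumes a non-zero exit status on failed checks
  cid=$(ipfs add --nocopy --quieter --cid-version=1 "$zim")
  # Report the CID back to the zimfarm (hypothetical API).
  curl -X POST "https://api.farm.openzim.org/v1/files/cid" \
    -H "Content-Type: application/json" \
    -d "{\"name\":\"$(basename "$zim")\",\"cid\":\"$cid\"}"
fi
```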

kelson42 commented 3 years ago

@rgaudin The Zimfarm should not have the duty of deciding whether to publish, but I would like the Zimfarm to know and share all the URLs for accessing the content. I therefore believe that everything coming out of the Zimfarm should be published on the local IPFS node.

rgaudin commented 3 years ago

OK, makes sense. We do have URLs for stuff that is not published.

lidel commented 3 years ago

So I believe you want to run go-ipfs somewhere. It works fine in Docker (docs) if that makes things easier, but it's just a single binary, so whatever works best for you.
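
For reference, a typical invocation of the official Docker image (the volume paths here are examples, not the actual layout on download.kiwix.org):

```sh
# Swarm port 4001 is public; the API port 5001 is kept local-only.
docker run -d --name ipfs-node \
  -v /srv/ipfs:/data/ipfs \
  -v /data/download:/export \
  -p 4001:4001 \
  -p 127.0.0.1:5001:5001 \
  ipfs/go-ipfs:latest
```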

Given the size of ZIMs, importing data into the IPFS datastore (flatfs/badger) is not an option, as we want to avoid duplication. Luckily, there are two ways of avoiding data duplication while still publishing data over the IPFS network:

@rgaudin In both cases ipfs add should be easy to wire up with any ZIM detection scripts you come up with; see the sketch below.
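
A minimal sketch of such wiring, assuming the filestore (--nocopy) approach and a placeholder directory:

```sh
# --nocopy relies on the filestore experiment; enable it once:
ipfs config --json Experimental.FilestoreEnabled true

# Example wiring: publish every ZIM found under the download tree.
find /data/download/zim -name '*.zim' | while read -r zim; do
  ipfs add --nocopy --quieter --cid-version=1 "$zim"
done
```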

Let me know if you have any questions/issues.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

lidel commented 3 years ago

Anything I could do to help with this?

Support for remote pinning is built-in since go-ipfs 0.8.0, so I think the farm could more or less do:

  1. ipfs add --nocopy --cid-version=1 /path/to/foo.zim → will produce {cid}
  2. ipfs pin remote service add someService <endpoint-url> secret
  3. ipfs pin remote add --service=someService --name=foo.zim {cid}

Step (3) will wait until the entire ZIM is pinned remotely, so once it's done we no longer need to run IPFS on the farm.
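
If blocking on step (3) is not desirable, a small variant (same service/name as above) queues the pin and polls its status instead:

```sh
# Queue the pin without waiting, then check on it later.
ipfs pin remote add --service=someService --name=foo.zim --background {cid}
ipfs pin remote ls --service=someService --name=foo.zim \
  --status=queued,pinning,pinned,failed
```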

Figuring out the service for (2) is probably the first step here (we could use some sponsorship to pay for https://pinata.cloud/ or use something else – TBD).

kelson42 commented 3 years ago

@lidel We are busy with other urgent topics, sorry for the lack of reaction on our side. I just read from Jonathan that https://estuary.tech/ might help us with the pinning problem.

kelson42 commented 3 years ago

Update regarding estuary.tech by @lidel: this is a new option which does not require us to run an IPFS node at all. We just do an HTTP POST to https://docs.estuary.tech/api-content-add and the Estuary service takes care of everything (IPFS for a hot CDN cache + Filecoin for long-term storage) and returns a CID (content identifier), which can then be used for fetching the content from IPFS (either natively, or via any of the public gateways).

That would solve half of our problem, the other half being a native IPFS client lib... and this should be pretty easy to implement.
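
As I understand the Estuary docs linked above, the upload boils down to a single authenticated multipart POST (the token is a placeholder):

```sh
# Upload a ZIM to Estuary; the response JSON contains a "cid" field that
# can be fetched from any IPFS gateway, e.g. https://ipfs.io/ipfs/<cid>
curl -X POST https://api.estuary.tech/content/add \
  -H "Authorization: Bearer $ESTUARY_API_TOKEN" \
  -F "data=@/path/to/foo.zim"
```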

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

lidel commented 2 years ago

@kelson42 I was not able to block time for this during the hackweek, but I did make some progress this weekend. I started fleshing out details for the self-hosted strategy ((A) from https://github.com/lidel/zim2ipfs/issues/1), mostly reading about the zimfarm and figuring out the necessary building blocks.

My initial notes and questions are in https://github.com/lidel/zim2ipfs/pull/2 (here is a readable version).

The gist is that I need some guidance / a sanity check on the best way to plug IPFS publishing into the existing infrastructure, and on how to track which ZIMs have a CID and which ones do not – see the questions at the end; I would appreciate feedback in your spare time.

rgaudin commented 2 years ago

@lidel Thank you for this documentation; it's very helpful and allowed me to get up to where you are on the subject. I don't have any answers ATM, but we'll discuss it with @kelson42, as this receiver piece is supposed to change as well. Thank you for laying out the options.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.