kelson42 opened this issue 3 years ago
@lidel Could you please explain briefly how the zimfarm-receiver Docker container could interact with the IPFS node (to be installed on download.kiwix.org) to get the file published on IPFS?
Just to be sure: there is no uploading to IPFS per se. We'd run an IPFS node on download.kiwix.org, which is currently our storage repository.
On that server, we'd need a piece of software to decide whether to expose/publish ZIM files on IPFS. This decision would probably be based on the zimcheck output.
So what we need is this decision tool, and we'd need it to report back to the zimfarm with the hash/URL for each file. The main difficulty is retrieving the TaskID of a file, as this is a concept we don't have on the zimfarm.
@rgaudin Considering that the zimfarm should not have the duty of deciding whether or not to publish, but that I would like the zimfarm to know and share the information about all the URLs to access the content, I believe that everything coming out of the zimfarm should be published on the local IPFS node.
OK, makes sense. We do have URLs for content that is not published.
So I believe you want to run go-ipfs somewhere. It works fine in Docker (docs) if that makes things easier, but it's just a single binary, so whatever works best for you.
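If Docker is the easier route, a minimal sketch of running the node could look like the following; the host paths, container name and port bindings are illustrative assumptions rather than a recommended production setup:

```bash
# Sketch: run go-ipfs in Docker with the ZIM directory mounted read-only.
# /data/zims, /data/ipfs and the container name "ipfs-node" are placeholders.
# Ports: 4001 = swarm (public), 5001 = API (local only), 8080 = gateway (local only).
docker run -d --name ipfs-node \
  -v /data/zims:/zims:ro \
  -v /data/ipfs:/data/ipfs \
  -p 4001:4001 \
  -p 127.0.0.1:5001:5001 \
  -p 127.0.0.1:8080:8080 \
  ipfs/go-ipfs:latest
```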
Given the size of ZIMs, importing data into the IPFS datastore (flatfs/badger) is not an option, as we want to avoid duplication. Luckily, there are two ways of avoiding data duplication while still publishing data over the IPFS network:
Filestore: `ipfs config --json Experimental.FilestoreEnabled true` → restart the ipfs daemon, then: `ipfs add --nocopy --cid-version=1 /path/to/foo.zim`
- allows files to be added without duplicating the space they take up on disk
- requires the IPFS node to have access to the file over the local filesystem
- more efficient than the urlstore below (a small verification sketch follows after the two options)
Urlstore: `ipfs config --json Experimental.UrlstoreEnabled true` → restart the ipfs daemon, then: `ipfs add --nocopy --cid-version=1 http://192.168.0.42/path/to/foo.zim`
- allows ipfs to retrieve block contents via a URL instead of storing them in the datastore
- the IPFS node does not need to run on the same box as the ZIM files, or it can run as a different user that has no access to the files and reads them only over HTTP (the server must support range requests)
- introduces some overhead and a potential bandwidth cost: the entire file is fetched and chunked in memory during the initial `ipfs add` and then discarded, and specific byte ranges are fetched again every time someone asks for the CID over IPFS
- I believe this is feasible as long as you run the IPFS box on the same LAN, so the HTTP transfer happens within the local network (which is both fast and avoids redundant costs from duplicated WAN traffic)
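For the filestore option above, a minimal sketch (with placeholder paths) of adding a ZIM and checking that nothing was duplicated might look like this; `ipfs filestore ls` and `ipfs filestore verify` report the blocks that are backed by files on disk:

```bash
# Sketch for the filestore option: enable it, add a ZIM without copying it
# into the datastore, then check that the blocks reference the original file.
# /path/to/foo.zim is a placeholder.
ipfs config --json Experimental.FilestoreEnabled true
# (restart the ipfs daemon here so the setting takes effect)

CID=$(ipfs add --nocopy --cid-version=1 -Q /path/to/foo.zim)
echo "added as $CID"

ipfs filestore ls | head      # blocks backed by files on disk instead of the datastore
ipfs filestore verify | head  # per-block integrity status against the file on disk
```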
@rgaudin In both cases `ipfs add` should be easy to wire up with any ZIM detection scripts you come up with (a combined sketch follows below):
- if you pass `-Q`, it will write only the final CID to stdout
- if you pass `-p`, it will also print a progress bar to stderr, which is handy for debugging in interactive mode
- `ipfs files cp /ipfs/{cid} /foo.zim` will also protect the file from being garbage-collected if you decide to enable GC at some point; when you no longer want to provide it to the network, run `ipfs pin rm {cid} ; ipfs files rm /foo.zim`
Let me know if you have any questions/issues.
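Putting the pieces above together, a minimal publish/unpublish helper might look like the sketch below; it assumes the filestore setup (the node can read the ZIM path directly), and the function names and MFS layout are illustrative:

```bash
#!/usr/bin/env bash
# Sketch of publish/unpublish helpers around `ipfs add` and MFS,
# assuming the node has local (filestore) access to the ZIM files.
set -euo pipefail

publish_zim() {
  local zim_path="$1"
  local name cid
  name="$(basename "$zim_path")"
  # -Q prints only the final CID, which is easy to capture from a script
  cid="$(ipfs add --nocopy --cid-version=1 -Q "$zim_path")"
  # a named copy in MFS also protects the file from garbage collection
  ipfs files cp "/ipfs/$cid" "/$name"
  echo "$cid"   # report this back to the zimfarm together with a gateway URL
}

unpublish_zim() {
  local cid="$1" name="$2"
  ipfs pin rm "$cid"
  ipfs files rm "/$name"
}

# example: publish_zim /path/to/foo.zim
```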
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Anything I could do to help with this?
Support for remote pinning is built-in since go-ipfs 0.8.0, so I think the farm could more or less do:
1. `ipfs add --nocopy --cid-version=1 /path/to/foo.zim` → will produce `{cid}`
2. `ipfs pin remote service add someService secret`
3. `ipfs pin remote add --service=someService --name=foo.zim {cid}`
Step (3) will wait until the entire ZIM is pinned remotely, so when it's done we no longer need to run IPFS on the farm.
Figuring out the service for (2) is probably the first step here (we could use some sponsorship to pay for https://pinata.cloud/ or use something else – TBD); see the sketch below.
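As shell commands, those three steps might look like the following sketch; the service name, endpoint URL and token are placeholders until a pinning service is chosen:

```bash
# Sketch of the remote-pinning flow; someService, the endpoint URL and
# $PINNING_TOKEN are placeholders for whatever service gets chosen.
CID=$(ipfs add --nocopy --cid-version=1 -Q /path/to/foo.zim)

# one-time: register the remote pinning service (name, API endpoint, access token)
ipfs pin remote service add someService https://pinning.example.com/api/v1 "$PINNING_TOKEN"

# pin the CID remotely; by default this waits until the service reports "pinned"
ipfs pin remote add --service=someService --name=foo.zim "$CID"

# check the pin status later if needed
ipfs pin remote ls --service=someService --name=foo.zim --status=queued,pinning,pinned,failed
```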
@lidel We are busy with other urgent topics, sorry for the lack of reaction on our side. We have just read from Jonathan that https://estuary.tech/ might help us with the pinning problem.
Update regarding estuary.tech by @lidel: this is a new option which does not require us to run an IPFS node at all; we just do an HTTP POST to https://docs.estuary.tech/api-content-add and the Estuary service takes care of everything (IPFS for a hot CDN cache + Filecoin for long-term storage) and returns a CID (content identifier), which can then be used for fetching the content from IPFS (either natively or via any of the public gateways).
That would solve half of our problem, the other half being a native IPFS client library... and this should be pretty easy to implement.
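For the Estuary half, a minimal upload sketch could look like the following; the endpoint and form field reflect my reading of the Estuary docs and should be double-checked there, and $ESTUARY_TOKEN is a placeholder:

```bash
# Sketch of an Estuary upload: no local IPFS node needed. The exact endpoint
# and field names should be verified against docs.estuary.tech.
curl -X POST https://api.estuary.tech/content/add \
  -H "Authorization: Bearer $ESTUARY_TOKEN" \
  -F "data=@/path/to/foo.zim"
# the JSON response contains the CID, which any public gateway can then serve, e.g.
#   https://dweb.link/ipfs/<cid>
```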
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
@kelson42 I was not able to block time for this during the hackweek, but I made some progress this weekend. I started fleshing out the details of the self-hosted strategy ((A) from https://github.com/lidel/zim2ipfs/issues/1), mostly reading about the zimfarm and figuring out the necessary building blocks.
My initial notes and questions are in https://github.com/lidel/zim2ipfs/pull/2 (here is a readable version).
The gist is that I need some guidance / a sanity check on the best way to plug IPFS publishing into the existing infrastructure, and on how to track which ZIMs have a CID and which ones do not – see the questions at the end; I would appreciate feedback in your spare time.
@lidel Thank you for this documentation; it's very helpful and allowed me to catch up to where you are on the subject. I don't have any answer ATM, but we'll discuss it with @kelson42, as this receiver piece is supposed to change as well. Thank you for laying out the options.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
We are moving forward with using IPFS as (part of) our distribution strategy.
Therefore we need to publish (part or all of) our ZIM files on IPFS.
Uploading ZIM files is usually the duty of the Zimfarm, with the "CMS" then being in charge of deciding whether or not to put them in the official published library.
The point here is that there is a difficulty because:
Therefore I believe the zimfarm-receiver Docker container, which runs on download.kiwix.org, should somehow ask the IPFS-node Docker container to read/share the new ZIM file and then give back the IPFS URL.
Maybe we could have the ZIM file hashed on the zimfarm-receiver's request, and then have it put online (shared within the DHT network) once the CMS has decided to publish the file.
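As a rough illustration of the "hash and share" part of that flow (deferring the DHT announcement until the CMS decision would need an additional mechanism), assuming the IPFS node runs in a container named ipfs-node with the ZIM directory mounted at /zims:

```bash
# Sketch: the receiver asks the IPFS-node container to add a freshly
# uploaded ZIM and gets back IPFS URLs to store in the zimfarm/CMS.
# Container name, mount point, file name and gateway are illustrative.
zim="foo.zim"
cid=$(docker exec ipfs-node ipfs add --nocopy --cid-version=1 -Q "/zims/$zim")

echo "ipfs://$cid"
echo "https://dweb.link/ipfs/$cid"
```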