jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add an IPFS content provider #1096

Closed: yuvipanda closed this issue 1 day ago

yuvipanda commented 2 years ago

Proposed change

IPFS is a content-addressable global 'file system' that can share directories using an immutable, globally unique content ID. It can be used to store code as well as data. There are also experiments in the pydata / zarr ecosystem on using it for some datasets: https://github.com/pangeo-forge/roadmap/issues/40.

I'd like us to add an IPFS content provider, perhaps using https://github.com/fsspec/ipfsspec as the provider backend. When given an IPFS content ID, we can just download that directory and let repo2docker do its normal thing. Since content IDs are immutable, this fits pretty well with what we want to do.
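A rough sketch of what such a provider could look like. This is a standalone illustration: in repo2docker the class would subclass repo2docker.contentproviders.base.ContentProvider, and the commented-out fsspec/ipfsspec call is an assumption about that backend, not verified API.

```python
import re

# CIDv0: "Qm" followed by 44 base58 characters (v1 CIDs omitted for brevity).
CIDV0_RE = re.compile(r"^Qm[1-9A-HJ-NP-Za-km-z]{44}$")


class IPFS:
    """Hypothetical repo2docker content provider for IPFS (sketch only)."""

    def detect(self, spec, ref=None, extra_args=None):
        """Claim the spec if it looks like an IPFS CID, optionally ipfs://-prefixed."""
        cid = spec[len("ipfs://"):] if spec.startswith("ipfs://") else spec
        if CIDV0_RE.match(cid):
            return {"cid": cid}
        return None

    def fetch(self, spec, output_dir, yield_output=False):
        """Download the directory behind the CID into output_dir.

        One option (assumed, not verified) would be fsspec with the
        ipfsspec backend:
            fsspec.filesystem("ipfs").get(spec["cid"], output_dir, recursive=True)
        """
        yield f"Fetching IPFS CID {spec['cid']} into {output_dir}\n"

    def content_id(self, spec):
        # CIDs are immutable, so the CID itself is a perfect content id.
        return spec["cid"]
```

Because content IDs already encode the content, `content_id` needs no extra hashing or ref resolution, unlike the git providers.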

Who would use this feature?

Ideally, this would eventually end up on mybinder.org and other binderhubs. IPFS can be a distributed alternative to storing code and content, vs something centralized like GitHub.

How much effort will adding it take?

I'd say most of the work would happen in https://github.com/fsspec/ipfsspec, and might already be done. Otherwise, I suspect it'll be minimal effort.

Who can do this work?

Some IPFS enthusiast, maybe :) Some basic understanding of IPFS concepts may be necessary to fully implement this.

yuvipanda commented 2 years ago

/cc @d70-t, @rabernat (and others? idk). This would be the first step of letting a binder pull in content from IPFS directly.

d70-t commented 2 years ago

If a content provider is mainly about getting a folder, then maybe

ipfs get <CID> -o <target_folder>

would do the trick (here's the doc). The catch is that you'd need a running ipfs node on the machine executing the ipfs get. But probably that's not too hard of a requirement 🤷 .
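A minimal sketch of that CLI route: shell out to `ipfs get` when the ipfs binary (and hence a local node) is available, and report failure so the caller can fall back to a gateway. The function names and fallback shape are illustrative, not part of repo2docker.

```python
import shutil
import subprocess


def ipfs_get_command(cid, target_folder):
    """Build the `ipfs get <CID> -o <target_folder>` invocation."""
    return ["ipfs", "get", cid, "-o", target_folder]


def fetch_via_cli(cid, target_folder):
    """Return True if the content was fetched via the local ipfs binary."""
    if shutil.which("ipfs") is None:
        # No local binary/node; the caller should use a gateway instead.
        return False
    subprocess.run(ipfs_get_command(cid, target_folder), check=True)
    return True
```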


ipfsspec could also be used to retrieve content via gateways on other machines; however, ipfsspec is more about getting the content into Python, not really about writing it back out to disk (though I assume that would be easy to put on top).


Another option might be the upcoming ?format=car option (and probably also tar etc...), which would make gateways support streaming out the entire graph behind a CID instead of just single files.
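For the gateway route, the two URL shapes in play could be built like this. The `?format=car`/`?format=tar` query parameter is the upcoming path-gateway option, and `/api/v0/get` is the existing RPC endpoint used in the curl examples below; the exact URL shapes are taken from this thread, not a verified spec.

```python
from urllib.parse import quote


def path_gateway_url(cid, gateway="https://ipfs.io", fmt="car"):
    """Path-gateway style: stream the whole DAG behind a CID as car/tar."""
    return f"{gateway}/ipfs/{quote(cid)}?format={fmt}"


def rpc_get_url(cid, gateway="https://ipfs.io"):
    """RPC style, as in the curl examples: returns a tar archive."""
    return f"{gateway}/api/v0/get?arg={quote(cid)}"
```

Either way, repo2docker would only need to download and unpack one archive per CID, with no local node required.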

yuvipanda commented 2 years ago

I guess this would require the ipfs binary to be present. I do think that most users of repo2docker will not have a daemon running locally, so falling back to gateways is quite important. ?format=car sounds like a good option - do you know if it is actually being implemented right now? I can't quite tell from that issue what the status of that is.

d70-t commented 2 years ago

So I think it is being developed, but it's more of a refactoring step that requires quite a bit of coordination; it is scheduled for the release after next. But the functionality seems to be there already.

E.g. you could use

curl -v -L "https://ipfs.io/api/v0/get?arg=QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG" > quickstart.tar

to obtain the quickstart folder from the tutorial using the public gateway at ipfs.io.

Likewise you'd obtain the same on your local gateway using:

curl -v -L "http://127.0.0.1:8080/api/v0/get?arg=QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG" > quickstart.tar

d70-t commented 2 years ago

I'm just thinking about what other benefits an IPFS content provider would have. One thing that might be interesting would be to use the CID of the binder or .binder folder as a cache key for built Docker images. That way, it might be possible to automatically rebuild images less often.
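That caching idea could look roughly like this; the registry name and helper names are made up for illustration:

```python
def image_ref_for_cid(cid, registry="registry.example.org/r2d-ipfs"):
    """Derive a deterministic image reference from a content ID.

    Docker repository names must be lowercase, but tags allow mixed case,
    so a CID can be used verbatim as the tag.
    """
    return f"{registry}:{cid}"


def needs_build(cid, existing_tags):
    """Skip the build when an image for this CID already exists."""
    return cid not in existing_tags
```

Since the CID changes exactly when the environment spec changes, an unchanged binder/.binder folder maps to an already-built image and the build can be skipped entirely.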

manics commented 1 day ago

Closing, see https://github.com/jupyterhub/repo2docker/pull/1098#issuecomment-2179631478