jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License

r2d2p2p: Distributed filesystems for caching and reproducibility? #865

Open bollwyvl opened 4 years ago

bollwyvl commented 4 years ago

Proposed change

Where might hooks be put in r2d to get some benefits from p2p networks?

A couple years back, @yuvipanda and I kicked around some ideas for leveraging distributed/p2p filesystems to make certain artifacts (notebooks, at the time) more durable/redistributable. At the time, we were looking at dat and IPFS, both of which are still kicking.

Spitballing around IPFS: in the container space, Netflix recently put some :muscle: behind using IPFS for Docker layers. This dovetails with, e.g., the Cloudflare IPFS gateway. So instead of just pushing containers to a registry, they could also be added to IPFS, and as long as one node is holding them, Cloudflare will apparently foot the CDN bill for free.
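
To make that concrete, here's a rough sketch of the "also add it to IPFS" half, just shelling out to the docker and ipfs CLIs. The helper name and the choice to add a whole `docker save` tarball (rather than individual layers, which is what the Netflix work does) are purely illustrative:

```python
import subprocess
import tempfile
from pathlib import Path


def add_image_to_ipfs(image_ref: str) -> str:
    """Save a local Docker image to a tarball and add it to IPFS, returning the CID."""
    with tempfile.TemporaryDirectory() as tmp:
        tarball = Path(tmp) / "image.tar"
        # Export the image (all layers + manifest) as a single tar.
        subprocess.run(["docker", "save", "-o", str(tarball), image_ref], check=True)
        # Add (and pin) the tarball; --quieter prints only the final CID.
        return subprocess.run(
            ["ipfs", "add", "--quieter", str(tarball)],
            check=True, capture_output=True, text=True,
        ).stdout.strip()


if __name__ == "__main__":
    cid = add_image_to_ipfs("my-binder-image:latest")
    # Any gateway can now serve it as long as one node keeps it pinned, e.g.
    # https://cloudflare-ipfs.com/ipfs/<cid>
    print(cid)
```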

In addition to built container layers, the pieces from which the layers themselves were built could be cached. For example, freeze.py, instead of just remembering which packages/versions/builds were used, could also record content hashes for the artifacts themselves, so they could be retrieved from the network later even if the original index went away.
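
Something like this sketch (not a proposal for the actual freeze.py format; `record_artifact_cids` and the JSON lockfile are hypothetical, though `ipfs add --only-hash` is a real flag that hashes without storing anything):

```python
import json
import subprocess
from pathlib import Path


def cid_for(path: Path) -> str:
    """Compute the IPFS CID of a file without adding it to the local store."""
    return subprocess.run(
        ["ipfs", "add", "--only-hash", "--quieter", str(path)],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def record_artifact_cids(artifact_dir: Path, lockfile: Path) -> None:
    """Hypothetical freeze step: write artifact name -> CID next to the pinned versions."""
    cids = {p.name: cid_for(p) for p in sorted(artifact_dir.glob("*.whl"))}
    lockfile.write_text(json.dumps(cids, indent=2, sort_keys=True))


# e.g. record_artifact_cids(Path("pip-download-cache"), Path("frozen-cids.json"))
```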

Alternative options

BitTorrent, Dat, or other approaches would also work, but they carry some different constraints and have less work happening around them in the ops space.

Who would use this feature?

Maybe nobody? This is probably not an "always-on" kind of thing, and might only make sense for certain kinds of deployments. So if the hooks existed, the notional r2d2ipfs would be something you could optionally install and configure.

But maybe if any of Docker Hub, gcr.io, anaconda.org, CRAN, or PyPI were to go down for an extended period of time, then as long as one node was hosting the stuff used to build all the binders an organization cared about, they'd still have a pretty good slice of what they needed to reconstitute their research, business, dissertation, etc.

If any of this worked, it would likely be a compelling feature to roll into a binder-at-home desktop appliance for distributing some of these very large artifacts, which would probably improve performance on a happy day, as well as further mitigate risk around shared/proprietary infrastructure on rainy days.

How much effort will adding it take?

From an API perspective, most of the p2p tools have tried to make it pretty straightforward to add and retrieve things by shelling out, so it's really a matter of deciding where such a thing would live, though freeze.py and push seem like likely initial places to add hooks. I guess individual buildpacks could also grow the capability to cache intermediate artifacts: in places where caches are cleared after an install, etc., IPFS hashes could be calculated/listed (so they become part of the log/layer), even if files aren't actually pushed someplace from inside the build (which sounds slow).
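
To illustrate where hooks might live (everything here, the `ArtifactCache` protocol, `IPFSCache`, and the call sites, is invented for the sake of discussion, not an existing r2d interface):

```python
import subprocess
from typing import Optional, Protocol


class ArtifactCache(Protocol):
    """Hypothetical hook interface a p2p backend (IPFS, Dat, ...) could implement."""

    def put(self, path: str) -> str:
        """Store an artifact (wheel, tarball, layer); return its content address."""

    def get(self, address: str, dest: str) -> Optional[str]:
        """Fetch an artifact by content address into dest; return None on a miss."""


class IPFSCache:
    """Shell-out implementation: `ipfs add` on put, `ipfs get` on get."""

    def put(self, path: str) -> str:
        return subprocess.run(
            ["ipfs", "add", "--quieter", path],
            check=True, capture_output=True, text=True,
        ).stdout.strip()

    def get(self, address: str, dest: str) -> Optional[str]:
        result = subprocess.run(["ipfs", "get", "--output", dest, address])
        return dest if result.returncode == 0 else None


# freeze.py, push, or individual buildpacks would call cache.put(...) at the
# points where they currently just log names/versions, so the content
# addresses end up in the build log (and could later be fetched with cache.get).
```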

Who can do this work?

There may be some opportunities to work with https://github.com/ipfs/package-managers to show an end-to-end use case of using IPFS for immutable, redistributable ops.

manics commented 4 years ago

I've got an open PR for abstracting the interface to the container engine: https://github.com/jupyter/repo2docker/pull/848 (parent issue https://github.com/jupyter/repo2docker/issues/682)

On one hand, this could make it easier to implement IPFS layer caching in the form of an alternative engine plugin. On the other hand, it might complicate things if you wanted to add distributed caching inside the buildpacks.
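
For example, an engine plugin might look something like this sketch; the wrapped engine's `push(image_spec)` generator signature is an assumption modeled on that PR rather than a stable repo2docker API, and `IPFSCachingEngine` is purely illustrative:

```python
import subprocess
import tempfile


class IPFSCachingEngine:
    """Wrap an existing engine; after a registry push, also add the image to IPFS."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def push(self, image_spec):
        # Normal registry push first.
        yield from self.wrapped.push(image_spec)
        # Then export the image and add it to IPFS as a fallback source.
        with tempfile.NamedTemporaryFile(suffix=".tar") as tarball:
            subprocess.run(["docker", "save", "-o", tarball.name, image_spec], check=True)
            cid = subprocess.run(
                ["ipfs", "add", "--quieter", tarball.name],
                check=True, capture_output=True, text=True,
            ).stdout.strip()
        yield "added {} to ipfs as {}\n".format(image_spec, cid)

    def __getattr__(self, name):
        # build, run, inspect_image, ... are delegated untouched.
        return getattr(self.wrapped, name)
```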

betatim commented 4 years ago

Is there some code we can run to see the IPFS layer distribution in action? I can't quite imagine how it would work for the user (would you use a different dockerd or ...?).

One reason we currently don't give the public access to the Docker registry of our GKE cluster is that we're afraid of the huge(?) network egress bill we'd generate. Just noting it as yet another thing we'd have to solve, or that might motivate this.


A little off-topic from r2d and p2p image distribution: about half a year ago I spent a bit of time looking at https://github.com/uber/kraken as a way to speed up image distribution within the cluster. I never figured out a really smooth/nice way to test drive it, though, and ended up unsure whether it would actually speed things up for us or not (I didn't benchmark what the bottleneck was for a cluster operator).

bollwyvl commented 4 years ago

> Is there some code we can run to see the IPFS layer distribution in action?

Again, this issue is more about where r2d might become extensible for this kind of stuff... IPFS is a stand-in for "something p2p". I don't think the IPFS release with the specific Docker support has dropped yet.

> network egress bill we'd generate

Right: I don't think the public binder federation is one of the deployments where this would make sense. Really, you'd want a peer-grade node-of-last-resort, e.g. a CDN, a cloud provider, an ISP, or a research university library working with their IT, to support a big public install.

> speed up image distribution

While this is what Netflix is after, from a systems perspective I see p2p as a resilience mechanism. This is why I am particularly interested in harvesting less-than-layer-sized pieces. For example: caching a python wheel or an npm tarball is interesting... but content sitting inside an opaque archive implicitly raises the rate of cache misses. IPFS supports transparently unpacking tarballs, but I'd really like to know what an alternate scheme that recursively unpacked all archives (and could transparently re-combine them) would do: basically, free delta compression across the whole network, to say nothing of per-version-of-a-package deltas.
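
To illustrate the intuition, here's a toy sketch (the wheel filenames are placeholders, and it only measures overlap between two versions rather than doing anything with IPFS):

```python
import hashlib
import zipfile


def member_hashes(wheel_path):
    """Map each file inside a wheel (a zip archive) to a sha256 of its contents."""
    with zipfile.ZipFile(wheel_path) as zf:
        return {name: hashlib.sha256(zf.read(name)).hexdigest() for name in zf.namelist()}


def shared_fraction(wheel_a, wheel_b):
    """Fraction of member-file hashes two wheel versions have in common.

    Files that didn't change between releases hash identically, so a
    content-addressed store that sees the *unpacked* members (rather than
    one opaque archive) only has to transfer the members that differ.
    """
    a = set(member_hashes(wheel_a).values())
    b = set(member_hashes(wheel_b).values())
    return len(a & b) / max(len(a | b), 1)


# e.g. shared_fraction("numpy-1.17.0-....whl", "numpy-1.17.1-....whl")
```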

However, it does look like benchmarking is something the Netflix effort takes seriously, so waiting to see what happens when that IPFS release comes out certainly makes sense.