Closed d70-t closed 2 years ago
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template, as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
Thank you for opening this, @d70-t! It was very nice to meet you at the zarr / IPFS meeting yesterday. /cc @rabernat who introduced and invited me to it, and @bollwyvl who is the ardent IPFS fan in the Jupyter community.
Currently, we explicitly allow only a certain set of ports for outgoing network traffic (https://github.com/jupyterhub/mybinder.org-deploy/blob/a988f3e74c0d07e738724b26c08a579f3da65653/mybinder/values.yaml#L49) to prevent abuse. IPFS is definitely very useful tech that I'm glad is being explored along with Zarr, and I want to try to enable its use on mybinder.org if possible. Opening outgoing port 4001 TCP should not be a problem. I also checked whether it would open us up to more cryptocurrency abuse (running a free compute service is now very difficult thanks to people abusing it to run miners), but I don't think it will change our security posture there in any way.
However, the question is whether just opening up port 4001 outgoing is enough, or if incoming connections also need to be allowed. Incoming connections aren't really possible, as the environment has a couple of NATs between the IPFS process and the open internet.
So, do you know how we can test this, @d70-t?
https://github.com/jupyterhub/mybinder.org-deploy/pull/2070 should open the outbound port - one option is to just try that and see if it works.
I'm not familiar with IPFS, but it sounds like HTTP gateways are available: https://docs.ipfs.io/how-to/address-ipfs-on-web/#http-gateways Is this sufficient for downloading data to mybinder.org?
Thanks @yuvipanda and @manics for looking into this. I firmly believe that opening the outgoing port should be sufficient. As the goal is to retrieve data from other peers, not to serve data to others, the node on binder needs to be able to connect to others, but others don't need to connect to the node on binder.
For checking, it should be sufficient to start up my testing repo, open a console window, and check whether some basic IPFS commands work. In particular, I'd check whether the node is able to connect to any peers by issuing

```shell
ipfs swarm peers
```

and whether files can be retrieved, e.g.:

```shell
ipfs cat /ipfs/QmSgvgwxZGaBLqkGyWemEDqikCqU52XxsYLKtdy3vGZ8uq > spaceship-launch.jpg
```

should download an image.
For the HTTP gateways: yes, that's a current workaround, but it has several drawbacks. Using HTTP (especially over the network) kind of defeats the purpose of IPFS (although it is a helpful transitioning technology). The goal of the IPFS protocol is to retrieve data based on its content-id from any server which is able to provide that dataset. As such, it connects to many (a configurable number of) peers to ask them for data, or for advice on where to find the data. One particular benefit is that this removes any single point of failure in the data retrieval step.
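The retrieval model described above can be sketched in a few lines of Python. This is only a toy illustration of content addressing, not IPFS's actual wire protocol or CID format:

```python
import hashlib

def cid(content: bytes) -> str:
    # Toy content identifier: real IPFS uses a multihash-based CID,
    # but the principle (id = hash of the content) is the same.
    return hashlib.sha256(content).hexdigest()

# Several independent peers each hold copies of some blocks.
data = b"some zarr chunk"
peer_a = {cid(data): data}
peer_b = {cid(data): data}

def fetch(content_id: str, peers):
    # Ask peers in turn; any one of them can satisfy the request,
    # so no single host is a point of failure.
    for peer in peers:
        if content_id in peer:
            block = peer[content_id]
            # The receiver can verify the block against the id it asked for.
            assert cid(block) == content_id
            return block
    raise KeyError(content_id)

print(fetch(cid(data), [peer_a, peer_b]) == data)  # True
```

Because the identifier is derived from the bytes themselves, the requester can verify whatever it receives, no matter which peer served it.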
For choosing HTTP gateways, there are two options: using a public one, or using a private / self-hosted / self-operated one. The public gateways can more easily be overloaded and tend to return data considerably more slowly than any private / local node would. They would also introduce further dependencies on services which explicitly state that one should not rely on their operation. Hosting one's own gateway would speed things up (one could make it prefer one's own data). But operating it is not a trivial task, as one has to choose between providing everything (which enables the use of colleagues' datasets) and providing only one's own data (which saves you from accidentally re-providing illegal content). Neither option is ideal. This problem disappears when accessing the data via the IPFS protocol, because there is no intermediary which re-provides (unknown) data. Instead, the requesting node would be forwarded directly to the colleague's machine to request the data from there.
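For completeness, the gateway workaround looks roughly like this from Python. The ipfs.io host below is just one public gateway; since the CID, not the host, names the content, any other gateway would serve the same bytes:

```python
def gateway_url(cid: str, gateway: str = "https://ipfs.io") -> str:
    # Path-style gateway address: https://<gateway>/ipfs/<cid>
    return f"{gateway}/ipfs/{cid}"

# CID of the example image used earlier in this thread
cid = "QmSgvgwxZGaBLqkGyWemEDqikCqU52XxsYLKtdy3vGZ8uq"
print(gateway_url(cid))
# -> https://ipfs.io/ipfs/QmSgvgwxZGaBLqkGyWemEDqikCqU52XxsYLKtdy3vGZ8uq

# Actually downloading then needs only the standard library:
#   import urllib.request
#   data = urllib.request.urlopen(gateway_url(cid)).read()
```

This illustrates the trade-off discussed above: the gateway host is a single point of failure and must be trusted to serve the right bytes, whereas the native protocol lets the node verify content against the CID itself.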
In the case of mybinder.org, I'd see the benefit in quickly starting up an environment which "just works". In that context, I would argue that having datasets delivered quickly and having the option to experiment with other, similar datasets are both very valuable and that's why I would prefer using the IPFS protocol directly over using HTTP on the "first mile".
Wooop woop 🚂 it works like a charm:
(I've updated the testing-repo to polish things a bit more...)
Awesome stuff.
We've got go-ipfs up at conda-forge, so it would be quite easy for folks to try it out by normal means, especially with jupyter-server-proxy. This would get one to a dashboard.
Effective use in binder might increase ipfs's value in mamba, where @wolfv had been considering it as a candidate for swappable backends. Unfortunately, there are some information-theory problems (one can't just turn a shasum into an ipfs CID), but the two really would be a lovely match.
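The mismatch can be seen with a small sketch: a CID hashes a chunked merkle structure rather than the raw bytes, so the same file gets different identifiers under different chunkings, and a plain shasum of the whole file cannot be mapped onto any of them. This is a deliberately simplified model, not go-ipfs's actual DAG builder:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(data: bytes, chunk_size: int) -> bytes:
    # Hash each fixed-size chunk, then hash the concatenated chunk hashes.
    # (Real IPFS builds a richer DAG, but the principle is the same.)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return sha256(b"".join(sha256(c) for c in chunks))

payload = b"example file contents" * 1000

plain = sha256(payload)             # what a shasum of the file gives you
root_256k = merkle_root(payload, 256 * 1024)
root_1k = merkle_root(payload, 1024)

# The same bytes yield different content identifiers depending on chunking,
# and none of them equals the plain shasum of the file.
print(plain != root_256k, root_256k != root_1k)  # True True
```

So knowing a package's sha256 tells you nothing about its CID; the file would have to be re-added (re-chunked and re-hashed) by an IPFS node to obtain one.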
https://docs.filecoin.io/mine/hardware-requirements/#general-hardware-requirements
Thanks for the input @bollwyvl! I'd love seeing IPFS being more commonly used. That would greatly reduce the amount of convincing necessary to get people involved :-)
So I'm not yet very familiar with conda or mamba. But if there are simpler ways of getting an IPFS node up and running than what's shown above, I'd greatly appreciate more input on that, not only on binder but also e.g. in HPC environments.
There's also a related thread at pangeo-forge, where we're thinking about other uses of zarr datasets on top of IPFS.
... if packages, scripts and data would be served across IPFS (or other content addressable systems), it might really be possible to have reproducible computing environments at relatively little extra cost. Very exciting...
Dear mybinder people,
I'm a happy user of binder and like the possibility to quickly open an interactive notebook, especially for teaching. There has been a field campaign with many participating people and institutions, producing a large number of atmospheric and oceanographic datasets, which we hope to make more visible and usable by creating the How to EUREC4A online book, consisting mainly of notebooks. We want to give users the opportunity to quickly interact with the book using the magnificent binder project.
We also face the issue that our datasets are scattered around the world, and for various reasons it is not easy to collect all of them in a single place or to ensure that datasets are not modified by accident. Furthermore, the currently used servers are unreachable from time to time, which severely degrades the user experience. To address these issues, we are experimenting with IPFS, a content-addressable distributed file system, to hold our data (as zarr). IPFS is a peer-to-peer system that makes it possible for everyone to host copies of the data, which should make data retrieval more reliable and lets us keep the data distributed (which we like).
So now to the issue: I've been trying to set up a sample repository which should run an IPFS node within a binder environment, such that I would be able to retrieve datasets stored on IPFS from within binder using the native IPFS protocol (which enables the desired redundancy in dataset retrieval opportunities). As far as I can tell, the IPFS node was able to start but could not connect to any peers. I assume the reason is that outgoing traffic on port 4001 (TCP and UDP) is not allowed on binder.
What do you think about accessing IPFS content from binder? Would it be possible to open up port 4001 for this purpose? Do you see other possibilities to access IPFS data?
I'm tagging @yuvipanda, as we've briefly talked about this before.