jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

peer2peer data hosting #625

Open betatim opened 6 years ago

betatim commented 6 years ago

This issue is more philosophical/discussion orientated than actionable (I think).

People want to use "data" on mybinder.org (or any Binder deployment). Currently we offer no particular integration (the advice is "use postBuild to fetch it" or "use your data's API from a notebook"), and we block things like FTP. If people started using mybinder.org with seriously large datasets (fetching all of the data and then subsampling on mybinder.org), it would increase our bandwidth costs.

All this made me think: is there a good peer2peer system that can be exposed as a POSIX filesystem that we can make available to each binder? Ideally one where data is only fetched if you actually read a file, not when you just list directories. For popular datasets, chances are we would already have a copy on an active binder and could transfer it internally. It would also allow people to access data that is otherwise only on FTP by somehow adding it to this distributed filesystem.
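To make the "only fetch on read, reuse copies from other binders" idea concrete, here is a toy sketch in Python. It is not a real filesystem and not an existing library: `LazyDataset`, `fetch_origin`, and `peer_cache` are hypothetical names, the peer-to-peer layer is modelled as a plain dictionary shared between sessions, and I'm assuming files are identified by a SHA-256 digest listed in a manifest.

```python
import hashlib
from typing import Callable, Dict, List, Optional


class LazyDataset:
    """Toy model of a lazily fetched, content-addressed dataset (hypothetical)."""

    def __init__(
        self,
        manifest: Dict[str, str],
        fetch_origin: Callable[[str], bytes],
        peer_cache: Optional[Dict[str, bytes]] = None,
    ) -> None:
        self.manifest = manifest          # path -> SHA-256 hex digest of the content
        self.fetch_origin = fetch_origin  # expensive path: costs external bandwidth
        # Stand-in for "copies held by other active binders", keyed by digest.
        self.peer_cache = peer_cache if peer_cache is not None else {}
        self.origin_fetches = 0           # how often we had to go outside

    def listdir(self) -> List[str]:
        # Listing only consults the manifest; no file content is transferred.
        return sorted(self.manifest)

    def read(self, path: str) -> bytes:
        digest = self.manifest[path]
        data = self.peer_cache.get(digest)
        from_peer = data is not None
        if not from_peer:
            data = self.fetch_origin(path)
            self.origin_fetches += 1
        # Verify the content regardless of where it came from, so a peer
        # cannot hand us something other than what the manifest names.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError(f"checksum mismatch for {path}")
        if not from_peer:
            self.peer_cache[digest] = data
        return data
```

With two sessions sharing one `peer_cache`, the first `read()` of a file goes to the origin and the second session's `read()` of the same file is served internally, while `listdir()` never transfers content. A real version would need something like FUSE to look like a POSIX filesystem, which this sketch does not attempt.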

My questions:

choldgraf commented 6 years ago

hmmm, this also feels like one of those "you should only get XXX amount of I/O bandwidth per binder session if it's free" kinda things, as you're right that we could get slammed with a ton of traffic if a super popular and gigantic dataset started getting downloaded by lotsa people.
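For what it's worth, a per-session cap like that could be sketched as a token bucket. This is only an illustration under my own assumptions: `SessionBandwidthBudget` and its parameters are made-up names, not anything BinderHub provides, and the caller passes in the current time explicitly to keep the example deterministic.

```python
class SessionBandwidthBudget:
    """Token-bucket limiter for one binder session (hypothetical sketch)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s   # sustained refill rate
        self.capacity = burst_bytes    # maximum bytes that can be saved up
        self.tokens = burst_bytes      # start with a full bucket
        self.last = 0.0                # timestamp of the previous call, in seconds

    def allow(self, nbytes: int, now: float) -> bool:
        """Return True and spend tokens if nbytes fits in the current budget."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, without exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```

A session that tries to pull a huge dataset in one go would see `allow()` return `False` until enough time has passed, while small or occasional downloads go through untouched.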

technically, I'm not really sure... though P2P downloads feel like they could have security issues, no?