jupyterhub / hubshare

A directory sharing service for JupyterHub
BSD 3-Clause "New" or "Revised" License

consider existing sharing tools #14

Open minrk opened 7 years ago

minrk commented 7 years ago

I wanted to drop a note here that the best sharing experience I've seen for a notebook deployment is CERN's, which uses CERNBox, an instance of ownCloud, an open source Dropbox clone. It allows sharing, granular permission management, browsing, etc. It's much more full-featured than hubshare is intended to be. It also requires zero integration or awareness from JupyterHub or the single-user notebook. It's all in the setup of the environment in which notebooks are run.

It may make more sense in the end to build a JupyterHub deployment that works with ownCloud / Nextcloud / etc. than to build and maintain our own sharing service that's as severely scope-limited as we are planning to make hubshare. Such an integration could also have a much smoother transition for real-time collaboration, which is orthogonal to hubshare as planned.

yuvipanda commented 7 years ago

I strongly believe in this too.

jankatins commented 7 years ago

ownCloud is a PHP app, and it would be nice not to have to install yet another stack.

We currently install a Jupyter server without any user separation (just folders) because sharing isn't currently included in JupyterHub (security is handled by a VPN). What we need is some kind of "drop it there and others can see (and maybe play with) it", so this repo looks (or looked?) very promising.

willingc commented 7 years ago

@minrk and others: Microsoft Azure has done a pretty nice job with their sharing. https://notebooks.azure.com/faq#libraries

willingc commented 7 years ago

Perhaps they would open source some of their code, or collaborate with us here.

parente commented 7 years ago

> It may make more sense in the end to build a JupyterHub deployment that works with ownCloud / Nextcloud / etc. than to build and maintain our own sharing service that's as severely scope-limited as we are planning to make hubshare.

Our current off-the-shelf setup relies on nbviewer running in Docker with user notebook directories mounted from local disk, plus the nbexamples extension. It was quick and easy to set up, but we definitely want to go beyond it. The key features we are looking for are:

  1. Search across all shared notebooks (around 7000)
  2. Reasonable previews of notebook hits (title, summary, tags, thumbnail, ...)
  3. Rendered notebooks without having a notebook server (reports for non-notebook users)
  4. Clone a notebook of interest to the user's notebook server
  5. Click to share a notebook from the user's server
  6. Some sense of the notebook's provenance and history

I realize that's not a trivial list.

I've been looking at https://github.com/airbnb/knowledge-repo and thinking about how we might build atop it. I've also been thinking about what a notebook catalog backed by git with LFS might look like if we just slap a search index and UI on top. And, of course, I'm looking at what's being discussed here in hubshare to see if it might shape up to fit our use cases instead.

I don't have any strong recommendations or opinions to share yet. Just putting some info out in the open.

rgbkrk commented 7 years ago

I've been lurking within hubshare to see where things are going while building something a bit different in commuter with @cabhishek and other folks.

I don't wish to be prescriptive about how everyone should do their own setup. My explanation here is mostly to paint a picture of what we have in progress and where we're heading.

Background and motivation

* Real-time sessions with themselves totally have to happen.

Commuter Roadmap

Our Roadmap lists out our stages of development:

  1. View notebooks via URL (from S3 underneath)
  2. Connect to kernels from a configured source (JupyterHub)
  3. Save notebooks back
  4. Create a server-side in-memory model of the notebook and transient models, and push it to all clients

We plan to ship fairly rapidly during each of these stages and gather feedback.

Publishing

For us, the act of sharing or publishing is simply editing. There is no distinct publishing step: you can share with other people on the same server immediately. Users want to share with URLs like:

https://notebooks.mysite.com/kyle/some-notebook.ipynb

In the current version, when you load a page you're using an implementation of the contents API on top of S3. People should still be able to view the content even if they don't have a notebook server up; this is what enables that.
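As a rough sketch of what that looks like from a client's point of view (the hostname and notebook path are hypothetical, and authentication is ignored), a contents-API-backed viewer serves the standard contents model, notebook JSON included:

```python
import requests

# Hypothetical deployment; the host and notebook path are made up.
BASE = "https://notebooks.mysite.com"

# Standard Jupyter contents API shape: GET /api/contents/<path> returns a
# model with name, path, type, and (for notebooks) the nbformat content.
resp = requests.get(f"{BASE}/api/contents/kyle/some-notebook.ipynb")
resp.raise_for_status()
model = resp.json()

print(model["name"], model["type"])   # some-notebook.ipynb notebook
cells = model["content"]["cells"]     # regular nbformat cells
print(f"{len(cells)} cells")
```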

All of this is leading us towards editing and running notebooks, using the realtime model we've been carving out. The long-term goal is to get as close to Google Docs-style collaboration as possible.

As for indexing and discovery, we intend to maintain a separate service for indexing notebooks in Elasticsearch that can be built (or rebuilt) from our existing S3 bucket.
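A minimal sketch of what rebuilding such an index from the bucket could look like, assuming boto3 and the elasticsearch Python client (the bucket name, index name, and document shape are all hypothetical, and client APIs vary between versions):

```python
import json

import boto3
from elasticsearch import Elasticsearch

# Hypothetical bucket and index names.
BUCKET = "my-notebooks"
INDEX = "notebooks"

s3 = boto3.client("s3")
es = Elasticsearch()

# Walk the bucket and index basic metadata plus cell source, so the index
# can always be rebuilt from the bucket alone.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".ipynb"):
            continue
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        nb = json.loads(body)
        es.index(
            index=INDEX,
            id=key,
            body={
                "path": key,
                "last_modified": obj["LastModified"].isoformat(),
                "source": "\n".join(
                    "".join(cell.get("source", [])) for cell in nb.get("cells", [])
                ),
            },
        )
```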

Since we know full well that we may want to try different approaches to some of this, we aren't just publishing our opinionated server to npm - we're also pushing out individual React components (commuter is a monorepo that uses lerna underneath). We've got some UI mockups if you want to check those out too.


As an aside, we are steadfastly stabilizing nteract/nteract while exporting packages from inside it for external use. The goal there is to use them in commuter as well as to provide our transforms (renderers, in JupyterLab parlance).

minrk commented 7 years ago

Thanks for your use cases!

I've been thinking about this as well, and another approach, inspired by conversations with @rgbkrk, is to build on top of the contents API rather than starting from scratch. The Jupyter Server proposal should make it easier to deploy a server that's just the contents API, and could be made to allow extending the contents API with things like ownership, sharing, etc., but I think I can import contents from notebook 4.x without much difficulty.

Such an application could start with:

and that should have all of the features of our current proposal, short of the hash. The mechanism for moving files between hubshare and single-user environments would be unchanged.

benefits:

downsides relative to the current design:

I'm increasingly convinced that building on top of contents is a good idea. Enough so that I'll try to sketch out a more formal spec for how it would work, and maybe even sketch what an extended-contents server would look like, pulling from notebook 5.0.
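For illustration only, a bare-bones version of that idea might subclass an existing ContentsManager and bolt ownership/sharing metadata onto the standard contents model (the class name, extra fields, and in-memory storage are all hypothetical, not a proposed spec):

```python
from notebook.services.contents.filemanager import FileContentsManager


class SharedContentsManager(FileContentsManager):
    """Hypothetical sketch: layer ownership/sharing metadata on top of a
    stock ContentsManager. The in-memory dicts are illustrative only; a
    real implementation would persist them (database, filesystem, ...)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._owners = {}   # path -> owning username
        self._shares = {}   # path -> set of usernames with read access

    def get(self, path, content=True, type=None, format=None):
        model = super().get(path, content=content, type=type, format=format)
        # Extra fields ride along on the standard contents model.
        model["owner"] = self._owners.get(path)
        model["shared_with"] = sorted(self._shares.get(path, ()))
        return model

    def share(self, path, username):
        """Grant another user read access to a path (hypothetical API)."""
        self._shares.setdefault(path, set()).add(username)
```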

ssanderson commented 7 years ago

@minrk, speaking as someone who's done a fair amount of work building things on top of the Contents API, I think building sharing on top of / as an extension to the Contents API sounds very plausible. At the very least, re-using the file and directory model definitions for a sharing protocol strikes me as a good idea.

For what it's worth, pgcontents already has a notion of users associated with the files it stores, and I've long been of the opinion that someone could add sharing support to it in a day or two by adding a "shared_notebooks" table or something like it. We (at Quantopian) actually use S3 as the persistent storage for shared notebooks, but that's mostly because we wanted to easily integrate with an existing non-Jupyter application (our forums), and because our sharing model is "publish this notebook to the world" rather than "share this notebook with a particular set of users". If we ever build limited user-to-user sharing, it will almost certainly be in terms of a Contents API extension.
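To make that concrete (purely as a sketch; the table and column names are hypothetical and not part of pgcontents), a "shared_notebooks" table could be as small as:

```python
from sqlalchemy import Column, Integer, MetaData, Table, Unicode, UniqueConstraint

metadata = MetaData()

# Hypothetical sketch of a sharing table alongside pgcontents' existing
# tables; names and columns are illustrative only.
shared_notebooks = Table(
    "shared_notebooks",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("owner", Unicode(30), nullable=False),        # user who shared the notebook
    Column("path", Unicode(300), nullable=False),        # path within the owner's files
    Column("shared_with", Unicode(30), nullable=False),  # user granted read access
    UniqueConstraint("owner", "path", "shared_with"),
)
```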

minrk commented 7 years ago

Thanks, @ssanderson!

minrk commented 7 years ago

After various conversations this week, I think I want to separate the 'push/pull from notebook servers' task (main goal here) from the 'public index/search/discovery' task that comes up a lot when you start using words like 'share'. I'm going to explore deploying commuter behind JupyterHub as the discovery application, and then scope-limit HubShare as 'just' a multi-user contents API that users can push to / pull from. If both commuter and HubShare are talking to the same storage, then deploying them together makes sense. In particular, this may mean no UI pages served by HubShare, only the JupyterLab extension.

It seems that for lots of deployments where there is already a shared filesystem and/or a backing/syncing store, HubShare wouldn't be needed, only commuter. I've talked to @rgbkrk about supporting local storage for commuter, which I think would make this cover a pretty wide set of use cases.

surajzinjad commented 6 years ago

@minrk is there any way I can share notebooks in JupyterHub?

perllaghu commented 6 years ago

We are running in an OpenStack cloud, so I wrote a Notebook ContentsManager plugin. Whilst the tree view would store files using SwiftStore, the problem was that when a user wants to do os.open(...) within a notebook (to access the data given for the coursework), that call goes straight to the local disk.
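For anyone unfamiliar with that wiring, a ContentsManager plugin is pointed at via the server configuration; the module/class name below is a placeholder, not a real package:

```python
# jupyter_notebook_config.py -- classic notebook server configuration.
# "swift_contents.SwiftContentsManager" is a placeholder for whatever
# ContentsManager plugin talks to object storage. Only operations that go
# through the contents API (the tree view, saving notebooks, ...) use it;
# a plain os.open(...) inside a running kernel still hits local disk.
c.NotebookApp.contents_manager_class = "swift_contents.SwiftContentsManager"
```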

If one is trying to supply persistent data storage, we've found NFS to be the only viable tool for off-machine storage. (We also tried using Cinder as a block-storage medium... not particularly successfully.)

I think it's important not to lose sight of users' persistent storage, over and above just handling .ipynb files.

arfon commented 5 years ago

> I wanted to drop a note here that the best sharing experience I've seen for a notebook deployment is CERN's, which uses CERNBox, an instance of ownCloud, an open source Dropbox clone. It allows sharing, granular permission management, browsing, etc. It's much more full-featured than hubshare is intended to be. It also requires zero integration or awareness from JupyterHub or the single-user notebook. It's all in the setup of the environment in which notebooks are run.

@minrk - are there any good write ups of what the folks at CERN did here? I'm interested in some kind of Dropbox and/or Google Drive integration with our JupyterHub deployment here at STScI and searching for existing implementations landed me on this issue :-)

ablekh commented 5 years ago

@arfon Here are some relevant materials that I ran across (CERNBox documentation is sparse currently):

@minrk I would suggest using Nextcloud instead of ownCloud (the former is a fork of the latter) due to its more liberal software license (enterprise features are open source in Nextcloud, unlike in ownCloud; this needs some additional investigation, as I suspect a small number of ownCloud's enterprise features, e.g. Workflows, are not present in Nextcloud's codebase), as well as its arguably larger and more dynamic open source community.

willingc commented 5 years ago

@arfon @ablekh Thanks for passing along this info. I agree it does seem like a promising approach.

@minrk As an FYI, there was a keynote at KubeCon in Copenhagen by CERN, and the speaker may be a good person to contact. Also copying @betatim for your contacts at CERN, and because you would find this interesting too.

willingc commented 5 years ago

@rgbkrk ^^

betatim commented 5 years ago

Unfortunately, my level of contact with the people at CERN who work on SWAN (what I think they call their JupyterHub deployment) is not enough to persuade them to publish more documentation and take a bigger part in the JupyterHub community.

From what I gather, they use FUSE to mount ownCloud directories in the Docker containers that get launched for users. This makes it easy to sync work (or removes the need for syncing) between your laptop, your hub home directory, and other places. I'm not sure how/if it solves the problem of sharing between users.

Plug: because syncing files between my "home directory" on a hub that runs on Docker/Kubernetes (where I can't use ssh to copy files) and my local machine is so tedious, I've been investigating Syncthing (Binder demo: https://github.com/betatim/binder-syncthing). It works (files are synced both ways), but after about 60s an error message pops up in the Syncthing web UI. It seems related to Syncthing trying to auto-update itself (I think); if anyone has ideas on how to disable that check, I'd love to know.

elgalu commented 5 years ago

@georghildebrand found a solution for syncing files between your laptop, your hub home directory, and other places. Georg, could you fill in your approach? I remember it was based on some open source P2P solution?

beenje commented 5 years ago

For those interested, I maintain a JupyterHub instance at work, and we use Nextcloud to synchronise notebooks and data between the JupyterHub server and the users' computers.

Note that this doesn't allow sharing files between users :-( Only files put directly in the user's Nextcloud directory are visible. Shared links and group shares are not kept under /opt/nextcloud/html/data/<username> and can't be seen.

So I'm still looking for an easy way for users to share notebooks together.

Still not at the integration level I'd like to have.

betatim commented 5 years ago

Have you looked at https://github.com/OpenHumans/jupyter-gallery (deployed at https://exploratory.openhumans.org/)? I think Open Humans is pretty happy with it as a way for users to share notebooks with others and discover what others have shared.

ellisonbg commented 5 years ago

I posted a comment on #17 but also wanted to record part of it here.

A key requirement of hubshare is treating directories as immutable, being able to track a directory's lineage through a linked list of shas (from both the versioning and sharing perspectives), and having different permissions for different shas. This requirement came from our experience trying to handle homework distribution and collection for nbgrader, but is also relevant in other contexts.
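A toy sketch of that model (names are hypothetical; this is just to pin down the shape of the requirement, not a proposed implementation):

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional, Set


@dataclass(frozen=True)
class DirectoryVersion:
    """One immutable snapshot of a shared directory, addressed by its sha."""
    sha: str                                         # content hash of this snapshot
    parent_sha: Optional[str]                        # previous version, or None for the root
    readers: Set[str] = field(default_factory=set)   # per-sha permissions
    writers: Set[str] = field(default_factory=set)


def lineage(versions: Dict[str, DirectoryVersion], sha: str) -> Iterator[DirectoryVersion]:
    """Walk the linked list of shas from a version back to its root."""
    while sha is not None:
        version = versions[sha]
        yield version
        sha = version.parent_sha
```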

The focus on directories also comes from that context, as well as the experience from binder that directories are a good "unit of reproducibility".

I think the idea that @minrk has pitched before (of contents manager + permissions) would also be very useful, but I don't yet see how to square it with the requirements of hubshare as currently spec'd out in the REST API in this repo. Maybe they are separate things? Maybe there is a way to unify them?

From a philosophical perspective - one of the things that Jupyter has been good at is innovating on open protocols and standards, and encouraging the broader community to build around and on top of those things. I view hubshare (or more generally - protocols for document/directory sharing) as one of those things that is important enough for Jupyter to take our time and build it in a way that lives up to the vision we have established in our other standards (kernel protocol, jupyter server REST APIs, notebook document format, parts of JupyterHub, etc.).