Closed: Analect closed this issue 4 years ago
\o/ Thank you for your well thought out questions! I want to acknowledge I've seen them, but am travelling presently - will respond in bits and pieces!
On Thu, Jan 5, 2017 at 2:56 AM, Analect notifications@github.com wrote:
@yuvipanda https://github.com/yuvipanda Thanks for all your work on kubespawner. I've started experimenting with running jupyterhub on kubernetes, largely thanks to this spawner, but I wanted to get some guidance around my use-cases / workflow from someone a bit more seasoned in this technology. I'm structuring these as a series of high-level questions, where your input would be much appreciated. For ease of explanation, I may refer to the rough sketch below.
[image: image] https://cloud.githubusercontent.com/assets/4063815/21677128/9bc79b9c-d330-11e6-85a5-f8602b0bbff1.png
My efforts so far, for context: I was working through the data-8/jupyterhub-k8s https://github.com/data-8/jupyterhub-k8s implementation, which I think bases itself off your work; since its structure is in chart form (for helm), it's the easiest to work with compared to some of the other implementations I've found out there.
I modified that set-up slightly to handle gitlab authentication (rather than google), which worked OK, but I wasn't able to get the spawning of their large user image (>5GB), based on this Dockerfile https://github.com/data-8/jupyterhub-k8s/blob/master/user/Dockerfile and their hub image https://github.com/data-8/jupyterhub-k8s/blob/master/hub/Dockerfile, to work. It was constantly stuck in a Waiting: ContainerCreating state and would then try to re-spawn itself. I haven't figured out what the problem is, but there appears to be plenty of space on the cluster. I'm using v1.5.1 of kubernetes on GCE.
Anyway, I ended up getting things working by instead using the hub image (dockerfile below), a variation of the data-8 one, in conjunction with your yuvipanda/simple-singleuser:v1 https://github.com/yuvipanda/jupyterhub-simplest-k8s/blob/master/singleuser/Dockerfile user image.
```dockerfile
FROM jupyterhub/jupyterhub-onbuild:0.7.1

# Install kubespawner and its dependencies
RUN /opt/conda/bin/pip install \
    oauthenticator==0.5.* \
    git+https://github.com/derrickmar/kubespawner \
    git+https://github.com/yuvipanda/jupyterhub-nginx-chp.git

ADD jupyterhub_config.py /srv/jupyterhub_config.py
ADD userlist /srv/userlist
WORKDIR /srv/jupyterhub
EXPOSE 8081
CMD jupyterhub --config /srv/jupyterhub_config.py --no-ssl
```
This was able to spawn new user persistent volumes, bind them to PVCs and obviously spawn user jupyter notebook servers, which could be stopped/started and re-use the same PV. My initial tests as to whether new files/notebooks were getting persisted on the PV were failing, since I wasn't saving them under /home, which is where the binding to the volume https://github.com/data-8/jupyterhub-k8s/blob/master/hub/jupyterhub_config.py#L33-L47 is happening.
i. user management / userid - After various aborted attempts to get the larger data-8 user image working, during which user PVs weren't deleted, I noticed that the userid appended to the username for naming the PV incremented upwards, but it wasn't clear where this numbering logic was coming from, as it wasn't an env variable in any of the manifests. Is this a fail-safe of some sort?
Currently, I'm using a whitelist userlist for users (see code from jupyterhub_config.py below), and these correspond with my users' gitlab logins that I'm authenticating against. However, it's probably not a clean solution. I see you are working on another approach on the fsgroup https://github.com/jupyterhub/kubespawner/commit/13edc761448f21b23f13d5b26b705b41c83b8c15 and just wanted to get a better understanding of the context of this solution?
```python
# Whitelist users and admins
c.Authenticator.whitelist = whitelist = set()
c.Authenticator.admin_users = admin = set()
c.JupyterHub.admin_access = True

pwd = os.path.dirname(__file__)
with open(os.path.join(pwd, 'userlist')) as f:
    for line in f:
        if not line:
            continue
        parts = line.split()
        name = parts[0]
        whitelist.add(name)
        if len(parts) > 1 and parts[1] == 'admin':
            admin.add(name)
```
ii. possibility for interchangeable images - I find the current default set-up with Jupyterhub allowing for spawning a single image very limiting. I can see from #14 https://github.com/jupyterhub/kubespawner/issues/14 that you are considering extending functionality in the kubespawner to allow for an image to be selected. @minrk https://github.com/minrk was able to confirm over here https://github.com/jupyterhub/jupyterhub-deploy-docker/issues/25#issuecomment-260932976 that it could be possible to pass this image selection programmatically via the jupyterhub API, although, as per this https://github.com/jupyterhub/jupyterhub/issues/891 issue, I'm not sure whether the hub API will work in a kubernetes context.
You pointed to an implementation by Google here https://github.com/sveesible/jupyterhub-kubernetes-spawner/blob/master/kubernetespawner/spawner.py#L174-L214. It's not clear to me where they are deriving their list of available images. How do you think something like this should work?
As per the sketch up top, I'm looking to handle a set-up where users have various private/shared repos (marked 1 above in sketch), from which docker images are generated and stored in a registry (2 above). Then my users (3 above) would be able to spawn a compute environment for their chosen repo and have it spawned in kubernetes (4 above), with the possibility, from 5 above, to have the repo cloned (maybe leveraging gitRepo http://kubernetes.io/docs/user-guide/volumes/#gitrepo) and for any incremental work performed on it while on the notebook server to be persisted (6).
iii. multiple simultaneous servers per user based on different images - As far as I understand, it's not presently possible with jupyterhub to allow a user to have multiple instances of a notebook server, each running a different image? Do the tools exist within kubernetes to potentially facilitate this? Thinking out loud, could this be facilitated by having multiple smaller persistent volumes for a user, based on the repo from which the server image is derived? Or maybe this could be achieved within a single PV, by using the subPath http://kubernetes.io/docs/user-guide/volumes/#using-subpath functionality?
```python
c.KubeSpawner.volumes = [
    {
        'name': 'volume-{username}-{repo-namespace}-{repo-name}',
        'persistentVolumeClaim': {
            'claimName': 'claim-{username}-{repo-namespace}-{repo-name}'
        }
    }
]
```
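[Editor's note: to illustrate the single-PV-with-subPath idea from iii, here is a rough, untested sketch. The mount path and the 'my-repo' subPath are hypothetical placeholders; the `volumes` / `volume_mounts` trait names match kubespawner of that era and may have changed since.]

```python
# Hypothetical sketch: one PVC per user, with each repo's working copy
# mounted from its own subPath directory on that single volume.
c.KubeSpawner.volumes = [{
    'name': 'volume-{username}',
    'persistentVolumeClaim': {'claimName': 'claim-{username}'},
}]
c.KubeSpawner.volume_mounts = [{
    'name': 'volume-{username}',
    'mountPath': '/home/jovyan/work',
    # subPath keeps each repo's files in its own directory on the one PV
    'subPath': 'my-repo',
}]
```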
iv. ideas around version-control - Given the various advantages derived from using kubernetes to host jupyter, I would be curious if you had some thoughts around whether kubernetes also potentially makes it easier to manage version control for notebooks and other files created while a user works in a notebook server environment. Perhaps something like preStop http://kubernetes.io/docs/user-guide/container-environment/#container-hooks hooks could be used to commit and push changes prior to a container shutting down.
Even facilitating a user being able to run git commands from a notebook server terminal ... and having SSH keys back to the version-control system handled via kubernetes secrets/config maps might be a start. Have you seen any implementations solving this?
Thanks for your patience in reading through this!
@yuvipanda ... just wondering if you've had any time to think about some of the items raised above. Much appreciated.
Yes! I have drafted a response! Will hopefully complete in a few hours. Thanks for your patience!
On Thu, Jan 5, 2017 at 4:26 PM, Analect notifications@github.com wrote:
> @yuvipanda https://github.com/yuvipanda Thanks for all your work on kubespawner. I've started experimenting with running jupyterhub on kubernetes, largely thanks to this spawner, but I wanted to get some guidance around my use-cases / workflow from someone a bit more seasoned in this technology. I'm structuring these as a series of high-level questions, where your input would be much appreciated. For ease of explanation, I may refer to the rough sketch below.
>
> [image: image] https://cloud.githubusercontent.com/assets/4063815/21677128/9bc79b9c-d330-11e6-85a5-f8602b0bbff1.png
This is an awesome sketch! May I ask how you created it?
> My efforts so far, for context: I was working through the data-8/jupyterhub-k8s https://github.com/data-8/jupyterhub-k8s implementation, which I think bases itself off your work; since its structure is in chart form (for helm), it's the easiest to work with compared to some of the other implementations I've found out there.
>
> I modified that set-up slightly to handle gitlab authentication (rather than google), which worked OK, but I wasn't able to get the spawning of their large user image (>5GB), based on this Dockerfile https://github.com/data-8/jupyterhub-k8s/blob/master/user/Dockerfile and their hub image https://github.com/data-8/jupyterhub-k8s/blob/master/hub/Dockerfile, to work. It was constantly stuck in a Waiting: ContainerCreating state and would then try to re-spawn itself. I haven't figured out what the problem is, but there appears to be plenty of space on the cluster. I'm using v1.5.1 of kubernetes on GCE.
>
> Anyway, I ended up getting things working by instead using the hub image (dockerfile below), a variation of the data-8 one, in conjunction with your yuvipanda/simple-singleuser:v1 https://github.com/yuvipanda/jupyterhub-simplest-k8s/blob/master/singleuser/Dockerfile user image.
> This was able to spawn new user persistent volumes, bind them to PVCs and obviously spawn user jupyter notebook servers, which could be stopped/started and re-use the same PV. My initial tests as to whether new files/notebooks were getting persisted on the PV were failing, since I wasn't saving them under /home, which is where the binding to the volume https://github.com/data-8/jupyterhub-k8s/blob/master/hub/jupyterhub_config.py#L33-L47 is happening.
Awesome! In the last week or so, I've spent a lot of time generalizing the helm configuration a lot more, and it should be more widely usable (with multiple authenticators support) soon. We're deploying it for UC Berkeley's class starting Monday, so will have more time to actually write documentation after that. I intend to get it included in github.com/kubernetes/charts eventually, to make it an officially supported way of installing JupyterHub.
> i. user management / userid - After various aborted attempts to get the larger data-8 user image working, during which user PVs weren't deleted, I noticed that the userid appended to the username for naming the PV incremented upwards, but it wasn't clear where this numbering logic was coming from, as it wasn't an env variable in any of the manifests. Is this a fail-safe of some sort?
>
> Currently, I'm using a whitelist userlist for users (see code from jupyterhub_config.py below), and these correspond with my users' gitlab logins that I'm authenticating against. However, it's probably not a clean solution. I see you are working on another approach on the fsgroup https://github.com/jupyterhub/kubespawner/commit/13edc761448f21b23f13d5b26b705b41c83b8c15 and just wanted to get a better understanding of the context of this solution?
There are multiple types of users / userids, which is confusing! The unix userid that the single-user container runs as is set via `c.KubeSpawner.singleuser_uid`. This is what is used for permission checks (writing things to persistent storage, for example - this is what was causing permission errors when writing to the mounted persistent volume). fsgroup is related to this as well - it should be set to a group that this unix user is part of, so that singleuser servers can mount and write to persistent volumes properly. In Kubernetes, this should ideally just always be one unix user that's the same for all users - they're all contained in containers, so this is ok.

As for deleting PVs - if you delete PVs you lose the data in them (since dynamically provisioned PVs always have reclaimPolicy: Delete). Hence it is a manual operation that is not automated at all - you have to delete the linked PVC manually, which will delete the PV (and lose your data).
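[Editor's note: a minimal sketch of the settings being described. The uid/gid values are arbitrary examples; the trait names below are from the kubespawner of that era and may have been renamed since.]

```python
# Run every single-user container as the same unix uid, with an fsGroup
# that this uid belongs to, so mounted PVs are writable by the notebook.
c.KubeSpawner.singleuser_uid = 1000
c.KubeSpawner.singleuser_fs_gid = 1000
```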
> ii. possibility for interchangeable images - I find the current default set-up with Jupyterhub allowing for spawning a single image very limiting. I can see from #14 https://github.com/jupyterhub/kubespawner/issues/14 that you are considering extending functionality in the kubespawner to allow for an image to be selected. @minrk https://github.com/minrk was able to confirm over here https://github.com/jupyterhub/jupyterhub-deploy-docker/issues/25#issuecomment-260932976 that it could be possible to pass this image selection programmatically via the jupyterhub API, although, as per this https://github.com/jupyterhub/jupyterhub/issues/891 issue, I'm not sure whether the hub API will work in a kubernetes context.
>
> You pointed to an implementation by Google here https://github.com/sveesible/jupyterhub-kubernetes-spawner/blob/master/kubernetespawner/spawner.py#L174-L214. It's not clear to me where they are deriving their list of available images. How do you think something like this should work?
>
> As per the sketch up top, I'm looking to handle a set-up where users have various private/shared repos (marked 1 above in sketch), from which docker images are generated and stored in a registry (2 above). Then my users (3 above) would be able to spawn a compute environment for their chosen repo and have it spawned in kubernetes (4 above), with the possibility, from 5 above, to have the repo cloned (maybe leveraging gitRepo http://kubernetes.io/docs/user-guide/volumes/#gitrepo) and for any incremental work performed on it while on the notebook server to be persisted (6).
This can be done currently with https://jupyterhub.readthedocs.io/en/latest/spawners.html#spawner-options-form. Are you thinking of the list of images as being static (ie specified by administrator) or dynamic? If dynamic it might be a little more difficult, but not impossible. I see you've already dug into this on Gitter - would love to see your solution so we can make it easier in KubeSpawner :)
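[Editor's note: a sketch of the static (admin-specified) case via the Spawner options form linked above. The image names are placeholders and the two helper functions are hypothetical, not part of any library.]

```python
# Hypothetical helpers for an admin-specified image list, wired into
# JupyterHub's Spawner options_form / options_from_form machinery.
ALLOWED_IMAGES = [
    'jupyter/minimal-notebook:latest',
    'jupyter/scipy-notebook:latest',
]

def build_options_form(images):
    """Render a <select> of allowed images for the spawn page."""
    options = ''.join(
        '<option value="{0}">{0}</option>'.format(img) for img in images
    )
    return '<select name="image">{}</select>'.format(options)

def options_from_form(formdata):
    """Parse the submitted form; each formdata value is a list of strings."""
    image = formdata.get('image', [''])[0]
    if image not in ALLOWED_IMAGES:
        raise ValueError('Image %r is not in the allowed list' % image)
    return {'image': image}
```

In a jupyterhub_config.py these would be hooked up roughly as `c.KubeSpawner.options_form = build_options_form(ALLOWED_IMAGES)`, with `options_from_form` defined on a Spawner subclass.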
> iii. multiple simultaneous servers per user based on different images - As far as I understand, it's not presently possible with jupyterhub to allow a user to have multiple instances of a notebook server, each running a different image? Do the tools exist within kubernetes to potentially facilitate this? Thinking out loud, could this be facilitated by having multiple smaller persistent volumes for a user, based on the repo from which the server image is derived? Or maybe this could be achieved within a single PV, by using the subPath http://kubernetes.io/docs/user-guide/volumes/#using-subpath functionality?
This is a little more difficult on the JupyterHub side, but active work is being done on this right now - follow https://github.com/jupyterhub/jupyterhub/issues/766 for more details!
> iv. ideas around version-control - Given the various advantages derived from using kubernetes to host jupyter, I would be curious if you had some thoughts around whether kubernetes also potentially makes it easier to manage version control for notebooks and other files created while a user works in a notebook server environment. Perhaps something like preStop http://kubernetes.io/docs/user-guide/container-environment/#container-hooks hooks could be used to commit and push changes prior to a container shutting down.
>
> Even facilitating a user being able to run git commands from a notebook server terminal ... and having SSH keys back to the version-control system handled via kubernetes secrets/config maps might be a start. Have you seen any implementations solving this?
>
> Thanks for your patience in reading through this!
If you are using GitHub for authentication, then we could possibly do something like generate a personal access token when the user logs in and then put it in an appropriate place on the notebook container, thus allowing users to pull / push natively. I think that's far better than wrapping git with some magic, which in my experience always ends badly. In https://github.com/yuvipanda/paws/blob/master/hub/jupyterhub_config.py#L41 I pass extra generated parameters into the single-user notebook from the hub, and we could do something similar here.
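[Editor's note: a minimal sketch of that idea. `make_user_env`, `token_lookup` and the `GITLAB_TOKEN` variable name are all hypothetical; the real wiring would go through something like `c.Spawner.environment` or a spawner's env-building hook.]

```python
# Hypothetical helper: look up a per-user VCS access token and expose it
# to the single-user container as an environment variable, so git can
# authenticate natively instead of being wrapped in magic.
def make_user_env(username, token_lookup):
    """Return extra environment variables for a user's notebook pod."""
    env = {}
    token = token_lookup(username)
    if token:
        env['GITLAB_TOKEN'] = token
    return env
```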
Action items from here are:
Feel free to ask follow up questions here or on gitter! Looking forward to seeing what cool things you are doing!
@yuvipanda . Thanks for your responses.
> This is an awesome sketch! May I ask how you created it?
I think you're going to be disappointed when I tell you: PowerPoint!
Yes, I've seen a flurry of activity cleaning up the data-8 implementation, which looks great. It would be nice to get an implementation under github.com/kubernetes/charts
Ref {user}-{user-id} ... thanks for the explanation. In my jupyterhub_config.py I have a whitelist of 3 or 4 users for testing ... and I'm at the same time authenticating these users against a gitlab authenticator ... and I noticed, as I was bringing the helm chart up and down, that it was sometimes incrementing a different id against my user ... see the case for my username below, where 1, 2, 3 and 4 got appended ... so there wasn't really any consistency in terms of which PV got attached to a container. Perhaps my jupyterhub.sqlite was somehow getting corrupted for this to have happened.
Ref. passing image to get spawned.
> If dynamic it might be a little more difficult, but not impossible.
OK, based on heavy prompting from @minrk ... I was able to modify jupyterhub_config.py to include this ... which was able to pick up new 'image' payloads passed to the JupyterHub API.
```python
from traitlets import observe
from kubespawner.spawner import KubeSpawner

class MySpawner(KubeSpawner):
    @observe('user_options')
    def _update_options(self, change):
        options = change.new
        if 'image' in options:
            self.singleuser_image_spec = options['image']

c.JupyterHub.spawner_class = MySpawner
```
So all the other `c.KubeSpawner` entries required in the jupyterhub_config.py then got changed to `c.MySpawner`.
I then pass this API call to jupyterhub ... and it appears to work. I have obviously pushed that image to my private docker registry first.
```shell
curl -v -X POST -H "Authorization: token my-testuser-token" \
  "http://jupyterhub.myserver.com/hub/api/users/testuser/server" \
  -d '{"image": "my-private-registry/my-simple-singleuser:v1.1"}'
```
However, it's not bullet-proof. For instance, for larger images (2GB+), I noticed kubernetes is sometimes slow to pull the image ... and so you end up in the situation in the table below, where it eventually aborts ... which isn't ideal. However, I found deleting the pod and then retrying the above seemed to resolve it. Maybe there's a better approach for pulling these images down to kubernetes ahead of time ... or maybe there's better performance if the images are pushed to a google registry (on the assumption one is using their kubernetes implementation, of course).
```
NAME                 READY     STATUS              RESTARTS   AGE
jupyter-testuser-4   0/1       ContainerCreating   0          6m
jupyter-testuser-4   0/1       ImagePullBackOff    0          8m
jupyter-testuser-4   0/1       ErrImagePull        0          12m
```
Obviously once the image is pulled to the kubernetes cluster, then spawning from the hub is a matter of seconds.
Ref multi-servers per user ... yes, I've been keeping an eye on this and this.
Ref. version-control ... I'm using a self-hosted gitlab rather than github. They have a similar user-token concept, so maybe, as you said, passing that as a 'secret' or 'config map' variable per user might work.
Given that I'm experimenting with spawning into 'lab' environments, rather than the classic notebook 'tree', I've been looking for ways to pass a template ... a bit like the notebooks.azure.com implementation does (although they are still working against the classic notebook).
It seems doing the same for jupyterlab is a bit more involved (see this issue), requiring a plugin on the jupyterlab end, but it appears some of the required tooling is in place with jupyterhub-labextension. I'm not sure this is ready for usage yet though.
If it were, then maybe one could potentially give a rudimentary way of pushing/pulling to a repo, by exposing, in my case, the gitlab API via some buttons on that template. I would be interested in whether you thought that viable or not.
Anyway, thanks for the dialogue on these matters.
@Analect I love how you thoroughly documented your thoughts in this issue! :heart:
I'm closing it now as it is stale and doesn't seem to have a specific action point related to it.