jupyterhub / jupyterhub-on-hadoop

Documentation and resources for deploying JupyterHub on Hadoop
https://jupyterhub-on-hadoop.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Provide Cloudera Parcel Instructions #1

Open jcrist opened 5 years ago

jcrist commented 5 years ago

It would be good to provide instructions on how to do an install using cloudera parcels. The code for creating these parcels should also live in this repository.

jcrist commented 5 years ago

cc @sodre

sodre commented 5 years ago

@jcrist, I'll start looking into this, but it will take longer than before because I lost access to the Cloudera Cluster I was using for testing.

sodre commented 5 years ago

@jcrist, heads up... here is a link to the initial version of what I am doing for this PR

sodre commented 5 years ago

When I last worked on creating parcels and CSDs, I had a git repo set up for each project and built the artifacts using Travis CI. The artifacts were then uploaded to GitHub whenever there was a new tag. The reasons I had two repos with two different versions/tags are:

Let me know how you would like me to proceed, i.e. create the two repos and add you as a collaborator, or something else entirely. I am completely open!

jcrist commented 5 years ago

This is really cool to see!

> Let me know how you would like me to proceed, i.e. create the two repos and add you as a collaborator, or something else entirely. I am completely open!

I think I'd prefer to keep them all together. Minimizing Travis workloads can be done with a bit of configuration (it's possible to skip the parcel build if nothing has changed, with a bit of git-fu). Having everything in the same repo makes it more obvious what exists and where to contribute.
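For reference, the skip could be as simple as a guard in the Travis config. A sketch under assumed names (the `cloudera/` path and `build-parcel.sh` script are hypothetical, not from this repo):

```yaml
# .travis.yml fragment (hypothetical layout): only build the parcel
# when something under cloudera/ changed in the pushed commit range.
script:
  - |
    if git diff --quiet "$TRAVIS_COMMIT_RANGE" -- cloudera/; then
      echo "No changes under cloudera/; skipping parcel build"
    else
      ./cloudera/build-parcel.sh
    fi
```

`TRAVIS_COMMIT_RANGE` is set by Travis for each build, so the guard compares only the commits that triggered the job.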


Since the Cloudera docs are a bit sparse here, can you talk through the components briefly? I'm not sure how everything (parcels, CSDs, manifests, etc.) interacts. A few specific questions:

Also, how could I be most helpful here?

sodre commented 5 years ago

> This is really cool to see!

Awesome, I will keep working on it!

> I think I'd prefer to keep them all together. Minimizing Travis workloads can be done with a bit of configuration (it's possible to skip the parcel build if nothing has changed, with a bit of git-fu). Having everything in the same repo makes it more obvious what exists and where to contribute.

No problem. Once I have both the CSD and parcel generation files living under one repo and working with Travis, I will create a PR moving sodre/jupyterhub-on-hadoop-cloudera to jcrist/jupyterhub-on-hadoop/cloudera. At that point, I will need your help on how to split the different Travis jobs.

> Since the Cloudera docs are a bit sparse here, can you talk through the components briefly? I'm not sure how everything (parcels, CSDs, manifests, etc.) interacts. A few specific questions:

> • Do you have a good resource for learning about this?

The Cloudera docs are not the best on that front. Everything I know on this topic I learned by referring to the cm_ext wiki, implementing the CSD for NiFi (nomr/nifi-parcel, nomr/nifi-csd), and looking at the CSDs for the Cloudera-supplied services (e.g. Impala, YARN, ZooKeeper). One lesson I learned is that it pays dividends to write the CSD configuration files in YAML and have the build scripts convert them to JSON.
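To illustrate the YAML-to-JSON tip: a CSD's service descriptor (service.sdl) is JSON, but it could be maintained as YAML and converted at build time. A hypothetical hand-maintained source might look like (field values are illustrative, not from this repo):

```yaml
# service.sdl.yaml -- hand-maintained source, converted to
# service.sdl (JSON) by the build scripts
name: JUPYTERHUB
label: JupyterHub
description: Multi-user server for Jupyter notebooks
version: 0.1.0
runAs:
  user: jupyterhub
  group: jupyterhub
```

A build step could then emit service.sdl with something like `python -c "import sys, json, yaml; json.dump(yaml.safe_load(sys.stdin), sys.stdout, indent=2)" < service.sdl.yaml > service.sdl` (assuming PyYAML is installed).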

In a nutshell, CSDs are what most people interact with inside Cloudera Manager; a CSD is what allows one to configure, start, and stop a service. A parcel is essentially one big .tar.gz file with some metadata attached to it.
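To make the "big .tar.gz with metadata" concrete, here is a minimal sketch of packing a parcel. Names, the version, and the `el7` distro suffix are illustrative; a real build would also unpack the actual payload into the directory first:

```shell
#!/bin/sh
# A parcel is a tar.gz of a <NAME>-<VERSION>/ directory containing a
# meta/parcel.json descriptor, renamed with a .parcel extension.
set -e
NAME=JUPYTERHUB
VERSION=0.1.0
ROOT="${NAME}-${VERSION}"

mkdir -p "${ROOT}/meta"
cat > "${ROOT}/meta/parcel.json" <<EOF
{
  "schema_version": 1,
  "name": "${NAME}",
  "version": "${VERSION}"
}
EOF

# The payload (e.g. a conda-packed environment) would be unpacked
# under ${ROOT}/ before this step.
tar -czf "${NAME}-${VERSION}-el7.parcel" "${ROOT}"
```

Cloudera Manager distributes the resulting `.parcel` file to every node and unpacks it under the cluster-wide parcel directory.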

> • Does a parcel define where it is installed (e.g. /usr/lib/jupyterhub), or should the parcel be able to work from any install directory?

The location where parcels are installed is cluster-wide and controlled by the cluster admin; the default is `/opt/cloudera/parcels/<name>-<version>`. Ideally, a parcel should work from any directory. Realistically, I think it is sufficient to put a caveat in the documentation and provide a clear error message if the cluster configuration deviates from the default. It is not complicated to regenerate the parcel based off a different `PARCEL_ROOT`.
> • How much of the configuration can be handled by parcels/CSDs? Can the jupyterhub user be created with proxy-user permissions? Can it create keytabs automatically? Open ports?

The parcels define what users and groups CM should create before installation. The CSDs have complete control over JupyterHub's own configuration and its start and stop scripts. In a secure cluster, the JupyterHub service user would have Kerberos credentials generated automatically by Cloudera Manager. As for the proxy-user configuration, I don't recall whether it can be changed automatically by the CSD, but this is a well-known setting in CM (e.g. it is used by Hue), so we should be able to point the cluster admin to exactly where the change needs to be made. JupyterHub will not run as root, so the CSD will not change firewall rules; usually the nodes in a cluster are all allowed to talk to each other, so that should be okay. The additional requirement/suggestion is that JupyterHub or the ingress controller be run on a Cloudera gateway node.
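To make the users/groups piece concrete, a parcel's meta/parcel.json can declare the accounts Cloudera Manager should create on each host before activation. A hedged sketch (field values are illustrative, not from this repo):

```json
{
  "schema_version": 1,
  "name": "JUPYTERHUB",
  "version": "0.1.0",
  "users": {
    "jupyterhub": {
      "longname": "JupyterHub",
      "home": "/var/lib/jupyterhub",
      "shell": "/sbin/nologin",
      "extra_groups": []
    }
  },
  "groups": ["jupyterhub"]
}
```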

> Also, how could I be most helpful here?

At the moment, I would suggest reading the cm_ext wiki page so that when I open the PR things are not completely new to you.

Second, it would be good to think about the way forward for the spawned processes, i.e. do you want folks to conda-pack their own environments, or should we create a separate parcel?

Lastly, where would you want to host the .parcel and the CSD .jar? For my NiFi work I hosted them directly from the GitHub Releases page, which made installs relatively easy on non-airgapped networks.