kubeflow / community

Information about the Kubeflow community including proposals and governance information.
Apache License 2.0

Community Notebook Images Repo #344

Open thesuperzapper opened 4 years ago

thesuperzapper commented 4 years ago

As was discussed in the last few community meetings, there is a need for a new repo with community-supported Kubeflow Notebook images. For example: Jupyter, JupyterLab, RStudio, Visual Studio Code, Zeppelin, etc.

Basic Idea:

This issue is to discuss if/how this should be established.

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/jupyter 1.00

lalithvaka commented 4 years ago

@thesuperzapper, thank you for submitting this. I would store not only the images but also the Dockerfiles, for others to modify and use as they see fit. /cc @jlewi

jlewi commented 4 years ago

See kubeflow/kubeflow#1643

And see: kubeflow/kubeflow#2208

Instead of curating Jupyter images, what if we just fixed kubeflow/kubeflow#2208 so existing Jupyter images could run on Kubeflow?

thesuperzapper commented 4 years ago

@jlewi, that doesn't really fix things like RStudio, Visual Studio Code, etc

jlewi commented 4 years ago

@thesuperzapper Why not? Are there not existing Jupyter images for those use cases?

thesuperzapper commented 4 years ago

@jlewi those are literally not Jupyter, they are different IDEs.

Am I misunderstanding you?

jlewi commented 4 years ago

@thesuperzapper I guess what I meant is: aren't there already public Docker images for RStudio, Visual Studio Code, etc. out there?

My question is more along the lines of: do we as Kubeflow need to be in the business of curating images for different environments, IDEs, etc.? Or is it sufficient if we make it easy for people to bring their own image for their favorite web app?

thesuperzapper commented 4 years ago

@jlewi, there is nothing saying we can't use those public images as a base for our images.

Most of what our images would be doing is adding support for prefix rewriting (Kubeflow serves each notebook server under a URL path that includes the pod name), and setting configs which are optimised for running on K8S/Kubeflow.

For example, the most popular Docker image for R has no way of doing prefix rewriting, so my version uses an embedded Apache server with a proxy rewrite directive.

(However, I will look into using the R base images provided by RStudio themselves for the Kubeflow community repo.)
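To make the "public image as a base" idea concrete, here is a minimal sketch for the Jupyter case, where the server already has a base-URL option. The base image choice and the fallback value of `NB_PREFIX` are assumptions for illustration, and authentication settings are deliberately left out:

```dockerfile
# Hypothetical wrapper image: reuse a public image unchanged and only add
# the startup configuration needed to serve under a URL prefix.
ARG BASE_IMAGE=jupyter/base-notebook
FROM ${BASE_IMAGE}

# Assumption: Kubeflow injects NB_PREFIX (the URL path the notebook pod is
# served under) at runtime. The "/" default only lets the image start
# outside Kubeflow; a value set on the pod overrides it.
ENV NB_PREFIX=/

EXPOSE 8888

# Shell form inside CMD so that ${NB_PREFIX} is expanded at container start.
CMD ["sh", "-c", "jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.base_url=${NB_PREFIX}"]
```

Images for applications without such an option (like the R example above) would need a proxy in front instead, which is why they take more work than this.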

jlewi commented 4 years ago

@thesuperzapper it sounds like the problem you might be describing is different than the hosting of a repo.

It might be helpful to focus on 1 or 2 applications (e.g. R and arbitrary Jupyter images) and explore various ideas for how we could run them.

E.g. for path rewriting, maybe we can use Istio?

Building our own Docker images, even reusing existing images as base images, is less desirable IMO than making it easy for people to use their existing images out of the box.

thesuperzapper commented 4 years ago

@jlewi Is there really a negative to curating docker images? Without that, the platform is much harder to begin using for non-expert users. (+ we are already distributing the tensorflow images)

I honestly believe most people are beginning to use Kubeflow due to the notebook servers, so we should make this experience as easy as possible.

jlewi commented 4 years ago

@thesuperzapper The issue with curating Docker images is maintaining them over time and setting appropriate expectations. Who would actually be responsible for updating and shipping new images with each Kubeflow release? If there isn't a group of individuals responsible for maintaining them, then how do we ensure a good experience over time for users?

If you look at kubeflow/kubeflow#5060 you can see that as a community we are already struggling to find owners just for the two docker images we already have.

See also kubeflow/kubeflow#4789 regarding CD pipelines for the notebook images.

So I completely agree that having a rich set of docker images would be a huge plus. The question is figuring out how we do this in a sustainable way.

Building out the automation (e.g. kubeflow/kubeflow#4789) is probably a necessary pre-requisite to being able to scale this.

@thesuperzapper would you be interested in putting together a proposal for how we would sustainably maintain a set of curated images?

lalithvaka commented 4 years ago

Would it make sense to just post the curated Dockerfiles under one of the sample custom-image folders of the Kubeflow repo and let end users build and maintain images for their own needs, instead of publishing the Docker images and maintaining them ourselves? I work in healthcare, and as we create and use custom healthcare-related images, I could add them to the examples (e.g. NVIDIA Clara, NVIDIA TensorFlow, NVIDIA RAPIDS, MATLAB, etc.). Each of these may need different libraries, pip installs, and so on. But we may not need to build and maintain the images themselves, because that depends on the end user's requirements.

jlewi commented 4 years ago

@lalithvaka At that point are we just hosting git repos? If you want to publish a Dockerfile why not just create a repo lalithvaka/kubeflow-jupyter-image and publish your files there?

lalithvaka commented 4 years ago

@jlewi, we had challenges building custom images by following the Kubeflow docs, probably due to a lack of knowledge. I am suggesting that posting samples in the Kubeflow git repo and pointing to them from the docs would definitely help new users adopt Kubeflow and give them a boost.

thesuperzapper commented 4 years ago

@jlewi I am more than happy to make my own personal repo, but I think there are some benefits to having it under the kubeflow org:

  1. people will be more likely to discover it
  2. we can more easily include them in default configs with kubeflow releases
  3. better governance
  4. we can host the images on GCR (rather than Dockerhub)

jlewi commented 4 years ago

@thesuperzapper Do you want to put together a proposal per my earlier comment?

thesuperzapper commented 4 years ago

@jlewi, is there a template for such a proposal?

If not, what will it need to include, and who is the audience?

jlewi commented 4 years ago

You can look at https://github.com/kubeflow/community/tree/master/proposals

The proposal should include a plan for what you want to deliver; e.g. a community repo of jupyter images and how you plan to address the concerns I've raised above. In particular, how will we maintain the quality and keep the images up to date.

misteliy commented 4 years ago

@thesuperzapper do you have a working version of an RStudio Dockerfile for Kubeflow that you can share?

lalithvaka commented 4 years ago

@thesuperzapper any update on the proposal? Thank you.

jlewi commented 4 years ago

I think any progress is probably blocked by the formation of a WG to own existing assets. All new projects need to be sponsored by a WG.

kubeflow/community#379 is tracking the WG for notebooks.

thesuperzapper commented 4 years ago

Agree that Notebook WG should manage the Notebook images, and however they are released.

FYI, I put a JupyterLab (With Spark) image here: https://github.com/kubeflow/kubeflow/issues/5305#issuecomment-693115489

davidspek commented 4 years ago

A while back I created a Dockerfile that builds on jupyter/scipy-notebook and essentially recreates jupyter/datascience-notebook, with support for R and Julia 1.5. The Python version is 3.8, so if you use ml-metadata, version 0.24.0 is required. r-base is version 4.0.2 instead of 3.6.3; as a result rpy2 is not compatible and has been removed, and a few extra packages from the r-notebook image have been added. If r-base 3.6.3 with rpy2 is wanted, this can easily be changed in the Dockerfile. I believe this is a good starting point for anybody using the standard Jupyter notebook images, as they also use jupyter/scipy-notebook as their base image. Here is the Dockerfile for those who were having trouble using the standard Jupyter images:

```dockerfile
# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM $BASE_CONTAINER

LABEL maintainer="Jupyter Project <jupyter@googlegroups.com>"

# Set when building on Travis so that certain long-running build steps can
# be skipped to shorten build time.
ARG TEST_ONLY_BUILD

# Fix DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

USER root

# R pre-requisites
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    fonts-dejavu \
    gfortran \
    gcc \
    libnetcdf-* \
    udunits-bin \
    libudunits2-dev \
    netcdf-bin && \
    rm -rf /var/lib/apt/lists/*

# Julia dependencies
# install Julia packages in /opt/julia instead of $HOME
ENV JULIA_DEPOT_PATH=/opt/julia
ENV JULIA_PKGDIR=/opt/julia
ENV JULIA_VERSION=1.5.0

WORKDIR /tmp

# hadolint ignore=SC2046
RUN mkdir "/opt/julia-${JULIA_VERSION}" && \
    wget -q https://julialang-s3.julialang.org/bin/linux/x64/$(echo "${JULIA_VERSION}" | cut -d. -f 1,2)"/julia-${JULIA_VERSION}-linux-x86_64.tar.gz" && \
    echo "be7af676f8474afce098861275d28a0eb8a4ece3f83a11027e3554dcdecddb91 *julia-${JULIA_VERSION}-linux-x86_64.tar.gz" | sha256sum -c - && \
    tar xzf "julia-${JULIA_VERSION}-linux-x86_64.tar.gz" -C "/opt/julia-${JULIA_VERSION}" --strip-components=1 && \
    rm "/tmp/julia-${JULIA_VERSION}-linux-x86_64.tar.gz"
RUN ln -fs /opt/julia-*/bin/julia /usr/local/bin/julia

# Show Julia where conda libraries are \
RUN mkdir /etc/julia && \
    echo "push!(Libdl.DL_LOAD_PATH, \"$CONDA_DIR/lib\")" >> /etc/julia/juliarc.jl && \
    # Create JULIA_PKGDIR \
    mkdir "${JULIA_PKGDIR}" && \
    chown "${NB_USER}" "${JULIA_PKGDIR}" && \
    fix-permissions "${JULIA_PKGDIR}"

USER $NB_UID

# R packages including IRKernel which gets installed globally.

RUN conda install --quiet --yes \
    'r-base=4.0.2' \
    'r-caret=6.0*' \
    'r-crayon=1.3*' \
    'r-devtools=2.3*' \
    'r-forecast=8.12*' \
    'r-hexbin=1.28*' \
    'r-htmltools=0.5*' \
    'r-htmlwidgets=1.5*' \
    'r-irkernel=1.1*' \
    'r-nycflights13=1.0*' \
    'r-plyr=1.8*' \
    'r-randomforest=4.6*' \
    'r-rcurl=1.98*' \
    'r-reshape2=1.4*' \
    'r-rmarkdown=2.3*' \
    'r-rsqlite=2.2*' \
    'r-shiny=1.5*' \
    'r-tidyverse=1.3*' \
    'unixodbc=2.3.*' \
    'r-tidymodels=0.1*' \
    && \
    conda clean --all -f -y && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"

# Add Julia packages. Only add HDF5 if this is not a test-only build since
# it takes roughly half the entire build time of all of the images on Travis
# to add this one package and often causes Travis to timeout.
#
# Install IJulia as jovyan and then move the kernelspec out
# to the system share location. Avoids problems with runtime UID change not
# taking effect properly on the .local folder in the jovyan home dir.
RUN julia -e 'import Pkg; Pkg.update()' && \
    (test $TEST_ONLY_BUILD || julia -e 'import Pkg; Pkg.add("HDF5")') && \
    julia -e "using Pkg; pkg\"add IJulia\"; pkg\"precompile\"" && \
    # move kernelspec out of home \
    mv "${HOME}/.local/share/jupyter/kernels/julia"* "${CONDA_DIR}/share/jupyter/kernels/" && \
    chmod -R go+rx "${CONDA_DIR}/share/jupyter" && \
    rm -rf "${HOME}/.local" && \
    fix-permissions "${JULIA_PKGDIR}" "${CONDA_DIR}/share/jupyter"

WORKDIR $HOME

# Configure container startup
EXPOSE 8888
USER jovyan
ENTRYPOINT ["tini", "--"]
CMD ["sh","-c", "jupyter lab --notebook-dir=/home/${NB_USER} --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]
```
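For anyone who wants the r-base 3.6.3 variant with rpy2 mentioned above, the change would look roughly like the sketch below. The `rpy2=3.1*` pin is an assumption and has not been verified against current conda-forge builds. The R and Julia prerequisites from the full Dockerfile are omitted here, and the other `r-*` pins may need to be lowered to versions built for R 3.6.

```dockerfile
# Sketch only: swaps the R 4.0.2 pin for R 3.6.3 and adds rpy2 back.
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM $BASE_CONTAINER

USER $NB_UID

# Assumed pins: r-base 3.6.3 comes from the discussion above, the rpy2
# version is a guess at a release that was built against R 3.6.
RUN conda install --quiet --yes \
    'r-base=3.6.3' \
    'r-irkernel=1.1*' \
    'rpy2=3.1*' \
    && \
    conda clean --all -f -y && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"
```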

EDIT: I remember this not working when I made the above dockerfile a few months ago. However, now using the jupyter/datascience-notebook directly works as well.

```dockerfile
ARG BASE_CONTAINER=jupyter/datascience-notebook
FROM $BASE_CONTAINER

# Configure container startup
CMD ["sh","-c", "jupyter lab --notebook-dir=/home/${NB_USER} --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]
```

davidspek commented 4 years ago

@misteliy Here is a Dockerfile that has both VS Code and RStudio in it. It is based on the upstream datascience-notebook, so it includes Python, R and Julia.

```dockerfile
ARG BASE_CONTAINER=jupyter/datascience-notebook
FROM $BASE_CONTAINER

USER root

# Install VS Code (code-server)
RUN curl -fsSL https://code-server.dev/install.sh | sh

# Install RStudio dependency
RUN apt-get update -qq && \
    apt-get install -y --no-install-recommends \
    gdebi-core

# Pin the RStudio Server version to download
ENV RSTUDIO_VERSION=1.1.463

# Install RStudio and another dependency (libssl1.0.0), then remove the
# downloaded .deb files and the apt lists to keep the layer small
RUN wget http://archive.ubuntu.com/ubuntu/pool/main/o/openssl1.0/libssl1.0.0_1.0.2n-1ubuntu5.4_amd64.deb -O libssl1.0.0.deb && \
    dpkg -i libssl1.0.0.deb && \
    rm libssl1.0.0.deb && \
    curl -fsSL "https://download2.rstudio.org/rstudio-server-${RSTUDIO_VERSION}-amd64.deb" > /tmp/rstudio.deb && \
    apt-get install --no-install-recommends -y /tmp/rstudio.deb && \
    rm /tmp/rstudio.deb && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

USER $NB_UID

# Install jupyter proxies for VS Code and RStudio
RUN pip3 install jupyter-server-proxy && \
    pip3 install jupyter-vscode-proxy && \
    pip3 install jupyter-rsession-proxy && \
    jupyter labextension install @jupyterlab/server-proxy

# Configure container startup
CMD ["sh","-c", "jupyter lab --notebook-dir=/home/${NB_USER} --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]
```

Instructions regarding VS code installation I got from here. They might be useful for other applications as well.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

davidspek commented 3 years ago

/lifecycle frozen