googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.

Poor support of Datalab on Dataproc #1976

Open jimbojetlag opened 6 years ago

jimbojetlag commented 6 years ago

After following the instructions at https://cloud.google.com/dataproc/docs/tutorials/dataproc-datalab I was able to set up Datalab notebooks with access to a Hadoop cluster. But then I noticed that the main benefits of Datalab, such as persisting notebook data on disk and backing it up to Cloud Storage, are absent. Please correct me if this is not true.

If this is the case, what is the point of that tutorial? Is it just a proof of concept? No serious work can be done when the notebooks cannot be persisted.

Is there a workaround for this?

jimbojetlag commented 6 years ago

To add to the previous problem, there is also the lack of git support. This has been asked on Stack Overflow, and so far there has been no reply from the team:

https://stackoverflow.com/questions/46416967/no-datalab-repository-created-when-using-datalab-init-script-to-create-dataproc#comment80330792_46416967

At this point, I'm not sure how to persist and share Datalab notebooks on Dataproc.

evanyeatts commented 6 years ago

Enabling the automatic Cloud Storage snapshotting is actually quite straightforward (though not documented). First, ensure that the Dataproc cluster is given the appropriate --scopes to create buckets and write data out to Cloud Storage. Then you can add:

ENV DATALAB_SETTINGS_OVERRIDES='{"enableAutoGCSBackups": true}'

to the Dockerfile injection here: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/datalab/datalab.sh#L92
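
For concreteness, a minimal sketch of the cluster-side half (the cluster name and bucket path are placeholders; cloud-platform is the broad scope, and storage-rw should also cover creating and writing to buckets):

# Create the cluster with scopes that allow bucket creation and writes:
gcloud dataproc clusters create example-cluster \
    --scopes cloud-platform \
    --initialization-actions gs://my-bucket/datalab/datalab.sh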

We maintain a "fork" of the initialization action, which we keep in one of our Cloud Storage buckets, and reference that instead of the one Google hosts.
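
In practice that just means uploading the modified script and pointing new clusters at it (my-bucket is a placeholder):

gsutil cp datalab.sh gs://my-bucket/datalab/datalab.sh

New clusters then pass gs://my-bucket/datalab/datalab.sh to --initialization-actions, as in the sketch above.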

evanyeatts commented 6 years ago

You can also leverage the gcloud CLI installed in the Datalab image to clone and bootstrap Cloud Source Repositories. Then you can commit, push, etc. to that repo via Ungit in the Datalab UI, allowing you to source-control your notebooks as you would with a managed Datalab instance. You can do this via a similar addition to the Dockerfile injection:

gcloud source repos clone foobar /content/foobar
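
In the injected Dockerfile that becomes a RUN step; a hedged sketch, assuming the VM's service account can reach the repo at image-build time (foobar is just the example repo name):

# Alongside the other lines in the Dockerfile here-doc in datalab.sh:
RUN gcloud source repos clone foobar /content/foobar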

One trick is to have an initialization action that creates a bash script on the Dataproc machine, then pull that script into the image and run it via the Dockerfile injection. Have the bash script do the extra steps, like bootstrapping Cloud Source Repositories. Then you only have to make changes to that script.

# /tmp/setup.sh is created by a separate initialization action
ADD /tmp/setup.sh /tmp/setup.sh
RUN /tmp/setup.sh
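
A hedged sketch of that separate initialization action (file paths and the repo name are illustrative); list it before datalab.sh in --initialization-actions so /tmp/setup.sh exists when the image is built:

#!/bin/bash
# Runs on the Dataproc VM before datalab.sh builds the image; writes the
# setup script that the Dockerfile injection above will ADD and RUN.
cat << 'EOF' > /tmp/setup.sh
#!/bin/bash
set -e
# Extra bootstrap steps, e.g. cloning a Cloud Source repo into /content:
gcloud source repos clone foobar /content/foobar
EOF
chmod +x /tmp/setup.sh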