Open jimbojetlag opened 6 years ago
To add to the previous problem, there is also the lack of git support. This has been asked on Stack Overflow, and so far there has been no reply from the team:
At this point, I'm not sure how to persist and share Datalab notebooks on Dataproc.
Enabling the automatic Cloud Storage snapshotting is actually quite straightforward (though not documented). First, ensure that the Dataproc cluster is given appropriate --scopes to create buckets and write data out to Cloud Storage. Then you can add:
ENV DATALAB_SETTINGS_OVERRIDES='{"enableAutoGCSBackups": true}'
to the Dockerfile injection here: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/datalab/datalab.sh#L92
We maintain a "fork" of the initialization action which we keep in one of our Cloud Storage buckets and reference that instead of the one Google hosts.
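For anyone trying this, here is a rough sketch of creating a cluster that has write access to Cloud Storage and points at a forked copy of the initialization action. The cluster name, region, and bucket path are made-up placeholders, and `cloud-platform` is just one scope alias that covers Cloud Storage writes; a narrower scope may be enough for your setup. The `GCLOUD` variable is only there so the command can be printed instead of run.

```shell
#!/bin/sh
# Sketch only: the cluster name, region and bucket path are hypothetical.

create_cluster() {
  # $1 = cluster name, $2 = gs:// URI of the forked init action, $3 = region
  # GCLOUD defaults to the real CLI; set GCLOUD=echo to print the command.
  ${GCLOUD:-gcloud} dataproc clusters create "$1" \
    --region "$3" \
    --scopes cloud-platform \
    --initialization-actions "$2"
}

# Dry run: prints the arguments gcloud would receive, without calling gcloud.
GCLOUD=echo create_cluster datalab-cluster gs://my-bucket/datalab/datalab.sh us-central1
```

Drop the `GCLOUD=echo` prefix to create the cluster for real.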
You can also use the gcloud CLI installed in the Datalab image to clone and bootstrap Cloud Source repositories. Then you can commit to that repo via Ungit in the Datalab UI, which lets you source-control your notebooks as you would with a managed Datalab instance. You can do this via a similar addition to the Dockerfile injection:
gcloud source repos clone foobar /content/foobar
One trick is to have an initialization action that creates a bash script on the Dataproc machine, then pull in the script and run it via the Dockerfile injection. Have that bash script do the extra steps, such as bootstrapping Cloud Source repositories. Then you only have to make changes to that script.
# create /tmp/setup.sh in a separate initialization action
ADD /tmp/setup.sh /tmp/setup.sh
RUN /tmp/setup.sh
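As an illustration of that separate initialization action, a minimal sketch might look like the following. It only writes the setup script; the repository name `foobar` and the `/content/foobar` path are taken from the example above, while the function name and the "skip if already cloned" check are my own assumptions, not something from the Datalab initialization action itself.

```shell
#!/bin/sh
# Sketch of a separate initialization action that writes the setup script
# which the Dockerfile injection later ADDs and RUNs.
# SETUP_SCRIPT defaults to the path used in the Dockerfile lines above.
SETUP_SCRIPT="${SETUP_SCRIPT:-/tmp/setup.sh}"

write_setup_script() {
  # $1 = path of the script to create.
  # The quoted heredoc delimiter keeps the body literal (nothing expands here).
  cat > "$1" <<'EOF'
#!/bin/sh
set -e
# Clone the Cloud Source repository into the Datalab content directory,
# skipping the clone if a checkout is already present.
if [ ! -d /content/foobar/.git ]; then
  gcloud source repos clone foobar /content/foobar
fi
EOF
  chmod +x "$1"
}

write_setup_script "$SETUP_SCRIPT"
```

With this in place, extra bootstrapping steps go into the heredoc body, and the Dockerfile injection itself stays unchanged.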
After following the instructions at https://cloud.google.com/dataproc/docs/tutorials/dataproc-datalab I was able to set up Datalab notebooks with access to a Hadoop cluster. But then I noticed that the main benefits of Datalab, such as persisting notebook data on disk and backups to Cloud Storage (gs://), are absent. Please correct me if this is not true.
If this is the case, what is the point of that tutorial? Is it just a proof of concept? No serious work can be done when the notebooks cannot be persisted.
Is there a workaround for this?