pangeo-data / helm-chart

Pangeo helm charts
https://pangeo-data.github.io/helm-chart/
21 stars 26 forks source link

Clone external notebooks repo #44

Closed martindurant closed 6 years ago

martindurant commented 6 years ago

Instead of baking notebooks into the docker image, download from a notebook-only repo upon container instantiation.

Notebook names are tagged with commit date, so that edits by a user do not get overwritten, then just get more notebooks.

martindurant commented 6 years ago

Potential fix for https://github.com/pangeo-data/pangeo/issues/242

martindurant commented 6 years ago

(note that the repo is on my github - naturally it can be moved)

rabernat commented 6 years ago

This looks perfect! I just added you to the pangeo-data org so you can create the examples repo here.

rabernat commented 6 years ago

Have you tested this in a real cluster? If so, can you tell me what you see in your examples directory?

martindurant commented 6 years ago

This seems to do the right thing when building and running the image locally. I can't think of how the bash-foo would fail, but there may be corner cases, wouldn't mind someone to have a look. Since the tags are by date, is there a chance of an update overwriting a user-edited notebook on a particular day.

jovyan@aea4f113b634:~$ ls examples/
cm26_2018-06-18.ipynb                 PROVIDED_EXAMPLE_NOTEBOOKS.md
custom-computations_2018-06-18.ipynb  scikit-image-example_2018-06-18.ipynb
dask-array_2018-06-18.ipynb           sea-surface-height_2018-06-18.ipynb
machine-learning_2018-06-18.ipynb     xarray-data_2018-06-18.ipynb
martindurant commented 6 years ago

Is the testing path for images to build custom image snapshots, push to dockerhub and configure the chart to point to that?

rabernat commented 6 years ago

Is the testing path for images to build custom image snapshots, push to dockerhub and configure the chart to point to that?

Yes, but that is probably overkill here. I would probably just merge this directly and update the cluster.

Does anyone else have thoughts on this way of structuring the examples directory? @jhamman?

martindurant commented 6 years ago

OK, tested on a real cluster http://35.232.241.195/, also got

jovyan@jupyter-martindurant:~$ ls examples/
cm26_2018-06-18.ipynb                 dask-array_2018-06-18.ipynb        PROVIDED_EXAMPLE_NOTEBOOKS.md          sea-surface-height_2018-06-18.ipynb
custom-computations_2018-06-18.ipynb  machine-learning_2018-06-18.ipynb  scikit-image-example_2018-06-18.ipynb  xarray-data_2018-06-18.ipynb
jovyan@jupyter-martindurant:~$

(the dates refer to the latest commit, yesterday)

mrocklin commented 6 years ago

So over time if I log in frequently will I eventually get tons of notebooks in my examples directory? Does anyone have thoughts on how to avoid this?

On Tue, Jun 19, 2018 at 5:19 PM, Martin Durant notifications@github.com wrote:

OK, tested on a real cluster, also got

jovyan@jupyter-martindurant:~$ ls examples/ cm26_2018-06-18.ipynb dask-array_2018-06-18.ipynb PROVIDED_EXAMPLE_NOTEBOOKS.md sea-surface-height_2018-06-18.ipynb custom-computations_2018-06-18.ipynb machine-learning_2018-06-18.ipynb scikit-image-example_2018-06-18.ipynb xarray-data_2018-06-18.ipynb jovyan@jupyter-martindurant:~$

(the dates refer to the latest commit, yesterday)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/helm-chart/pull/44#issuecomment-398549633, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszI65na05ADLnPTmuWrvrU7zR06ksks5t-WrOgaJpZM4UsA64 .

martindurant commented 6 years ago

Yes, you'll get new ones when the examples repo is updated (not frequent) if the setup script gets run, which I think is whenever the container is recreated.

rabernat commented 6 years ago

I would prefer if the date tag were from the latest commit when that specific notebook was last updated. So a new date tag only appears for the same notebook if that example is actually modified upstream. (That’s how I thought this would work, but perhaps I misunderstood.)

We definitely want to avoid filling up the directory with lots of identical duplicate notebooks.

Sent from my iPhone

On Jun 19, 2018, at 5:23 PM, Martin Durant notifications@github.com wrote:

Yes, you'll get new ones when the examples repo is updated (not frequent) if the setup script gets run, which I think is whenever the container is recreated.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

martindurant commented 6 years ago

when that specific notebook was last updated

No, indeed, that's not what it was, but I am changing it, just a moment.

mrocklin commented 6 years ago

I still think it might be reasonable to overwrite known files every time, and include a loud DONT_SAVE_DATA_HERE.md file in the directory. This is a simple solution that removes burden from of the maintainers. It's not fancy and is unlikely to break. I accept that this may be too annoying to users to accept, but I wonder if there is another non-fancy solution that we could do that would be less likely to need to be tweaked in the future by one of us.

On Tue, Jun 19, 2018 at 5:30 PM, Ryan Abernathey notifications@github.com wrote:

I would prefer if the date tag were from the latest commit when that specific notebook was last updated. So a new date tag only appears for the same notebook if that example is actually modified upstream. (That’s how I thought this would work, but perhaps I misunderstood.)

We definitely want to avoid filling up the directory with lots of identical duplicate notebooks.

Sent from my iPhone

On Jun 19, 2018, at 5:23 PM, Martin Durant notifications@github.com wrote:

Yes, you'll get new ones when the examples repo is updated (not frequent) if the setup script gets run, which I think is whenever the container is recreated.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/helm-chart/pull/44#issuecomment-398552727, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszJ-tKleH7P-XYzlnHJd9Y1LMeD_4ks5t-W1qgaJpZM4UsA64 .

martindurant commented 6 years ago

I'm not sure I would call what I've done so far "fancy"... the code copies in some notebook files, and the point really was to separate the examples into a separate repo, so we can edit them separately. It happens to add a date-tag, and there's a possibly informative empty file in the directory.

mrocklin commented 6 years ago

I'm entirely in favor of moving the examples to a separate repo and copying them in. I'm somewhat nervous about appending a date, because I anticipate that eventually it'll get annoying and ugly to have very many example notebooks that aren't helpful and we'll have to come up with another solution.

I think it's simpler long-term to just overwrite notebook files in that directory. I think that users will adapt, much as they've adapted to having their local software environment nuked every time they get a new container. I think that eventually users will learn to create a copy of the provided examples if they want to adapt one long term. I think that we can help them with that with an informative and brief README file.

On Tue, Jun 19, 2018 at 5:47 PM, Martin Durant notifications@github.com wrote:

I'm not sure I would call what I've done so far "fancy"... the code copies in some notebook files, and the point really was to separate the examples into a separate repo, so we can edit them separately. It happens to add a date-tag, and there's a possibly informative empty file in the directory.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/helm-chart/pull/44#issuecomment-398557013, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszK9dK8hrNx1k-THMOCv-kAnNEd_5ks5t-XFggaJpZM4UsA64 .

martindurant commented 6 years ago

For my part, I am happy either way, since I am a newcomer to pangeo and don't have a good feel for what workflow would go smoothly and what wouldn't - happy to defer the decision to others.

If we are to have single copies and overwriting, I would suggest at least creating a new, empty directory "work". Btw: I think all of these file-overview issues would work much better if the files pane in jupyter-lab were tree-like rather than double-click-and-breadcrumb. I don't know if there is an issue for that.

jhamman commented 6 years ago

Just dropping in to say I'm a fan of overwriting these examples every time (this is the binder approach). I think putting a warning README like @mrocklin suggests is sufficient.

I'm entirely in favor of moving the examples to a separate repo and copying them in.

Me too!

What about cloning this example repo on startup:

rm -rf examples
git clone https://github.com/pangeo-data/pangeo_examples examples
echo "Files in this directory should be treated as read-only"  > examples/DONT_SAVE_DATA_HERE.md
martindurant commented 6 years ago

That's two votes - any further comments?

martindurant commented 6 years ago

So, any feeling on making the files truly read-only (so you get an error if you edit and try to save), or have a "You should save as" message at the top?

jhamman commented 6 years ago

This may lead to unsavory errors when jupyter autosaves files. Also, users may want commit/push changes to the examples.

rabernat commented 6 years ago

I think I am biased towards the idea of date tagging because I personally have augmented the existing example notebooks with quite a bit of new code. I would be upset if these were overwritten because I would lose my work. But I think I am not a typical user, and I should probably stop doing that. A typical user is probably just creating new notebooks rather than modifying the examples heavily.

The ideal solution would be to have the example files be read only, and then have Jupyter itself prompt the user to "save as" if they are modified. This is not yet supported in Jupyter:

The best interim solution is probably just to make the example notebooks "read only".

I feel bad that I encouraged @martindurant to develop this date tagging stuff, which now seems to be contrary to what people want. Martin, I appreciate the time you spent on this feature!

martindurant commented 6 years ago

No worries, @rabernat , often we don't get a sense for how a workflow will feel without seeing it in action.

martindurant commented 6 years ago

Also https://github.com/jupyterlab/jupyterlab/issues/4729

martindurant commented 6 years ago

Things have gone quiet here. As things stand, files in examples/ would be overwritten, there is a warning file, but nothing to prevent users from editing. The point was to move the examples into their own repo, which has been done.

jacobtomlinson commented 6 years ago

This is a really nice feature!

It would be great to have some ability to customise it. For example we could add two environment variables, one to enable/disable this functionality and one to set the git repo url.

By default I would expect it to be enabled and pointing at the official examples directory, but it would be great to be able to disable it or point to an alternative repo.

martindurant commented 6 years ago

@jacobtomlinson , for those that know what they are doing, I would have thought it reasonable to expect real git commands in the terminal, no?

jacobtomlinson commented 6 years ago

@martindurant I'm not sure I follow. I would like to be able to customize this functionality when I install "a pangeo". The best way to do this is via environment variables.

martindurant commented 6 years ago

Sorry, I see what you mean - I was thinking of those wanting to, for example, push changes to some forked examples-notebook repo. I daresay the repo origin URL can be configurable. I don't know about disabling, though - what should happen in that case?

martindurant commented 6 years ago

@jacobtomlinson , I was about to implement your suggestion, but it occurred to me: what happens if the two repos do not share enough common history? Assuming a git remote set-url command to a value specified by the config, would @jhamman 's git commands still be guaranteed to run?

martindurant commented 6 years ago

I have implemented @jhamman 's suggestion, and allowed for a new environment variable, EXAMPLES_GIT_URL which, if given, will be the source of the notebooks. If the directory already contains stuff, then the initial clone will fail (which is fine) and you should only get trouble if you initially clone from one repo and later provide a different URL, and the two repos cannot merge.

martindurant commented 6 years ago

I should clarify: if the examples directory contains a git repo from before, this will update it; if it contains work but no git repo (because it is from an earlier time before this PR is merged), then nothing happens at all, and the user will have to do any git manipulation on their own.

martindurant commented 6 years ago

Any further thoughts here? This PR is getting old...

martindurant commented 6 years ago

OK, @jhamman , I can make those changes - I had been happy just to let the commands fail.

jhamman commented 6 years ago

I am satisfied with the current implementation. Any opposition to moving this to master and setting it loose?

mrocklin commented 6 years ago

In favor of moving forward generally

On Thu, Jul 12, 2018 at 4:01 PM, Joe Hamman notifications@github.com wrote:

I am satisfied with the current implementation. Any opposition to moving this to master and setting it loose?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pangeo-data/helm-chart/pull/44#issuecomment-404649367, or mute the thread https://github.com/notifications/unsubscribe-auth/AASszNnKiiZ4WAUY3tGYxcTiADRD5zT_ks5uF7kSgaJpZM4UsA64 .