googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

Import "local" files while executing remotely #1181

Open dovy opened 7 years ago

dovy commented 7 years ago

Method: Gateway mode

It would be really useful to be able to import files (or other notebooks) from the local filesystem and then execute them remotely. Currently, all interaction within cells is limited to what's on the cluster/remote machine. If there were a way to use local files, that would be fantastic.

Ideas?

nikhilk commented 7 years ago

There are some ideas being discussed around mounting the local file system within the VM/container, which would enable this -- we'd have to do a few things to make that happen, and there are also some perf/latency concerns to validate.

Just to collect more concrete info on the motivations... is this because you have notebooks that aren't in a git repo, and/or because having your git repo connected with Cloud Source Repositories is a non-starter in your case? Or is there some other motivation?

dovy commented 7 years ago

We're working to develop a common workflow for scientists and engineers that allows for a unified execution experience. Our current plan is as follows:

The goal is essentially to develop locally, test against a cluster, and eventually run a notebook (or a previously exported Python file) as a job. Right now I've been testing the Jupyter-to-Python export, but it fails if a "magic" is used. See #1179
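
For reference, a minimal sketch of that export step using nbconvert's Python API (the notebook filename is a placeholder, and the exact failure mode with magics is the subject of #1179):

```python
# Minimal sketch of the Jupyter -> Python export step (assumes nbconvert is installed).
# Cells that use magics are typically emitted as get_ipython().run_cell_magic(...)
# calls, which do not work when the exported script runs outside an IPython session.
from nbconvert.exporters.script import ScriptExporter

exporter = ScriptExporter()
source, resources = exporter.from_filename("analysis.ipynb")  # placeholder notebook name

with open("analysis" + resources.get("output_extension", ".py"), "w") as f:
    f.write(source)
```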

So the desire to use local files is for testing and development, while for production the plan is to package the common library up and submit it as a job to Dataproc. ;)
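
A sketch of what that production submission could look like, as notebook cells shelling out to the gcloud/gsutil CLIs (bucket, cluster, region, and file names are placeholders, not taken from this thread):

```python
# Hypothetical notebook cells: package the common library and submit the exported
# script as a PySpark job to Dataproc. All names below are placeholders.
!zip -r common_lib.zip common_lib/
!gsutil cp common_lib.zip job.py gs://my-bucket/jobs/
!gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/job.py --cluster=my-cluster --region=us-central1 --py-files=gs://my-bucket/jobs/common_lib.zip
```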

Now that you see our desire... I'd love to find a way to have local imports work so I can inherit from other notebooks as well as from a common library. ;)

nikhilk commented 7 years ago

Your scenarios make sense -- in fact, they line up nicely with things on our roadmap, especially dev/test/deploy.

In terms of the current set of capabilities, I wonder if some (maybe most?) of your scenarios could be accomplished by including the repositories containing the common library pieces as git submodules within the repository containing the notebooks, or by pip-installing those libraries within the Datalab environment?
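
A rough sketch of those two suggestions as they might look from inside a notebook environment (repository URL, paths, and package name are placeholders):

```python
# Hypothetical notebook cells illustrating the two suggested approaches.
# Repository URL, paths, and package name are placeholders.

# Option 1: vendor the common library into the notebook repo as a git submodule,
# then make it importable by putting it on sys.path.
!git submodule add https://github.com/example-org/common-lib.git libs/common_lib
!git submodule update --init --recursive
import sys
sys.path.insert(0, "libs/common_lib")

# Option 2: pip-install the library (assuming it is packaged with a setup.py)
# into the environment that runs the kernels.
!pip install libs/common_lib

import common_lib  # placeholder package name
```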

On the topic of magic -> deployable code, our early thinking has been to have a well-defined way to determine which parts of a notebook are deployable, and to have magics produce deployable-code equivalents as their output, so that the notebook -> code conversion tool can pick them up.
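
One possible shape for that idea, sketched with IPython's public magic API. This is not Datalab's implementation; the magic name, the registry, and the generated BigQuery client code are illustrative assumptions only.

```python
# Sketch only -- not Datalab's design. A toy cell magic that runs a query and also
# records a plain-Python equivalent (using the standard BigQuery client library)
# that a notebook -> code conversion tool could emit in place of the magic call.
from IPython import get_ipython
from IPython.core.magic import register_cell_magic

deployable_equivalents = {}  # hypothetical registry: result name -> generated source


@register_cell_magic
def sql_job(line, cell):
    """Usage: %%sql_job result_name  (the cell body is the SQL text)."""
    name = line.strip() or "result"
    equivalent = (
        "from google.cloud import bigquery\n"
        "client = bigquery.Client()\n"
        f"{name} = list(client.query({cell!r}).result())\n"
    )
    deployable_equivalents[name] = equivalent
    # Execute the equivalent interactively so the notebook behaves as usual.
    exec(equivalent, get_ipython().user_ns)
```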

dovy commented 7 years ago

@nikhilk I had thought of that, and that's what I was planning to do. However, that removes the ability to update and test those libraries without committing to the remote repo, which would trigger the CI testing. If possible, I'd love a more elegant solution that doesn't require useless commits of code that might not be ready to commit.

And yes, your plan was exactly along my line of thinking. I don't imagine it would be too difficult to accomplish this, but it would require a custom extension of sorts for the output.

Another idea I had was to allow Python export on save (as an option for Datalab) so that one wouldn't have to invoke the export mechanism manually. This would also allow for easier code-revision comparison, since notebooks are not exactly diff-friendly given all the metadata they contain. I'm thinking of modifying my cluster-creation script to include that, but it would be better served as an option in the core of Datalab, I'd wager. ;)
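
For what it's worth, a hedged sketch of the "export .py on save" idea, written as a standard Jupyter post-save hook in `jupyter_notebook_config.py` (Datalab would need to expose an equivalent configuration point for this to apply there):

```python
# Sketch of export-on-save via the Jupyter contents-manager post_save_hook,
# placed in jupyter_notebook_config.py (where the `c` config object is defined).
import os
from nbconvert.exporters.script import ScriptExporter

_exporter = ScriptExporter()

def export_script_post_save(model, os_path, contents_manager, **kwargs):
    """After each notebook save, write a sibling .py file for diff-friendly reviews."""
    if model["type"] != "notebook":
        return
    source, resources = _exporter.from_filename(os_path)
    base, _ = os.path.splitext(os_path)
    script_path = base + resources.get("output_extension", ".py")
    with open(script_path, "w", encoding="utf-8") as f:
        f.write(source)
    contents_manager.log.info("Exported %s", script_path)

c.FileContentsManager.post_save_hook = export_script_post_save
```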