jupyter / nbformat

Reference implementation of the Jupyter Notebook format
http://nbformat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Capture more of the kernel spec in the notebook document #49

Open parente opened 8 years ago

parente commented 8 years ago

The kernelspec name and display_name are currently the only pieces of information captured in the notebook document about the kernel and environment in use when the notebook was last saved. Has capturing more of the information available to the kernel spec manager ever been considered?

I ask because info such as the command used to launch the kernel and environment variables can be useful in studying what configuration was used to execute a notebook document. For instance, knowing the binary executed by spec name "my_special_kernel" is more informative than the name alone.

I do see that blindly copying everything from kernel.json files or the equivalent into a notebook is a non-starter: there might be secrets or other bits of information users don't want captured and leaking out with their shared notebooks. That said, capturing a subset of the information might be possible.

parente commented 8 years ago

/cc @ericdill

rgbkrk commented 8 years ago

This is definitely good to start questioning. You hit on exactly why some bits aren't persisted - env or paths that should be secret/hidden.

minrk commented 7 years ago

These are the things I would like to avoid in the notebook document:

In particular, I want to keep the relationship between the notebook and the details of the kernelspec loose. To me, it should be as close as possible to "This is a Python notebook".

Of course, the advantage of including more info in the notebook means that a notebook is more likely to be portable/reproducible. But it's also sending assumptions about my system to yours, which has the opposite effect in practice (You don't have ~/minrk/envs/foo/bin/python, but you do have kernelspec python3, which is all you should need).

I'm not sure that helps make any decisions, but it's a dump of my current thoughts on the matter.

takluyver commented 7 years ago

I broadly agree with Min, but I think it's worth exploring some way to connect notebooks to environments other than exposing environments as kernelspecs:

westurner commented 7 years ago

Is this part of a broader issue of reproducibility?

From https://wrdrd.com/docs/consulting/education-technology#jupyter-and-reproducibility :

"Ten Simple Rules for Reproducible Computational Research": http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

Rule 3: Archive the Exact Versions of All External Programs Used

[...] pip freeze, conda env export, dpkg-query -l, sys.path (python -m site)

As for the kernelspec, the Python interpreter (cpython, pypy, ...) and the major.minor.patch version could be relevant to reproducibility.

Practically, a Dockerfile with pinned versions (with pip freeze or requirements.in (pip-tools)) is probably most reproducible. Otherwise, version_information includes interpreter information.
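The interpreter details mentioned above can be captured from a running kernel with the standard library alone. This is only an illustrative sketch; `env_info` is a hypothetical structure, not part of any nbformat spec, and note that `executable` is exactly the kind of machine-specific path the privacy caveats earlier in this thread warn about embedding.

```python
# Sketch: capture interpreter implementation and major.minor.patch version,
# the reproducibility-relevant bits named above. Hypothetical field names.
import platform
import sys

env_info = {
    "implementation": platform.python_implementation(),  # e.g. "CPython" or "PyPy"
    "version": platform.python_version(),                # "major.minor.patch"
    "executable": sys.executable,                        # machine-specific; see privacy caveats above
}
print(env_info)
```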

westurner commented 7 years ago

This is a problem package systems should be solving, not us.

rgbkrk commented 7 years ago

This makes me think that we should try to support environments as a top-level field, with the limited subset of supported environment types specified under it. One requirement would be that conda declares a spec for this metadata. There would be a similar case for virtual environments.

{
  "environment": {
    "type": "conda",
    "payload": {
      "version": "1.2.1", // version of the _spec_ for this environment
      "packages": []
    }
  }
}

We'd require coordination on maintaining these specifications.

rgbkrk commented 7 years ago

I'm picking poor field names here, just trying to illustrate that we allow a dedicated expansion area for this metadata.

takluyver commented 7 years ago

metadata is an expansion area - it would be fine for an extension to use something like metadata.conda_env to specify something like that. I'm not sure that we gain anything by providing a metadata.environment namespace.

rgbkrk commented 7 years ago

While I realize it's expansion area, my desire is to have specifications / common ground - I want consistency.

rgbkrk commented 7 years ago

Motivation: other notebook environments like Zeppelin and Databricks are able to get better adoption because they package up, out of the box, a JAR for a "notebook" which can be scheduled on a cluster. If we at least leave ground for some of the common ones, people can evolve ones that fit their environment better.

minrk commented 7 years ago

Historically, we've described this kind of information as out-of-scope for the notebook itself, in part because it's already solved well by an environment.yml (or requirements.txt, Dockerfile, etc.) in the top-level directory (see binder, for example). In general, the sharing unit for notebooks is not the notebook, it is the directory/repository. That's why we won't do things like bundle data files, dependencies, scripts into one big bundle.

For instance, if I have a tutorial with a bunch of notebooks, I only want to specify and update the environment in one place. Having to update every notebook with the same information would be a big pain.

Counterexample: one repo with notebooks describing different dependencies, where each notebook actually does want to specify its own fine-grained dependencies, and those dependencies do vary quite a bit (e.g. union of all dependencies in one environment.yml is insufficient).

Counter-counterexample: Directories also solve this problem, as an environment.yml per distinct environment allows separating the granularity from the notebooks themselves.

I think it's great to explore what can be done with something like this, but I struggle to think of a scenario where I would recommend that someone actually use it instead of using a 'normal' environment specification.

@rgbkrk to your point of more easily deploying things, locating the repo-level environment.yml should be just as doable as the notebook-level env metadata, and might be worth exploring in addition. Setting aside where the information came from (a totally separate question, I think), I guess the high level goal is "given a notebook and dependencies, run the notebook somewhere that definitely has those dependencies". Binder already is this, of course, but building a mini version into Jupyter (or JupyterHub or kernel gateway) seems valuable.

takluyver commented 7 years ago

Maybe it would be useful to have some kind of tool that could create an environment from a specification file and mark a set of notebooks (e.g. notebooks in the same directory as the spec file) as using that environment. At present, you either have to run the notebook server inside the newly created environment (totally possible, but not always convenient), or manually select the relevant kernel spec for each one.

I think this loops back to my thought before: it would be good to have some way to locally associate notebooks with a given environment. This feels like a separate notion from the kernelspec, which is meant to be relevant across machines. I'm not sure where this would be stored or how the interface would work, though.

rgbkrk commented 7 years ago

For instance, if I have a tutorial with a bunch of notebooks, I only want to specify and update the environment in one place. Having to update every notebook with the same information would be a big pain.

This is actually why I want to have it be automatic and easy.

There's no way for me to tell that multiple notebooks belong to the same environment, programmatically.

some kind of tool that could create an environment from a specification file and mark a set of notebooks

I like this - it means at the very least that the notebook can have metadata that specifies the local file with dependencies.

I'm not sure where this would be stored or how the interface would work, though.

Me either.

parente commented 7 years ago

What about a simple spec like the following that is extensible in type, and can capture the definition of an environment for that type either internally or externally:

{
  "environment": {
    "type": "conda", // or virtualenv or docker or ...
    "definition": {
      "external": "https://github.com/spam/eggs/blob/environment.yaml", // URI (local or remote) of a Dockerfile, environment.yaml, requirements.txt, some JAR, ... whatever is appropriate for the type
      "internal": "" // alternative to external, captures the content of the URI
    }
  }
}

Substitute whatever key names meet your fancy. (Naming is hard.)

Even without all of the types spec'ed up front, a human or tool can look for the environment object to get more information about what a notebook needs to run than is possible today. Over time, if there's agreement on what the set of types should be and what they should reference (e.g., type: docker -> a Dockerfile) then tools can be written to store the appropriate information.
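To make the reading side concrete, here is a minimal sketch of how a tool might consume the strawman above. The field names (`environment`, `definition`, `internal`, `external`) come from the proposal in this thread and are not a ratified spec; `resolve_environment` is a hypothetical helper.

```python
# Sketch of a reader for the proposed "environment" metadata: prefer the
# embedded ("internal") content when present, else fall back to the URI.
import json

notebook_metadata = json.loads("""
{
  "environment": {
    "type": "conda",
    "definition": {
      "external": "https://github.com/spam/eggs/blob/environment.yaml",
      "internal": ""
    }
  }
}
""")

def resolve_environment(metadata):
    """Return (type, source_kind, value) for an environment declaration, or None."""
    env = metadata.get("environment")
    if env is None:
        return None
    definition = env.get("definition", {})
    if definition.get("internal"):
        return (env["type"], "internal", definition["internal"])
    return (env["type"], "external", definition.get("external"))

print(resolve_environment(notebook_metadata))
```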

parente commented 7 years ago

7 day bump: Anyone have thoughts about the approach in the last comment?

rgbkrk commented 7 years ago

Could internal be any JSON Object, string, array, etc.?

Now that I've stewed on it and you've bumped it, I'm leaning towards thinking this is a reasonable approach, especially since you can specify a resource external to the notebook document.

parente commented 7 years ago

Could internal be any JSON Object, string, array, etc.?

In the strawman above, I noted that internal would be the captured content of external which would make it a string-only and somewhat opaque field. This is simple in that we don't have to spec out the data type of internal per "type" key and readers can always just treat it as a string. The downside, of course, is that readers need to parse the internal value if they don't want to treat it as opaque (i.e., something you just pass on to conda, or virtualenv, or docker).

westurner commented 7 years ago

How about external -> URL? and internal -> "data"?

@ type:


parente commented 7 years ago

@ type:

  • EnvironmentSpecification
  • SoftwareEnvironment
  • ReproducibleEnvironment
  • ?

Do these map to an underlying implementation of those things (conda, docker, virtualenv, sbt, ...) that a tool can use to recreate the environment?

rgbkrk commented 7 years ago

The reason I ask about JSON is that conda maps well to JSON since it's YAML. I highly prefer indexing JSON documents that aren't double encoded.
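The "double encoding" concern can be illustrated with a small (assumed) example: embedding the environment spec as a raw string forces every reader to run a second parse before it can index anything, while embedding it as a JSON object keeps it directly addressable.

```python
# Illustrating the double-encoding concern above (invented example data).
import json

# String-embedded spec: readers must parse before indexing.
double_encoded = {"internal": '{"name": "foo", "dependencies": ["numpy"]}'}
name = json.loads(double_encoded["internal"])["name"]

# Object-embedded spec: directly indexable, no second parse.
structured = {"internal": {"name": "foo", "dependencies": ["numpy"]}}
assert structured["internal"]["name"] == name
```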

westurner commented 7 years ago

Do these map to an underlying implementation of those things (conda, docker, virtualenv, sbt, ...) that a tool can use to recreate the environment?

Not yet.

minrk commented 7 years ago

I don't love including the environment specification in the notebook document because it works very poorly once you have more than one document. But including a 'reference' to the environment file, such as environment_file: path or some such, seems to make sense. I guess the question is how generic/abstract we want the information in the notebook to be.

Of course, @parente's spec covers both cases: external would be a 'source' of the spec, which I like, and internal would be the (not double-serialized) spec for cases that value single-file sharing units above all else. I really think the internal version should be discouraged, though, as it makes things much less portable/standard/common than using existing standard files that all kinds of tools/services increasingly understand.

Yet another option, and one that doesn't need to be mutually exclusive, is to have git/textmate/binder-style specification based on directory location: i.e. if an environment.yml file is found in the current directory or any parent, use that as default.
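The git/textmate/binder-style lookup described above can be sketched in a few lines: starting from the notebook's directory, walk upward until an environment.yml is found. `find_environment_file` is a hypothetical helper name, not an existing Jupyter API.

```python
# Sketch: find the nearest environment.yml in the current directory or any
# parent, mirroring how git discovers .git or binder discovers its config.
from pathlib import Path

def find_environment_file(start, name="environment.yml"):
    """Return the nearest matching file in `start` or any parent, else None."""
    path = Path(start).resolve()
    for directory in [path, *path.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```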

Of course, we have two almost orthogonal questions to answer, here:

  1. what 'environment-specification' spec do we support
  2. what UI do we present to users

I think @parente's proposal, or something equivalent, where we only specify:

- type: string ('conda', 'pip', 'bundler', etc.)
- source: url (or spec)
- spec: [object, array, string] (or source)

makes sense for nbformat-blessed metadata, and the scope of this repo (I'll reiterate that metadata is a place to put arbitrary extensions, so people are welcome to experiment with metadata.ext_environment... while we hash out a standard here).

The only UI-related bit that might reach into what we do here is any potential change to kernelspecs that this may imply. e.g. is an environment declaration mutually exclusive with a kernelspec name? When creating a kernelspec from an environment, do we embed the environment specification in the kernelspec? If both an environment and kernelspec name are present but do not match, what should that mean?

The implementation/UI question is a can of worms, but can be mostly saved for later:

  1. how do we present to the user the ability to specify an environment?
  2. what do we do if we find an environment spec, but no kernel implements it?
  3. how do we determine if a kernel already provides an environment?
  4. how do we handle environment specifications we don't understand?
  5. how do we handle environment source URLs that don't resolve?

but none of those questions are really sensitive to the spec of how the environment is declared.

damianavila commented 7 years ago

Pinging people probably interested in this discussion as well: @bollwyvl, @janschulz, @Cadair.

jankatins commented 7 years ago

For an issue discussing the portability of notebooks from a notebook server with a conda kernel manager (and the absence of a pythonX default kernel), see: https://github.com/Anaconda-Platform/nb_conda_kernels/issues/45

I find a conda environment (or virtualenv, ...) quite nice, but there will always be portability problems (e.g. I tried to set up my notebook server in a conda environment without a python kernel -> no pythonX available in the notebook...).

bollwyvl commented 7 years ago

I am 👍 on being able to store "enough" environment right in a notebook to reproduce it to the user's level of satisfaction. Never underestimate download/email/chat reproducibility: if you are looking at an nbviewer page for a notebook, there is no indication that it has a Dockerfile/requirements.txt/environment.yml next to it (though it could!). Even on binder, knowing how the environment you are looking at came into existence would be useful.

But being able to just email someone a notebook and have it just work is a pretty important affordance, and the more we can help people do magical things with compute without having to learn more file formats/tools, the better. I.e. if you install via anaconda, it should be easy to command your environment (we've done that, sorta, with nb_conda/nb_conda_kernels/nb_anacondacloud). If you are running in Docker, even if you are also using conda, you should have the whole Docker(compose)verse available to you. Ideally, you should be able to do this without having to master the CLI/extra formats, as they distract from the science.

On-disk format: hate to keep saying it, but JSON-LD already has these kinds of things nailed: you can say that a field can be either an URI or it can be a sub-document. Put the right semantic meaning in, i.e. environment means jupyter:willRunInIfProvisionedWith, and you could have an ordered list of them that could also specify the platform for which it is targeted.

We could go, as @westurner obliquely hinted, down the linked data standards route... it turns out there is already a really heavy-duty way to do this, SPDX, though I have not experimented with representing conda/virtualenv/docker in them.

UI: I've hacked up various concepts for the jQuery notebook based on nb_conda_kernels that throw some ideas out in the context of an entry_point-driven generalization of the idea: not saying it's done, but it's something:

how do we present to the user the ability to specify an environment?

Basically, always offer the "simple" case, if you don't want to mess with envs, but once you pick it up then treat it as a UI tuple everywhere, of (Kernel Name, Env Type, Env Name). It would be nice if the providers could offer an icon (lil anaconda, lil whale).

what do we do if we find an environment spec, but no kernel implements it?

Didn't go down this road, but it seems like building a new env is a thing that an extension could offer; otherwise it would just offer running in whatever you have. We have nb_conda, which takes care of that, but it won't look inside a notebook and offer to extract an env embedded with nb_anacondacloud. Certainly from our point of view, we could do better on unifying those... and having a semi-blessed metadata format would help!

how do we determine if a kernel already provides an environment?

I feel like most env things can list what they have installed... again, this gets to the platform/arch issue...

how do we handle environment specifications we don't understand?

Heh, good question. Perhaps propagating a URL to the doc for the thing that created it? Or, treat that as a URI and use it in place of the simple string type. This would also give folks a natural way to support versioning of the env spec reader...

how do we handle environment source URLs that don't resolve?

I think resolving the URL would almost always require an end-user decision, even if successful, right? For example, if the env source is behind a corporate firewall, you might just give someone a big box to paste into and say, "go figure this out some other way". If it did work (since you are on VPN), you'd still want to preview it... Docker and virtualenv don't really have a "dry run" concept.

parente commented 7 years ago

Sorry I've been MIA for a bit after starting this. I think everyone is agreeing that we should get env information into the notebook, but now we just need to decide on the scope and format. I'll try to open a PR against the schema real-soon-now so we can more easily iterate on that part.

One thought about types:

- type: string ('conda', 'pip', 'bundler', etc.)
- source: url (or spec)
- spec: [object, array, string] (or source)

Mimetypes are being used and discussed elsewhere in Jupyter. What if type values here are mimetypes like application/vnd.dockerfile, application/vnd.virtualenv.requirements, etc.? This feels less hard-coded than pip, docker, etc.

As for handling unknown types, I think that's a UX decision that has to be made by tools that choose to handle this metadata (e.g., tell the user it's unknown, suggest a fallback alternative, Google it for them (jk), ...)
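A tool handling the mimetype idea above might keep a small registry and fall through cleanly on unknown types (the UX decision just mentioned). These `application/vnd.*` names are invented for illustration, not registered media types, and `tool_for` is a hypothetical helper.

```python
# Sketch: map hypothetical environment mimetypes to the tool that consumes
# them. Unknown types return None so the UI can inform the user rather than
# guess — the fallback behavior discussed above.
KNOWN_ENVIRONMENT_TYPES = {
    "application/vnd.dockerfile": "docker",
    "application/vnd.virtualenv.requirements": "pip",
    "application/vnd.conda.environment": "conda",
}

def tool_for(env_type):
    """Return the tool name for a known environment mimetype, else None."""
    return KNOWN_ENVIRONMENT_TYPES.get(env_type)
```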

rgbkrk commented 7 years ago

I'd love for the types, since they're external, to include a version number.