nteract / papermill

📚 Parameterize, execute, and analyze notebooks
http://papermill.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Kernel Gateway engine implementation #303

Open MSeal opened 5 years ago

MSeal commented 5 years ago

Implement integration with https://jupyter-kernel-gateway.readthedocs.io/en/latest/ as an engine option within papermill.

PhE commented 5 years ago

Does this feature mean papermill will be able to execute the notebook on an existing Jupyter instance on another host?

MSeal commented 5 years ago

Basically, yeah. No one has been working on the implementation though. I think most users find that local execution is where papermill has the strongest niche to fill in today's ecosystems.

jpugliesi commented 4 years ago

@MSeal Any update on the status of this? I don't fully understand papermill's architecture, and I'd like to understand why papermill's existing default engine can't run notebooks via the kernel gateway, especially since the NB2KG extension is built into Jupyter as of Notebook 6.0. What prevents the existing papermill engine from running a notebook via kernel gateway, and what would the new engine have to change?

MSeal commented 4 years ago

I've been digging into kernel gateway work more recently. Essentially, there are contractual deviations from the jupyter_client classes for communicating with the remote Jupyter kernel that don't work out of the box with papermill. That NB2KG addition circumvents the code used in headless execution with its own management process to account for remote concerns. I've been talking some with @kevin-bates about how to unify patterns a bit more, but there's still low-level work to be done. My guess is that the easier short-term solution would be to make a kernel-gateway engine in papermill that accounts for the specific changes needed to launch nbclient with a kernel manager in a similar fashion as NB2KG does. I have not had (and probably won't have for a long while) time personally to tackle this, though I'd be happy to review another contribution to making this work, and I'll bring it up at the next server meetup I attend to see if anyone has interest in helping.

jpugliesi commented 4 years ago

@MSeal got it - and thanks for the quick response. For what it's worth, the Enterprise Gateway (EG) project recently merged a change that makes it simpler to execute EG kernels with nbclient (by using the RemoteKernelManager): https://github.com/jupyter/enterprise_gateway/pull/810/files

With this change, it seems it would be possible to simply pass the kernel_manager_class via kwargs to papermill.execute_notebook - does this seem reasonable?

MSeal commented 4 years ago

That's true, that PR did get merged not so long ago!

You can set that flag by calling:

import papermill as pm
from enterprise_gateway.services.kernels.remotemanager import RemoteKernelManager

pm.execute_notebook('input.ipynb', 'output.ipynb', kernel_manager_class=RemoteKernelManager)

and it will propagate to the nbclient. I haven't tried this out personally, nor do I have an EG setup to point it to quickly, so if you'd like to try it out and report back, that'd be appreciated. If it works well we could add it to our docs.

jpugliesi commented 4 years ago

@MSeal Ok, I just tried calling papermill with the kernel_manager_class specified, but it doesn't work. Here's the setup:


a kernel called spark_python_yarn_cluster is available in the Gateway

➜  docker exec enterprise-gateway jupyter kernelspec list
Available kernels:
  spark_python_yarn_cluster         /usr/local/share/jupyter/kernels/spark_python_yarn_cluster

locally, where papermill will be running, only a python3 kernel is available:

➜ jupyter kernelspec list
Available kernels:
  python3    /Users/jpugliesi/.pyenv/versions/3.7.5/share/jupyter/kernels/python3

I'm running the latest enterprise gateway (which should have the aforementioned change)

➜  curl http://127.0.0.1:8888/api
{"version": "6.0.3", "gateway_version": "2.2.0.dev0"} 

I've configured env variables to connect to the enterprise gateway

➜  env | grep JUPYTER
JUPYTER_GATEWAY_URL=http://127.0.0.1:8888
JUPYTER_GATEWAY_VALIDATE_CERT=false

Created a simple notebook, notably using the spark_python_yarn_cluster kernel that's defined in the gateway

➜ cat nb.ipynb
...
 "metadata": {
  "kernelspec": {
   "display_name": "Spark - Python",
   "language": "python",
   "name": "spark_python_yarn_cluster"
  },
...

Running this script to attempt to connect to the gateway via papermill:

➜ cat pm.py
import papermill as pm
from enterprise_gateway.services.kernels.remotemanager import RemoteKernelManager

pm.execute_notebook('nb.ipynb', 'output.ipynb', kernel_manager_class=RemoteKernelManager)

And it fails with the following trace:

➜  tmp python pm.py
Executing:   0%|                                                                                                                                             | 0/2 [00:00<?, ?cell/s]
Traceback (most recent call last):
  File "pm.py", line 4, in <module>
    pm.execute_notebook('nb.ipynb', 'output.ipynb', kernel_manager_class=RemoteKernelManager)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/papermill/execute.py", line 106, in execute_notebook
    **engine_kwargs
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/papermill/engines.py", line 343, in execute_notebook
    cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/papermill/engines.py", line 402, in execute_managed_notebook
    return PapermillNotebookClient(nb_man, **final_kwargs).execute()
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/papermill/clientwrap.py", line 36, in execute
    with self.setup_kernel(**kwargs):
  File "/Users/jpugliesi/miniconda3/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/nbclient/client.py", line 425, in setup_kernel
    self.start_new_kernel_client(**kwargs)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/nbclient/util.py", line 72, in wrapped
    return just_run(coro(*args, **kwargs))
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/nbclient/util.py", line 51, in just_run
    return loop.run_until_complete(coro)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/nbclient/client.py", line 380, in async_start_new_kernel_client
    **kwargs))
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 303, in start_kernel
    self._get_process_proxy()
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 487, in _get_process_proxy
    process_proxy_cfg = get_process_proxy_config(self.kernel_spec)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/jupyter_client/manager.py", line 84, in kernel_spec
    self._kernel_spec = self.kernel_spec_manager.get_kernel_spec(self.kernel_name)
  File "/Users/jpugliesi/miniconda3/lib/python3.7/site-packages/jupyter_client/kernelspec.py", line 235, in get_kernel_spec
    raise NoSuchKernel(kernel_name)
jupyter_client.kernelspec.NoSuchKernel: No such kernel named spark_python_yarn_cluster

I don't know enough about nbclient or the RemoteKernelManager, but evidently papermill doesn't recognize that the list of available kernels should come from the Gateway rather than from the kernels local to the papermill process. Any ideas as to why this may be?
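The failure can be reproduced without starting a kernel: the traceback ends in jupyter_client's local kernelspec lookup, so the notebook's kernel name only has to be missing from the local list. A minimal pre-flight sketch, assuming the notebook JSON layout shown above; `notebook_kernel_name` and `check_kernel_available` are hypothetical helper names, not papermill or jupyter_client APIs:

```python
import json


def notebook_kernel_name(notebook_json: str) -> str:
    """Return metadata.kernelspec.name from a notebook's JSON text."""
    nb = json.loads(notebook_json)
    return nb["metadata"]["kernelspec"]["name"]


def check_kernel_available(notebook_json: str, local_kernels) -> bool:
    """True if the notebook's kernel is among the locally known kernel names."""
    return notebook_kernel_name(notebook_json) in set(local_kernels)


# Trimmed copy of the notebook metadata from this report.
nb_text = json.dumps({
    "cells": [],
    "metadata": {
        "kernelspec": {
            "display_name": "Spark - Python",
            "language": "python",
            "name": "spark_python_yarn_cluster",
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4,
})

# Only python3 is installed where papermill runs, so the check fails locally.
local_only = ["python3"]
print(check_kernel_available(nb_text, local_only))
```

In a real environment the local names could come from `jupyter_client.kernelspec.KernelSpecManager().find_kernel_specs()`, which returns a dict keyed by kernel name.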

MSeal commented 4 years ago

Hmm @golf-player or @kevin-bates might know offhand since they were involved with making/merging that PR more than I was. I'd have to dig into the code more to get a sense of why the kernel name was looking locally.

golf-player commented 4 years ago

What I worked on didn't really make it possible to use enterprise-gateway (EG) with nbclient/papermill. It made RemoteKernelManager (RKM) usable with nbclient. The subtle difference is that RKM is usually run inside EG, and hence uses the same kernelspec manager. But when RKM is run independently from EG, it'll have to provide its own kernels (which it'll get locally by default).

There are further caveats, like having to be running inside a kube cluster in order to be able to use the Kube Process Proxy.

I think the ideal solution here might be a compatible client class which uses the gateway-url rather than the EG internals. I was planning on doing that initially, but using RKM happened to be perfect for my use case.

kevin-bates commented 4 years ago

Correct. @golf-player's change was to allow a local installation access to the same kernels that EG provides w/o having to proxy the kernel's lifecycle management to the EG server. In addition, had there been a locally defined spark_python_yarn_cluster kernelspec, it would need to contain the endpoint information of a YARN resource manager node and require some amount of setup within the YARN cluster. That is, this isn't just some plug-and-play kind of thing (as much as we'd like to believe it is).

EG is a headless "kernel server". It can be hit directly via REST using the /api/kernels endpoints to start, monitor, interrupt, restart and shutdown kernels - local or remote to itself. When the --gateway-url is configured for Notebook or Jupyter Lab applications, that instructs the Notebook server to proxy all kernel management operations (including retrieval of kernelspecs) to the host specified in --gateway-url - which is expected to be a Gateway server - either Jupyter Kernel Gateway or Jupyter Enterprise Gateway.

RemoteKernelManager just extends the kernel manager of jupyter_client and abstracts the Popen process layer where "process proxy" implementations can be plugged in - each of which knows how to interact with its corresponding resource manager (YARN, Kubernetes, DockerSwarm, etc.). The identity of which process proxy is targeted is configured into the kernelspec file itself.
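The REST operations mentioned above map onto a handful of HTTP requests. The sketch below only builds the requests with the standard library and does not send them; the paths follow the Jupyter kernels REST API as I understand it, and `gateway_request` and the placeholder kernel ID are made up for illustration:

```python
import json
import urllib.request


def gateway_request(base_url, method, path, payload=None):
    """Build (but do not send) a request against a Gateway server."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# Same value as JUPYTER_GATEWAY_URL earlier in the thread.
base = "http://127.0.0.1:8888"
# Placeholder; a real ID is returned in the response to the start call.
kernel_id = "00000000-0000-0000-0000-000000000000"

requests_by_action = {
    "list_specs": gateway_request(base, "GET", "/api/kernelspecs"),
    "start": gateway_request(base, "POST", "/api/kernels", {"name": "spark_python_yarn_cluster"}),
    "monitor": gateway_request(base, "GET", f"/api/kernels/{kernel_id}"),
    "interrupt": gateway_request(base, "POST", f"/api/kernels/{kernel_id}/interrupt"),
    "restart": gateway_request(base, "POST", f"/api/kernels/{kernel_id}/restart"),
    "shutdown": gateway_request(base, "DELETE", f"/api/kernels/{kernel_id}"),
}

for action, req in requests_by_action.items():
    print(action, req.get_method(), req.full_url)
```

Each request could be sent with `urllib.request.urlopen(req)` against a running Gateway. Executing code on the kernel goes over a separate WebSocket channel, which this sketch does not cover.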

jpugliesi commented 4 years ago

@mseal @golf-player @kevin-bates thanks for the additional context. I'm interested in getting more involved to contribute, but this is my first encounter with the Jupyter internals, so I'm definitely still getting my bearings. Any guidance/docs y'all can provide towards understanding the internals would be much appreciated.

Re: @golf-player

I think the ideal solution here might be a compatible client class which uses the gateway-url rather than the EG internals. I was planning on doing that initially, but using RKM happened to be perfect for my use case.

Just to make sure I understand, by "compatible client class", do you mean an alternative to nbclient? How might this relate to the GatewayClient used in the notebooks project?

Overall, it sounds like running a gateway kernel via papermill requires more than just using a gateway-compliant kernel_manager_class - looking at the jupyter/notebook project, the Gateway kernel_spec_manager_class also needs to be used, possibly a session_manager_class, as well as notebook handlers?

Here's how notebook configures these kernel management classes:

if self.gateway_config.gateway_enabled:
    self.kernel_manager_class = 'notebook.gateway.managers.GatewayKernelManager'
    self.session_manager_class = 'notebook.gateway.managers.GatewaySessionManager'
    self.kernel_spec_manager_class = 'notebook.gateway.managers.GatewayKernelSpecManager'

The notebook project also configures some Gateway handlers:

if GatewayClient.instance().gateway_enabled:
    # for each handler required for gateway, locate its pattern
    # in the current list and replace that entry...
    gateway_handlers = load_handlers('notebook.gateway.handlers')
    for i, gwh in enumerate(gateway_handlers):
        for j, h in enumerate(handlers):
            if gwh[0] == h[0]:
                handlers[j] = (gwh[0], gwh[1])
                break

Again, looking to gain a deeper understanding of how things work and how I can possibly contribute. Appreciate your help!

kevin-bates commented 4 years ago

@jpugliesi - thanks for starting this discussion and offering to help.

The gateway integration that is embedded in the Notebook server purely instructs notebook to redirect its kernel-related operations to another server (a Gateway server). Once on the Gateway server - which is merely a notebook server "repurposed" in a headless manner - whatever kernel_manager_class is configured "takes over". For the Enterprise Gateway server, that kernel_manager_class happens to be RemoteKernelManager - the same RKM that @golf-player has made "independent" for direct use by nbclient, etc.

(Note: the kernel_manager_class that you reference above is actually a MultiKernelManager subclass that essentially manages KernelManager instances, of which RemoteKernelManager is a subclass. This distinction is often unnecessary to call out, but if you were to start digging into things, it could become confusing (as it is for all of us).)

golf-player commented 4 years ago

Sorry to add on to confusion here. When I said compatible client, I meant compatible KernelManager (like RKM).

If the idea is to make a kernelmanager which interacts with the gateway, a lot of things need to be built: a client like GatewayClient (which is an enigma to me as it seems to be unused everywhere), a kernel manager, and the other things you pointed out.

Point of fact, which should show how confusing the MultiKernelManager vs KernelManager issue is: RKM is a subclass of KernelManager, not MultiKernelManager. I believe nbclient only works with KernelManager's interface, but I may be wrong. Furthermore, there are places in the codebase where MultiKernelManagers are referred to as KernelManagers, which adds to the confusion, so be careful with that distinction.

If you just want to run a remote kernel using papermill, you could do it the way I do, which is using RKM along with all the restrictions that come with it. Otherwise, you'd need to make a kernelmanager which entirely interfaces with the gateway API.
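To make the KernelManager vs MultiKernelManager distinction concrete, here is a toy model. These are not the jupyter_client classes; `ToyKernelManager` and `ToyMultiKernelManager` are hypothetical stand-ins that only mirror the relationship: one object per kernel versus a registry of those objects keyed by kernel ID.

```python
import uuid


class ToyKernelManager:
    """Stands in for a KernelManager: owns the lifecycle of ONE kernel."""

    def __init__(self, kernel_name):
        self.kernel_name = kernel_name
        self.running = False

    def start_kernel(self):
        self.running = True

    def shutdown_kernel(self):
        self.running = False


class ToyMultiKernelManager:
    """Stands in for a MultiKernelManager: a registry of per-kernel managers."""

    def __init__(self, kernel_manager_class=ToyKernelManager):
        self.kernel_manager_class = kernel_manager_class
        self._kernels = {}

    def start_kernel(self, kernel_name):
        # Create a per-kernel manager, start it, and file it under a new ID.
        km = self.kernel_manager_class(kernel_name)
        km.start_kernel()
        kernel_id = str(uuid.uuid4())
        self._kernels[kernel_id] = km
        return kernel_id

    def get_kernel(self, kernel_id):
        return self._kernels[kernel_id]

    def shutdown_kernel(self, kernel_id):
        self._kernels.pop(kernel_id).shutdown_kernel()


mkm = ToyMultiKernelManager()
kid = mkm.start_kernel("python3")
km = mkm.get_kernel(kid)
print(type(km).__name__, km.running)
```

In this picture RKM plays the role of `ToyKernelManager`, which is why it can be handed to nbclient directly, while the Gateway-side class configured in the notebook server plays the role of the registry.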

MSeal commented 4 years ago

there are places in the codebase where MultiKernelManagers are referred to as KernelManagers, which adds to the confusion, so be careful with that distinction.

I would really like to correct these and get it so the MultiKernelManager is well isolated from the KernelManagers. I think the naming being highly similar would be too difficult to change at this point. But if you could make issues (or just fix) places where MKM is referred to as a KM that would be quite appreciated.

If you just want to run a remote kernel using papermill, you could do it the way I do, which is using RKM along with all the restrictions that come with it.

I think having a documented RKM path would be best for the intended use case here. I would like to get to a point where jupyter_client has an RKM class that meets a few remote kernel needs as a base (possibly as an abstract class). I was starting this conversation with @kevin-bates, and this might be a chance to evaluate where to start on something more centrally supported, if possible, since there's some confusion in how things are organized today imo.

kevin-bates commented 4 years ago

I would like to get to a point where jupyter_client has an RKM class that meets a few remote kernel needs as a base (possibly as an abstract class).

Kernel providers resolve this and allow for anyone to do whatever they want so long as they adhere to the contract of discovery (which returns the equivalent of the kernelspecs) and startup (which returns the equivalent of the KernelManager interface).
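As a rough illustration of that contract, the sketch below uses a hypothetical `KernelProvider` base class with a discovery method and a launch method. The class and method names are invented for the example and are not taken from the kernel providers proposal or from any released package.

```python
from abc import ABC, abstractmethod


class KernelProvider(ABC):
    """Hypothetical contract: discovery plus startup."""

    @abstractmethod
    def find_kernels(self):
        """Yield (name, attributes) pairs, the equivalent of kernelspecs."""

    @abstractmethod
    def launch(self, name):
        """Start the named kernel and return a KernelManager-like object."""


class FakeManager:
    """Minimal stand-in for the KernelManager interface."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def is_alive(self):
        return self.alive

    def shutdown(self):
        self.alive = False


class InMemoryProvider(KernelProvider):
    """Provider backed by a dict instead of kernelspec files on disk."""

    def __init__(self, specs):
        self._specs = dict(specs)

    def find_kernels(self):
        yield from self._specs.items()

    def launch(self, name):
        if name not in self._specs:
            raise KeyError(f"No such kernel: {name}")
        return FakeManager(name)


provider = InMemoryProvider({"spark_python_yarn_cluster": {"language": "python"}})
names = [name for name, _attrs in provider.find_kernels()]
manager = provider.launch(names[0])
print(names, manager.is_alive())
```

A client written against such a contract would not care whether the kernels it discovers are local or remote, as long as both methods behave as described.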

MSeal commented 4 years ago

Potentially yes. But I do worry that the proposal has not worked through ~a lot~ all of the implementation details or complexities it would introduce as it currently stands, so it's not in a state where someone could just go implement it imo.

kevin-bates commented 4 years ago

Ok, fair enough. I suppose you're talking about current clients of jupyter_client and not about folks that want to write a kernel provider. There are a handful of examples of kernel provider implementations. But, you're right, clients of jupyter_client have to make changes to adopt providers and I suspect we could make that easier. However, since the kernels used by jupyter_client also work by default with providers, the two frameworks could co-exist - just not easily within the same application (nor is there any need to do that anyway).

golf-player commented 4 years ago

I think having a documented RKM path would be best for the intended use case here

FWIW, I've added some documentation for using RKM the way I do, but I don't think it's been added to the docs page

BobCashStory commented 4 years ago

@golf-player can we have a link to your documentation?

kevin-bates commented 4 years ago

I just realized the EG doc builds are failing 😞 - fixing now. However, the doc update in EG (not sure if there was another here as well) can be found in this markdown file.

This feature will be included in the upcoming EG 2.2.0 release (2.2.0rc2 is currently available). Some sample kernelspecs that include process proxy stanzas can be found here: https://github.com/jupyter/enterprise_gateway/releases/tag/v2.2.0rc2

Questions relative to EG should be asked on our gitter channel or discourse forum.

BobCashStory commented 4 years ago

@kevin-bates thanks a lot! When will the release be out?

kevin-bates commented 4 years ago

EG 2.2 was released a couple of weeks ago. If you find any issues or have specific questions regarding EG/RemoteKernelManager, please open an issue in EG or post your question on the gitter channel.

Thank you.

atronchi commented 1 year ago

Seems relevant: https://github.com/elyra-ai/elyra/blob/main/elyra/pipeline/elyra_engine.py

"""Papermill Engine that configures a KernelManager to hit a Gateway Server."""