How to implement LSP for a multi-language kernel (SoS)?

BoPeng commented 4 years ago

Many thanks for your great work on language server support. I have just tried jupyterlab-lsp, which works great for Python and R, but unfortunately does not work for a multi-language kernel SoS that I have developed.

The idea behind SoS is that it is a superkernel that sits between the frontend and other kernels (see this illustration for details). It allows the use of multiple kernels in one notebook (through sos-notebook for classic jupyter and jupyterlab-sos for jupyterlab), and allows data exchange among live kernels.

The reason why jupyterlab-lsp does not work with SoS is simple: it does not know what language SoS is. If we are to solve this problem, there needs to be some way for SoS to notify jupyterlab-lsp of the language used for each cell. I can work on both the frontend and the backend (e.g. write a language server for SoS), but I am not sure if cell-level language support is at all possible with jupyterlab-lsp. I would appreciate any insight from the developers on whether and how this can be done. Thanks.

bollwyvl commented 4 years ago

Yeah, the Language Server Protocol Specification doesn't say anything about multi-language documents, so we're kinda shooting in the dark here. Further, basically 0 language servers care about Jupyter's JSON format, or any special syntax kernel authors have added on top of their host language(s).

Our detection is currently based on existing Jupyter approaches like file extension sniffing, contents manager introspection, or in the case of notebooks, the kernel and/or notebook metadata. If everything is just sos, we can't do much for you, nor do we offer many hooks into this system at this point.

Presently, we do handle a small number of transclusions on the front-end, on a kernel-by-kernel basis, and it's rather deeply embedded inside the code. #191 discusses some approaches on how to normalize this, as a set of regular expressions + templates, or maybe some portable grammar and declarative transformation rules. If that's adopted, whether it's handled on the server side or the client side, there will be some hooks to extend it, ideally without having to rebuild the client (ha).

In a related effort, #268 (rough draft of an implementation on #278) suggests changing jupyter_lsp into a kernel, which handles all the management of language servers. If that approach is adopted, and your kernel supports kernel comms, you might be able to reuse the machinery there and offer your own solution... while that PoC presently treats the language server kernel as a singleton, it's important to me to not inject more "our way or the highway" pieces into the architecture: even for a single language server kernel implementation, it is important to be able to launch multiple instances that handle different documents, again without restarting your whole system.

However, as you've created a multi-language kernel with a special syntax, you've basically created a new language, which is certainly not unique: allthekernels, pidgy, metakernel are all in the same boat, protocol-and-bits-on-disk-wise. In all these cases, you might end up having to create a multi-language Language Server. There are a number of toolkits for different languages for doing so, e.g. pygls or vscode-languageserver-node, which might then in turn have to handle spinning up other language servers, as you really don't want to be writing all these things yourself. Costs aside, an investment in writing a Language Server can pay dividends through usability in any Language Server client.

Finally, there are also a number of upstream discussions occurring around this that may be worth your time to peruse:

krassowski commented 4 years ago

Just a super fast thought from me: we may want to support this case and it would be super easy if we settle on per-cell language definitions, but it requires a longer discussion and a consensus in the wider Jupyter community.

Will elaborate next weekend

rezaeir commented 4 years ago

@krassowski I think if lsp had this per-cell language definition it could work with SoS without much work on the SoS side because SoS kernel in a cell could be treated as another python kernel and its other functions don't have much of an overlap with lsp functionality. Am I wrong? @BoPeng

BoPeng commented 4 years ago

The problem could potentially be solved at the backend or frontend level.

If I am to implement an sos-language-server, it will of course try to start and use other language servers and act as a proxy. However, the language server protocol might not allow the passing of meta information to the server, so the sos language server might not be able to know the language of the content being passed. Hopefully the situation is not as bad as @bollwyvl said, "shooting in the dark".

It appears easier, and cleaner to implement this at the frontend level since jupyterlab-lsp is designed to work with multiple language servers anyway. It should be good enough for jupyterlab-lsp to know which language server to talk to at the cell level. SoS currently has some customized messages for changing cell level kernel (e.g. https://github.com/vatlab/jupyterlab-sos/blob/master/src/index.ts#L497), so it could be quite trivial, as @krassowski pointed out, if jupyterlab-lsp provides a hook/api for jupyterlab-sos to dynamically change the language of the kernel. I can work on a PR if this is allowed by the architecture, and acceptable to the team.
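
A minimal sketch of what such a frontend hook could look like, written in Python for brevity although the real extension code would be TypeScript. Everything here is an assumption made for illustration: `CellLanguageRegistry`, `set_language`, `on_change`, and the language-to-server table are hypothetical names and are not part of jupyterlab-lsp or jupyterlab-sos.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical table mapping a cell language to a language server name.
SERVER_FOR_LANGUAGE: Dict[str, str] = {
    "python": "pylsp",
    "r": "r-languageserver",
}


class CellLanguageRegistry:
    """Tracks the language of each cell and notifies listeners on change."""

    def __init__(self, default_language: str) -> None:
        self._default = default_language.lower()
        self._languages: Dict[str, str] = {}
        self._listeners: List[Callable[[str, str], None]] = []

    def on_change(self, listener: Callable[[str, str], None]) -> None:
        """Register a callback invoked as listener(cell_id, language)."""
        self._listeners.append(listener)

    def set_language(self, cell_id: str, language: str) -> None:
        """Called by the kernel extension when a cell switches language."""
        language = language.lower()
        if self._languages.get(cell_id, self._default) == language:
            return  # nothing changed, so do not notify anyone
        self._languages[cell_id] = language
        for listener in self._listeners:
            listener(cell_id, language)

    def server_for(self, cell_id: str) -> Optional[str]:
        """Return the server name for a cell, or None if none is known."""
        language = self._languages.get(cell_id, self._default)
        return SERVER_FOR_LANGUAGE.get(language)
```

In this sketch the SoS extension would call `set_language` from its existing cell-kernel message handler, and the LSP side would subscribe through `on_change` to reconnect the affected cell to a different server.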

bollwyvl commented 4 years ago

However, the language server protocol might not allow the passing of meta information to the server,

lsp had this per-cell language definition

language server protocol might not allow the passing of meta information to the server

I wouldn't hold your breath trying to get changes into LSP! I may be very mistaken, but you'd have to make the case in such pitches very strongly that it would benefit microsoft and vscode pretty directly, and probably land some reference implementation there.

per-cell language definitions

While useful, this doesn't solve the larger problem of per-token transclusions, e.g. line magics, or query languages embedded in strings (#197). Further, this would probably require a breaking change to nbformat, and probably the jupyter kernel messaging protocol, neither of which like to be changed much.

so the sos language server might not be able to know the language of the content being passed

Assuming your files-on-disk can be statically analyzed by sos-language-server: the way it would work for a "pure" language server today:

- user installs jupyter-lsp and sos-language-server
  - sos-language-server registers itself for whatever file extensions, mime types, and codemirror modes you created for the language sos:
    - we support traitlets (e.g. jupyter_notebook_config.json) and setuptools entry_points
- jupyter-lsp would advertise the sos spec on its REST API
- when a sos kernel session gets started, finding the sos declaration, jupyterlab-lsp would open a new websocket for sos, to be used for all sos documents
  - jupyter-lsp would start sos-language-server
    - we presently support stdio, but need to add TCP (e.g. https://github.com/krassowski/jupyterlab-lsp/issues/184#issuecomment-625798784)
- jupyterlab-lsp would start the LSP session with initialize
  - jupyter-lsp would proxy this and all messages verbatim to sos-language-server
  - sos-language-server
- jupyterlab-lsp would finish setup with workspace/didChangeConfiguration (#245), textDocument/didOpen, etc.
  - sos-language-server would:
    - parse the sos syntax (with access to the whole file)
    - determine which actual language server should be started/configured
      - hopefully being able to reuse the configuration machinery from jupyter-lsp
    - start and send initialize to each of those language servers
    - wait for all of those to start up
    - transform the messages coming back, potentially sending them over the WebSocket
      - one of the first messages received is usually textDocument/publishDiagnostics
    - finally, merge all those responses, and send them back
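
The last few steps (split by language, ask each child server, translate positions back, merge) can be sketched roughly as below. This is only an illustration under stated assumptions: the host document is assumed to be already split into `Segment`s, each child language server is replaced by a plain function (`Analyzer`), and none of these names come from jupyter-lsp, SoS, or any LSP library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    language: str
    start_line: int  # 0-based line of the segment within the host document
    text: str


@dataclass
class Diagnostic:
    line: int     # 0-based line number
    message: str
    source: str   # which language produced the diagnostic


# Stand-in for a real child language server: text in, diagnostics out.
Analyzer = Callable[[str], List[Diagnostic]]


def merge_diagnostics(
    segments: List[Segment], analyzers: Dict[str, Analyzer]
) -> List[Diagnostic]:
    """Analyze each language separately and map results to host lines."""
    merged: List[Diagnostic] = []
    for language in sorted({seg.language for seg in segments}):
        analyzer = analyzers.get(language)
        if analyzer is None:
            continue  # no child server available for this language
        lines: List[str] = []
        host_line_of: List[int] = []  # child line index -> host line
        for seg in segments:
            if seg.language != language:
                continue
            for offset, line in enumerate(seg.text.splitlines()):
                lines.append(line)
                host_line_of.append(seg.start_line + offset)
        for diag in analyzer("\n".join(lines)):
            # Assumes the analyzer only reports lines it was given.
            host_line = host_line_of[diag.line]
            merged.append(Diagnostic(host_line, diag.message, language))
    return sorted(merged, key=lambda d: d.line)
```

A real proxy would do the same position translation for every request and response (completion, hover, and so on), not only for diagnostics.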

appears easier, and cleaner to implement this at the frontend level

That's your call: as an extension to an extension to a client, the stuff would "only" work with jupyterlab-lsp, and only with the version of jupyterlab we support, and therefore would need to be upgraded in pretty tight lockstep with the Lab version. No doubt you could write your stuff in such a way that the "guts" could be used in another client.

dynamically change the language of the kernel. I can work on a PR if this is allowed by the architecture, and acceptable to the team.

As I mentioned, have a look at #191. If, instead of requiring hacking a bunch of typescript (which, yes, we should of course allow, expose, and dogfood to implement any of the below), sos could do one or more of:

- Put in A Folder some schema-constrained JSON, some nunjucks templates, or a portable grammar which gets exposed by jupyter_lsp
- propose and implement a Kernel comm target, e.g. jupyter.lsp.transclusions which sos can use

...to mostly-statically describe "ways to transform code and into what language". The kernel-based approach could potentially offer said code transformation dynamically. This would support these concepts in a way that jupyterlab-lsp would only be a reference implementation, not the only implementation.
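
To make the "regular expressions + templates" idea concrete, here is a rough sketch of declarative rules stored as JSON and applied to a cell. The rule schema (`pattern`, `language`, `template`) is an assumption made up for this example, and Python's `str.format` stands in for a real template engine such as nunjucks; nothing here reflects an actual jupyter_lsp format.

```python
import json
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical rule file: each rule names a regex with a "body" group,
# the target language, and a template for the extracted code.
RULES_JSON = r"""
[
  {"pattern": "^%%R\\b[^\\n]*\\n(?P<body>[\\s\\S]*)$",
   "language": "r", "template": "{body}"},
  {"pattern": "^%%bash\\b[^\\n]*\\n(?P<body>[\\s\\S]*)$",
   "language": "bash", "template": "{body}"}
]
"""


def load_rules(text: str) -> List[Dict[str, str]]:
    """Parse the JSON rule list."""
    return json.loads(text)


def transclude(
    code: str, rules: List[Dict[str, str]]
) -> Optional[Tuple[str, str]]:
    """Return (language, transformed code) for the first matching rule."""
    for rule in rules:
        match = re.match(rule["pattern"], code)
        if match:
            transformed = rule["template"].format(**match.groupdict())
            return rule["language"], transformed
    return None  # the cell stays in the host language
```

Because the rules are plain data, a kernel author could ship them without touching the client code, which is the point of the proposal.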

BoPeng commented 4 years ago

@bollwyvl Thanks for all the info. Let me dive into language server (protocol and implementation) and source code of jupyterlab-lsp before getting back to you.

krassowski commented 4 years ago

@BoPeng just wanted to let you know that I worked hard on restructuring the source code to make it more pleasant to look at. Also, potentially of interest to you could be the improved cell-level syntax highlighting that we added here: https://github.com/krassowski/jupyterlab-lsp/pull/319. Please let us know if you are still interested in working on bridging SoS with jupyterlab-lsp - we are always happy to help!

BoPeng commented 4 years ago

Yes, this is on my TODO list, even relatively high, but I am swamped with other obligations (covid related projects, not surprisingly) and have not been able to work on this.

BoPeng commented 3 years ago

I had another look at the problem and it is likely that an sos language server, as @bollwyvl suggested, is the best way to proceed. It would be a larger project than my current bandwidth allows, so it will take a while before sos users can make use of language servers.

krassowski commented 3 years ago

Okay, instead of creating sos-language-server, why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

westurner commented 3 years ago

[...] why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

Are there any obstacles?

westurner commented 3 years ago

jupyterlab/debugger could/should/must also support multi-language notebooks. Are there similarities in implementation of the multi-language abstractions for LSP and for jupyterlab/debugger DAP support?

https://github.com/jupyterlab/debugger
- xeus-python is the only kernel that supports debugging so far
- ipykernel ~may also be supported soon~ also supports DAP: Debug Adapter Protocol https://github.com/jupyterlab/debugger/issues/274
- jupyter-debugger-protocol: https://github.com/jupyter/enhancement-proposals/blob/master/jupyter-debugger-protocol/jupyter-debugger-protocol.md

BoPeng commented 3 years ago

Okay, instead of creating sos-language-server, why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

That will make things much easier for SoS. SoS currently uses kernel metadata to specify the kernel of each cell, but I am willing to change that to whatever will be used by jupyterlab-lsp.

BTW, congratulations on the merge of https://github.com/jupyter/enhancement-proposals/pull/72 !

denvesi commented 3 years ago

Okay, instead of creating sos-language-server, why don't we just use per-cell language-server as we already do with cell magics for IPython? This should be simple to implement.

@krassowski I would be interested in implementing this. I am a student currently writing my master's thesis, and the project I am working on would benefit from supporting language servers. Unfortunately, the current state of the LSP plugin (if I understand it correctly) doesn't fit our use case, because we use multiple languages in one notebook. Per-cell language servers would solve this issue, so I would like to contribute. I am not the most experienced developer, though, and I need to get a bit more familiar with the existing code, so a little guidance or at least a general idea of how to solve this would be very much appreciated. :)

krassowski commented 3 years ago

You are very welcome to work on it. I will be available to help and guide you if you run into any problems, though I may have a longer response time than usual as the next two weeks are very busy for me. I will try to write up something with references to the code over the weekend.

denvesi commented 3 years ago

Thanks! That sounds great! It may take some time, because I am just at the beginning of my thesis, but I will try my best. Some references would be very helpful indeed.

denvesi commented 3 years ago

You are very welcome to work on it. I will be available to help and guide you if you run into any problems, though I may have a longer response time than usual as the next two weeks are very busy for me. I will try to write up something with references to the code over the weekend.

@krassowski Just a little update: I am still busy with some other parts of my thesis, but I'll have time to work on this issue soon. I know you're busy and I don't want to bother you, but I would really appreciate it if you could write a little guidance regarding the code and a general idea for solving the problem. That would help me a lot. Thanks in advance!

krassowski commented 3 years ago

Very quickly: on the relevant implementation level each cell (and file editor but this is not relevant) is represented by ICodeBlockOptions

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/virtual/document.ts#L39-L42

Code blocks are appended one by one by VirtualDocument.append_code_block():

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/virtual/document.ts#L611-L672

which calls VirtualDocument.prepare_code_block to extract fragments of code (which may be in different languages). The extraction itself is implemented in VirtualDocument.extract_foreign_code, which appends the foreign code to the appropriate foreign virtual document:

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/virtual/document.ts#L500-L555

There is also a notion of standalone snippets: even if consecutive cells use the same language, sometimes we do not want to merge them into the same virtual document (e.g. %%python magic which upon execution spawns a new interpreter so it is independent of any previous %%python magics); this is handled by:

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/virtual/document.ts#L472-L498

Back to appending code blocks: ICodeBlockOptions does not pass any cell metadata (is not even aware of cell existence) - it only passes the value and the reference to the editor. To condition extraction of virtual documents on cell metadata this needs to be passed too. The actual append operations are executed in:

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/virtual/document.ts#L934-L979

with these constructed from editors map in adapters:

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/adapters/adapter.ts#L322-L335

which for notebooks are:

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/adapters/notebook/notebook.ts#L243-L262

and for file editors there is only one editor:

https://github.com/jupyter-lsp/jupyterlab-lsp/blob/f67c880ca3187c4c41a07b30eaca6576ca9918c1/packages/jupyterlab-lsp/src/adapters/file_editor/file_editor.ts#L119-L121
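
To illustrate the flow described above with cell metadata added, here is a simplified model in Python (the real code is TypeScript and considerably more involved). The names `CodeBlock`, `VirtualDoc`, and `build_virtual_documents`, as well as the `"kernel"` metadata key, are assumptions for this sketch and do not mirror the actual jupyterlab-lsp classes.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass
class CodeBlock:
    value: str
    # Hypothetical: cell metadata passed along with the block.
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class VirtualDoc:
    language: str
    lines: List[str] = field(default_factory=list)


def build_virtual_documents(
    blocks: List[CodeBlock],
    default_language: str,
    standalone_languages: FrozenSet[str] = frozenset(),
) -> List[VirtualDoc]:
    """Append each block to the virtual document for its language.

    Blocks in a standalone language always get a fresh document, which
    models snippets that must not be merged with earlier ones.
    """
    shared: Dict[str, VirtualDoc] = {}
    documents: List[VirtualDoc] = []
    for block in blocks:
        language = block.metadata.get("kernel", default_language).lower()
        if language in standalone_languages or language not in shared:
            doc = VirtualDoc(language)
            documents.append(doc)
            if language not in standalone_languages:
                shared[language] = doc
        else:
            doc = shared[language]
        doc.lines.extend(block.value.splitlines())
    return documents
```

In this model the only change needed for per-cell languages is that the block carries metadata; the grouping logic itself stays the same as for magics.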

krassowski commented 3 years ago

We have to make the information on cell metadata available to the code extracting foreign virtual documents, so it might make sense to generalize the editors() getter so that it returns an object which includes both CodeEditor.IEditor and metadata. We may want to have this as a separate getter and reimplement get editors() as a simple extraction from the result of that new getter for backward compatibility.
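
A rough sketch of that backward-compatible shape, in Python rather than TypeScript for brevity. `EditorWithMetadata`, `NotebookAdapterSketch`, and `editors_with_metadata` are hypothetical names chosen for this example; the editor objects are plain stand-ins for CodeEditor.IEditor.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class EditorWithMetadata:
    editor: Any  # stand-in for CodeEditor.IEditor
    metadata: Dict[str, Any] = field(default_factory=dict)


class NotebookAdapterSketch:
    def __init__(self, cells: List[Tuple[Any, Dict[str, Any]]]) -> None:
        self._cells = cells  # list of (editor, cell metadata) pairs

    @property
    def editors_with_metadata(self) -> List[EditorWithMetadata]:
        """New getter: each editor together with a copy of its metadata."""
        return [
            EditorWithMetadata(editor, dict(metadata))
            for editor, metadata in self._cells
        ]

    @property
    def editors(self) -> List[Any]:
        """Old getter, kept for compatibility and derived from the new one."""
        return [item.editor for item in self.editors_with_metadata]
```

Existing callers of `editors` keep working unchanged, while the extraction code can switch to the richer getter.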

Or we may want to go all-in and rewrite this code from scratch and release a new major version.

One thing I very much want to include is the reference to the cell (its identifier) as a comment in the virtual document content so that we can reliably translate back-and-forth between the virtual document and the cells, enabling full-blown refactoring as described in https://github.com/jupyter-lsp/jupyterlab-lsp/issues/467. It might or might not be beneficial to rewrite the virtual document to live on the backend, but I think that we should first try to implement it in TypeScript.
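
A small sketch of the cell-identifier idea, assuming a `#`-style comment marker (so it only fits languages that use `#` for comments) and a made-up marker format `# cell-id: <id>`. The function names are hypothetical and this is not how jupyterlab-lsp currently maps positions.

```python
import re
from typing import List, Tuple

# Hypothetical marker format written before each cell's source.
MARKER = re.compile(r"^# cell-id: (?P<cell_id>\S+)$")


def build_virtual_document(cells: List[Tuple[str, str]]) -> str:
    """Join (cell_id, source) pairs, prefixing each with a marker line."""
    lines: List[str] = []
    for cell_id, source in cells:
        lines.append(f"# cell-id: {cell_id}")
        lines.extend(source.splitlines())
    return "\n".join(lines)


def locate(virtual_document: str, line: int) -> Tuple[str, int]:
    """Map a 0-based virtual line to (cell_id, 0-based line in that cell)."""
    lines = virtual_document.splitlines()
    if not 0 <= line < len(lines):
        raise IndexError(f"line {line} is outside the virtual document")
    cell_id = None
    marker_line = -1
    # The nearest marker at or above the requested line owns that line.
    for index, text in enumerate(lines[: line + 1]):
        match = MARKER.match(text)
        if match:
            cell_id, marker_line = match.group("cell_id"), index
    if cell_id is None or marker_line == line:
        raise ValueError(f"line {line} does not belong to any cell")
    return cell_id, line - marker_line - 1
```

One caveat of this naive version: user code that happens to contain a line matching the marker format would be misread as a cell boundary, so a real implementation would need a more robust marker.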

jupyter-lsp / jupyterlab-lsp, issue #282