Notebook IDE features - Githubissues

jupyter-lsp / jupyterlab-lsp

Coding assistance for JupyterLab (code navigation + hover suggestions + linters + autocompletion + rename) using Language Server Protocol

https://jupyterlab-lsp.readthedocs.io

BSD 3-Clause "New" or "Revised" License

1.8k stars 147 forks source link

Notebook IDE features #467

Open krassowski opened 3 years ago

krassowski commented 3 years ago

What are you trying to do?

The Julynter experiment #378 demonstrated how notebook-specific IDE features could work. The ideas would include:

"empty cell" diagnostic
"cells execution not in order" diagnostic (obviously optional)
"cell with comments only could be a markdown cell" diagnostic
"remove empty cells" action
and many others (e.g. "ratio of markdown to code cells")

How is it done today, and what are the limits of current practice?

Currently there is no LSP-compliant way to implement such features.
Users need to install Julynter which does not integrate well with this extension yet.
The LSP specification has no notebooks support.

What is new in your approach and why do you think it will be successful?

In order to make the language servers (optionally) cell-aware I propose we embrace the jupytext percent format:

# %% Optional title [cell type] key="value"

e.g. # %% for code cells and # %% [markdown] for markdown cell. We could store the cell execution number in metadata (key="value" part). We would allow the user to disable this feature. By embracing the percent format we could achieve some convergence with the Microsoft and PyCharm notebook extensions assuming they would like this idea, with the former important if we want to get this into LSP standard.

As the comments differ in syntax between languages we should probably closely collaborate with jupytext (@mwouts) as they already handle this problem for a numer of languages.

Having the right format at the language server we would then:

prepare one or two extensions to the existing language servers to demonstrate how one would interpret the comments in static analysis and emit appropriate diagnostics.
encourage community to provide more such extensions
having a proof of concept maybe try to make it part of the LSP protocol
eventually consider packaging language servers with these extensions as defaults for jupyter-lsp (e.g. we could package version-frozen pyls with notebook features plugins as pyls-notebook or ipython-langauge-server)

On the frontend we would need to:

get the comment characters specific to the language:
- provided by backend from jupytext if installed, or
- use the ones from CodeMirror if not know for given language from backend
render the diagnostics on cells, for example by highlighting cells in the appropriate color; here the lessons learned by Julynter experiment (@JoaoFelipe et al) would be extremely valuable.

What are the risks?

It would be a shame if someone else develops a completely different standard without reaching out to us and we would have more fragmentation. Lets try to keep developers of other tools in the loop once the idea matures a bit and announce it on the appropriate forums of standard authorities.

How long will it take?

I have no spare resources to allocate to this myself as those are low priority to my research. It would be largely in a planning phase there until the next holiday where I will possibly have more time to implement this, or someone else who is interested picks this up earlier (contributions welcome).

krassowski commented 3 years ago

Also I would love to hear back:

what other diagnositcs or features specific to notebooks would be useful,
if we could re-use parts of Julynter in a way that would respect its unique contribution to the ecosystem, but not requiring users to install two extensions. Or maybe it should not be implemented at all and left for Julynter to handle?

bollwyvl commented 3 years ago

I guess a "Jupyter Notebook Format Language Server" is not a terrible idea, and I'd rather do something in our wheel house (of standards-based JSON) than more custom text integrations (e.g. jupytext).

As a text encoding, canonical notebook format JSON has:

a well-understood, expected mime type
a readily-identifiable file extension
a well-described line/whitespace/ordering format
a JSON schema

so is certainly good enough for LSP-level pointers.

So now we would just need this server (which speaks concrete line/col numbers) connected to our squishy JSON document "good enough" to show:

pylint diagnostics next to
JSON schema violations

The big thing holding us back from that is our Document vs Language Server cardinality:

basically, one "root" language server, derived by file type and/or kernel metadata
- any number of "nested" language servers

This probably needs to be a full-on architecture PR with drawings :pouting_cat:

For text documents, this works okay, but we still want, for example, pyls and jedils, on the same document (with fine-grained tuning of which language server provides which feature).

ANYHOW, imagining we could: A rough demonstration of it today would be configuring the json-language-server to use the nbformat JSON schema for *.ipynb files. This would give you diagnostics on, e.g. execution_count missing from a code_cell, and probably some limited other features. It would not co-ordinate very well with the JSON-model based, but I'm pretty sure that could be made "less surprising," generally.

A more fully-built one would probably be built on pygls and nbformat directly, connected with the text representation of the document.

krassowski commented 3 years ago

Yes this is a sensible alternative. However we may face a challenges of redefining LSP if we go for notebook format, for example all edit operations would need to use a different coordinates system (cell:line:column rather than line:column).

Also, the percent format would make expanding static analysis as easy as adding a plugin to pylint, black or whatever existing tools there is. For notebook format we would be building it from scratch, but we would only need to build it once, not for every language. Also, static analysis on the level of notebooks represented in JSON does not sound bad at all.

krassowski commented 3 years ago

The big thing holding us back from that is our Document vs Language Server cardinality

yes, we would ideally support multiple servers per document and multiple servers per language.

bollwyvl commented 3 years ago

different coordinates system

I think line:col would work fine... this language server would just know how to do JSON-y kinds of things against the raw JSON text, and hand back either text-based diffs. Then we have a way to back them out to JSON diffs on the frontend. Or wherever makes sense. Might need to ensure there exists a byte-for-byte equivalent nbformat JSON encoder in typescript, but it seems like this should exist, anyway.

Let's say we did hoist the concept of cells (as a token type, for example): adapting efm/dls-style configuration would allow for cell-at-a-time or all-cells-in-a-document kinds of manipulation... or even pipelines, e.g. echo {} | marked | hunspell.

JoaoFelipe commented 3 years ago

@krassowski Thanks for tagging me. I'm a bit busy this month with my thesis, so I won't be able to follow the discussion closely, but I can send you a preliminary report on the Julynter experiment results (it is still under review) and answer any questions about Julynter.

mwouts commented 3 years ago

Hi everyone, thanks for including me in the conversation! I am just discovering many of the subjects mentioned above, yet I'd be happy to help, and I am personally interested in the subject of helping notebook users following better coding practices.

I have no opinion of which is best for you, between interacting with the JSON format directly, or interacting with the script representation. Obviously interacting with the notebook itself is more direct. But interacting with the script may be reusable in other contexts. My personal experience is that I am glad to run black and isort on notebooks through jupytext.

Also I understand that if you interact with the text version of the notebook, you want to be able to map this to the notebook context. We don't have yet a location translator (cell:line:column to line:column in the text file, and back), but that might be something that we could consider. Also, maybe you will want the translation between the two formats to be done in TypeScript (and thus, directly in the lab extension) rather than in Python - I don't feel that I have enough expertise to code that but I think that would be a very useful addition to Jupytext.

agoose77 commented 3 years ago

@bollwyvl I'm very new to this space, so forgive me if I have got this upside down.

The big thing holding us back from that is our Document vs Language Server cardinality:

basically, one "root" language server, derived by file type and/or kernel metadata

any number of "nested" language servers

This probably needs to be a full-on architecture PR with drawings pouting_cat

For text documents, this works okay, but we still want, for example, pyls and jedils, on the same document (with fine-grained tuning of which language server provides which feature).

It sounds to me like the "issues" are as follows:

One notebook can have several different languages present
The notebook itself is a JSON document with its own schema

I wonder whether we need to establish the concept of "views" on the notebook - virtual documents that correspond to a particular language server. This would be performed by a backend component that transforms between the virtual-document interface of the language server, and the cell-based notebook of the frontend. Clearly, this is a one-to-many model.

I know that @krassowski has already touched upon this, but I think such a separation is important because it would facilitate things like polyglot notebooks. I don't know how easy this would be, but for simple polyglot support (reference global scope variables), the custom backend would be able to declare prototype values for inter-language variables in the virtual document, e.g.

var x = [1, 2, 3];

# prototype
x = []

In this example, the backend would produce two different documents (JS and Python) composed of interleaved cell contents and prototypes. This might be a complete pile of rubbish, but it sounds plausible so I'm running with it :laughing:

In Jupyter's case, this would also mean the user could perhaps have LSP features on Markdown code blocks. For example, if we treated Markdown code blocks as "self-contained" and "non-positional", then there would be three views: one for the concatenated Python cells, and one for each markdown block. For each markdown-block-view, the virtual document would consist of the concatenated Python cell contents + the particular markdown code block contents.

I imagine this gets tricky when handling richer features like rename-variable. I suppose the solution would be to broadcast these sorts of events across each view, e.g.

def on_rename_definition(ctx):
    for view in self.get_views_containing(ctx.cell):
        self.send_lsp_event("rename-definition", view.id, ctx.location, ctx.token)

and then deduplicate the responses per-cell. Again, another big ":warning: I don't know what I'm talking about" warning

krassowski commented 3 years ago

I know that has already touched upon this, but I think such a separation is important because it would facilitate things like polyglot notebooks. I don't know how easy this would be, but for simple polyglot support (reference global scope variables), the custom backend would be able to declare prototype values for inter-language variables in the virtual document, e.g.

To get you up to speed quickly: yes, we already do something like that ;) It is not perfect and frontend-side only, see extractors (which is a misnomer because they do not actually remove the content from the parent document). Currently each VirtualDocument remembers "foreign documents" (things that were extracted out of it), and there is one "root" virtual document which corresponds to a notebook. It does not have to stay this way (and the code is overdue a refactor anyways). We formally also refer to those as transclusions.

In Jupyter's case, this would also mean the user could perhaps have LSP features on Markdown code blocks. For example, if we treated Markdown code blocks as "self-contained" and "non-positional", then there would be three views: one for the concatenated Python cells, and one for each markdown block. For each markdown-block-view, the virtual document would consist of the concatenated Python cell contents + the particular markdown code block contents.

Yes, this is one of things we very much want to include in future see #19.

I imagine this gets tricky when handling richer features like rename-variable

Exactly; after trying to implement full-blown refactoring I came to conclusion that each cell has to have an identifier like a jupytext-style comment and each transclusion must have a similar anchor in the parent document allowing to encode all of its information. Having metadata information embedded in the comments could also allow for polyglot transclusion for @BoPeng's SOS (#282).

agoose77 commented 3 years ago

@krassowski instead of embedding this information in the virtual document contents, can't we maintain that information separately? For example, in my use case for literary, I have AST transformers that merge multiple cells together (the @patch decorator for monkeypatching classes with top-level defined functions). Having in-virtual-document directives would become problematic if these transforms become finer grained (e.g. per token).

I will refer to a collection of transclusions to form a virtual document as a "view" over a notebook.

For the existing functionality, a view would just need to maintain an internal one-to-one (non-surjective) map from the virtual document to a given cell location in the original notebook. Assuming we only needed line numbers for simplicity, this would just be


class NaiveView:
    def build_virtual_document(self, nb):
        lines = []
        line_to_cell_location = []

        for cell in nb.cells:
            if not is_code(cell):
                continue

            for i, line in enumerate(cell.source):
                lines.append(line)
                line_to_cell_location.append((cell.id, i))

krassowski commented 3 years ago

Having in-virtual-document directives would become problematic if these transforms become finer grained (e.g. per token).

Could you expand on this one?

can't we maintain that information separately?

My hunch is that it won't be possible because the language server should be free to rearrange the code in any way it wants (reformat, move some lines to another editor, add some lines, etc) and it is free not tell us what it intends to do, but often it will just give us an end result (reformatted source file). Without having e.g. comments indicating cell breaks it becomes effectively impossible to map the modified source code back to the notebook cells (if there are no comments indicating where each cell begins).

For the existing functionality, a view would just need to maintain an internal one-to-one (non-surjective) map from the virtual document to a given cell location in the original notebook. Assuming we only needed line numbers for simplicity, this would just be

I used this strategy and originally tought it is a good idea, but it does not work with complex refactor/reformatting operations (because as mentioned above the language server is free to rearrange the inlined document in any way it wants without providing a map of which lines moved where).

agoose77 commented 3 years ago

Having in-virtual-document directives would become problematic if these transforms become finer grained (e.g. per token).

Could you expand on this one?

@krassowski I think perhaps I'm over-thinking this. I was conceiving of "transforms" that are not simple transclusions, and add tokens to the document. This isn't actually something any logical use of LSP should support on reflection.

I used this strategy and originally tought it is a good idea, but it does not work with complex refactor/reformatting operations (because as mentioned above the language server is free to rearrange the inlined document in any way it wants without providing a map of which lines moved where).

Right, so you want to keep some in-document anchors for which there will already be some structural stability (i.e. comments in most (all?) languages). The nice thing about doing this in a "Notebook adaptor" backend is that this is all just implementation details :)