Also I would love to hear back:
I guess a "Jupyter Notebook Format Language Server" is not a terrible idea, and I'd rather do something in our wheel house (of standards-based JSON) than more custom text integrations (e.g. jupytext).
As a text encoding, canonical notebook format JSON has:
so it is certainly good enough for LSP-level pointers.
So now we would just need this server (which speaks concrete line/col numbers) connected to our squishy JSON document "good enough" to show:
The big thing holding us back from that is our Document vs Language Server cardinality:
This probably needs to be a full-on architecture PR with drawings :pouting_cat:
For text documents, this works okay, but we still want, for example, pyls and jedils, on the same document (with fine-grained tuning of which language server provides which feature).
ANYHOW, imagining we could: a rough demonstration of it today would be configuring the json-language-server to use the nbformat JSON schema for *.ipynb files. This would give you diagnostics on, e.g., an execution_count missing from a code_cell, and probably some limited other features. It would not coordinate very well with the JSON-model-based features, but I'm pretty sure that could be made "less surprising," generally.
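For instance, the settings payload (sent via workspace/didChangeConfiguration) might look something like the following, shown here as a Python dict. The json.schemas shape follows the VS Code JSON language server's documented settings; the schema URL is an assumption to be checked against the nbformat repository:

json_language_server_settings = {
    "json": {
        "validate": {"enable": True},
        "schemas": [
            {
                "fileMatch": ["*.ipynb"],
                # assumed location of the canonical nbformat v4 schema
                "url": "https://raw.githubusercontent.com/jupyter/nbformat/main/nbformat/v4/nbformat.v4.schema.json",
            }
        ],
    }
}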
A more fully-built one would probably be built on pygls and nbformat directly, connected with the text representation of the document.
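For illustration, a minimal sketch of such a server, assuming pygls 1.x (where the protocol types come from lsprotocol) and using nbformat's own validator to emit diagnostics; the server name is a placeholder, and anchoring every diagnostic at the top of the file is a deliberate simplification:

import nbformat
from lsprotocol.types import (
    TEXT_DOCUMENT_DID_OPEN,
    Diagnostic,
    DiagnosticSeverity,
    DidOpenTextDocumentParams,
    Position,
    Range,
)
from pygls.server import LanguageServer

server = LanguageServer("nbformat-language-server", "v0.1")  # placeholder name

@server.feature(TEXT_DOCUMENT_DID_OPEN)
def did_open(ls: LanguageServer, params: DidOpenTextDocumentParams):
    diagnostics = []
    try:
        nb = nbformat.reads(params.text_document.text, as_version=4)
        nbformat.validate(nb)
    except Exception as error:
        # naive: a real server would map the failing JSON path to a line/column
        diagnostics.append(
            Diagnostic(
                range=Range(start=Position(line=0, character=0),
                            end=Position(line=0, character=1)),
                message=str(error),
                severity=DiagnosticSeverity.Warning,
            )
        )
    ls.publish_diagnostics(params.text_document.uri, diagnostics)

server.start_io()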
Yes, this is a sensible alternative. However, we may face the challenge of redefining LSP if we go for the notebook format; for example, all edit operations would need to use a different coordinate system (cell:line:column rather than line:column).
Also, the percent format would make expanding static analysis as easy as adding a plugin to pylint, black, or whatever existing tools there are. For the notebook format we would be building it from scratch, but we would only need to build it once, not for every language. Also, static analysis at the level of notebooks represented as JSON does not sound bad at all.
The big thing holding us back from that is our Document vs Language Server cardinality
yes, we would ideally support multiple servers per document and multiple servers per language.
different coordinates system
I think line:col would work fine... this language server would just know how to do JSON-y kinds of things against the raw JSON text, and hand back text-based diffs. Then we have a way to back them out to JSON diffs on the frontend, or wherever makes sense. Might need to ensure there exists a byte-for-byte equivalent nbformat JSON encoder in TypeScript, but it seems like this should exist anyway.
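On the Python side, one can at least check today whether a given file survives a parse/serialize round-trip through nbformat; a byte-for-byte TypeScript encoder would need to match this output exactly:

import nbformat

def survives_roundtrip(path: str) -> bool:
    """Does nbformat re-serialize this notebook byte-for-byte?"""
    with open(path, encoding="utf-8") as f:
        original = f.read()
    nb = nbformat.reads(original, as_version=4)
    # nbformat.writes emits no trailing newline, so tolerate that difference
    return nbformat.writes(nb) in (original, original.rstrip("\n"))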
Let's say we did hoist the concept of cells (as a token type, for example): adapting efm/dls-style configuration would allow for cell-at-a-time or all-cells-in-a-document kinds of manipulation... or even pipelines, e.g. echo {} | marked | hunspell.
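As a rough Python illustration of the pipeline idea (hunspell's -l flag lists misspelled words read from stdin; treating one markdown cell as one pipeline input is an assumption):

import subprocess
import nbformat

def spellcheck_markdown_cells(path):
    nb = nbformat.read(path, as_version=4)
    for index, cell in enumerate(nb.cells):
        if cell.cell_type != "markdown":
            continue
        # hunspell -l prints one misspelled word per line of stdin
        result = subprocess.run(
            ["hunspell", "-l"],
            input=cell.source, capture_output=True, text=True,
        )
        for word in result.stdout.splitlines():
            print(f"cell {index}: suspicious word {word!r}")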
@krassowski Thanks for tagging me. I'm a bit busy this month with my thesis, so I won't be able to follow the discussion closely, but I can send you a preliminary report on the Julynter experiment results (it is still under review) and answer any questions about Julynter.
Hi everyone, thanks for including me in the conversation! I am just discovering many of the subjects mentioned above, yet I'd be happy to help, and I am personally interested in the subject of helping notebook users follow better coding practices.
I have no opinion on which is best for you: interacting with the JSON format directly, or interacting with the script representation. Obviously interacting with the notebook itself is more direct, but interacting with the script may be reusable in other contexts. My personal experience is that I am glad to run black and isort on notebooks through jupytext.
Also, I understand that if you interact with the text version of the notebook, you want to be able to map this to the notebook context. We don't yet have a location translator (cell:line:column to line:column in the text file, and back), but that might be something we could consider. Also, maybe you will want the translation between the two formats to be done in TypeScript (and thus directly in the lab extension) rather than in Python. I don't feel that I have enough expertise to code that, but I think it would be a very useful addition to Jupytext.
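To sketch what such a translator could look like for the percent format, assuming for simplicity exactly one "# %%" header line per cell (which glosses over jupytext's file headers and per-language comment prefixes):

def build_line_map(cell_sources):
    """Map script line -> (cell index, line within cell), 0-based."""
    line_map = {}
    script_line = 0
    for cell_index, source in enumerate(cell_sources):
        script_line += 1  # the "# %%" header has no notebook counterpart
        for cell_line in range(len(source.splitlines())):
            line_map[script_line] = (cell_index, cell_line)
            script_line += 1
    return line_map

# Columns are unchanged by the translation, so (line, column) in the script
# maps to (cell, line', column) in the notebook via this table, and the
# reverse direction is just the inverted dict.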
@bollwyvl I'm very new to this space, so forgive me if I have got this upside down.
The big thing holding us back from that is our Document vs Language Server cardinality:
- basically, one "root" language server, derived by file type and/or kernel metadata
- any number of "nested" language servers
This probably needs to be a full-on architecture PR with drawings :pouting_cat:
For text documents, this works okay, but we still want, for example, pyls and jedils, on the same document (with fine-grained tuning of which language server provides which feature).
It sounds to me like the "issues" are as follows:
I wonder whether we need to establish the concept of "views" on the notebook - virtual documents that correspond to a particular language server. This would be performed by a backend component that transforms between the virtual-document interface of the language server, and the cell-based notebook of the frontend. Clearly, this is a one-to-many model.
I know that @krassowski has already touched upon this, but I think such a separation is important because it would facilitate things like polyglot notebooks. I don't know how easy this would be, but for simple polyglot support (reference global scope variables), the custom backend would be able to declare prototype values for inter-language variables in the virtual document, e.g.
var x = [1, 2, 3];

# prototype
x = []
In this example, the backend would produce two different documents (JS and Python) composed of interleaved cell contents and prototypes. This might be a complete pile of rubbish, but it sounds plausible so I'm running with it :laughing:
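To make that concrete, a toy sketch of the backend described above, tied to the x example: it renders the Python view of a notebook, replacing foreign-language cells with prototype assignments. The (language, source) pair shape is hypothetical, and deriving real prototypes is the hard part:

def python_view(cells):
    """Render the Python virtual document from (language, source) pairs."""
    lines = []
    for language, source in cells:
        if language == "python":
            lines.append(source)
        else:
            # placeholder prototype for the `x` defined in the JS cell above;
            # a real backend would inspect the foreign cell to generate these
            lines.append("# prototype")
            lines.append("x = []")
    return "\n".join(lines)

# e.g. python_view([("javascript", "var x = [1, 2, 3];"), ("python", "print(x)")])
# yields a Python document in which `x` is visible to the language server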
In Jupyter's case, this would also mean the user could perhaps have LSP features on Markdown code blocks. For example, if we treated Markdown code blocks as "self-contained" and "non-positional", then there would be three views: one for the concatenated Python cells, and one for each markdown block. For each markdown-block-view, the virtual document would consist of the concatenated Python cell contents + the particular markdown code block contents.
I imagine this gets tricky when handling richer features like rename-variable. I suppose the solution would be to broadcast these sorts of events across each view, e.g.
def on_rename_definition(self, ctx):
    for view in self.get_views_containing(ctx.cell):
        self.send_lsp_event("rename-definition", view.id, ctx.location, ctx.token)
and then deduplicate the responses per-cell. Again, another big ":warning: I don't know what I'm talking about" warning
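The per-cell deduplication itself could be as simple as collecting the edits every view reports and keeping one copy per notebook location; the (cell_id, start, end, new_text) tuple shape here is hypothetical:

def deduplicate_edits(edits):
    # edits: iterable of hypothetical (cell_id, start, end, new_text) tuples
    # gathered from every view's response to the broadcast above
    seen = set()
    unique = []
    for edit in edits:
        if edit not in seen:
            seen.add(edit)
            unique.append(edit)
    return unique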
I know that @krassowski has already touched upon this, but I think such a separation is important because it would facilitate things like polyglot notebooks. I don't know how easy this would be, but for simple polyglot support (reference global scope variables), the custom backend would be able to declare prototype values for inter-language variables in the virtual document, e.g.
To get you up to speed quickly: yes, we already do something like that ;) It is not perfect and frontend-side only, see extractors (which is a misnomer because they do not actually remove the content from the parent document). Currently each VirtualDocument remembers "foreign documents" (things that were extracted out of it), and there is one "root" virtual document which corresponds to a notebook. It does not have to stay this way (and the code is overdue for a refactor anyway). We formally also refer to those as transclusions.
In Jupyter's case, this would also mean the user could perhaps have LSP features on Markdown code blocks. For example, if we treated Markdown code blocks as "self-contained" and "non-positional", then there would be three views: one for the concatenated Python cells, and one for each markdown block. For each markdown-block-view, the virtual document would consist of the concatenated Python cell contents + the particular markdown code block contents.
Yes, this is one of the things we very much want to include in the future, see #19.
I imagine this gets tricky when handling richer features like rename-variable
Exactly; after trying to implement full-blown refactoring, I came to the conclusion that each cell has to have an identifier like a jupytext-style comment, and each transclusion must have a similar anchor in the parent document allowing all of its information to be encoded. Having metadata embedded in the comments could also allow for polyglot transclusion for @BoPeng's SOS (#282).
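For illustration, anchors like that could look as follows; the comment syntax and the id= encoding are assumptions, loosely modeled on the jupytext percent style:

def render_with_anchors(nb):
    """Inline code cells into one virtual document, one anchor comment per cell."""
    lines = []
    for cell in nb.cells:
        if cell.cell_type != "code":
            continue
        lines.append(f"# %% id={cell.id}")
        lines.extend(cell.source.splitlines())
    return "\n".join(lines)

def split_by_anchors(text):
    """Recover {cell_id: source} after the server rewrote the document.

    Works even if the server reordered or reformatted lines, as long as it
    preserved the anchor comments (which well-behaved formatters treat as
    ordinary comments).
    """
    cells, current = {}, None
    for line in text.splitlines():
        if line.startswith("# %% id="):
            current = line[len("# %% id="):]
            cells[current] = []
        elif current is not None:
            cells[current].append(line)
    return {cell_id: "\n".join(src) for cell_id, src in cells.items()}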
@krassowski instead of embedding this information in the virtual document contents, can't we maintain that information separately? For example, in my use case for literary, I have AST transformers that merge multiple cells together (the @patch decorator for monkeypatching classes with top-level defined functions). Having in-virtual-document directives would become problematic if these transforms become finer grained (e.g. per token).
I will refer to a collection of transclusions that together form a virtual document as a "view" over a notebook.
For the existing functionality, a view would just need to maintain an internal one-to-one (non-surjective) map from the virtual document to a given cell location in the original notebook. Assuming we only needed line numbers for simplicity, this would just be
def is_code(cell):
    return cell.cell_type == "code"

class NaiveView:
    def build_virtual_document(self, nb):
        lines = []
        line_to_cell_location = []  # virtual line -> (cell id, line within cell)
        for cell in nb.cells:
            if not is_code(cell):
                continue
            # cell.source is a single string in nbformat's in-memory model,
            # so split it rather than iterating over characters
            for i, line in enumerate(cell.source.splitlines()):
                lines.append(line)
                line_to_cell_location.append((cell.id, i))
        return "\n".join(lines), line_to_cell_location
Having in-virtual-document directives would become problematic if these transforms become finer grained (e.g. per token).
Could you expand on this one?
can't we maintain that information separately?
My hunch is that it won't be possible, because the language server should be free to rearrange the code in any way it wants (reformat, move some lines to another editor, add some lines, etc.), and it is free not to tell us what it intends to do; often it will just give us an end result (a reformatted source file). Without e.g. comments indicating where each cell begins, it becomes effectively impossible to map the modified source code back to the notebook cells.
For the existing functionality, a view would just need to maintain an internal one-to-one (non-surjective) map from the virtual document to a given cell location in the original notebook. Assuming we only needed line numbers for simplicity, this would just be
I used this strategy and originally thought it was a good idea, but it does not work with complex refactor/reformatting operations (because, as mentioned above, the language server is free to rearrange the inlined document in any way it wants without providing a map of which lines moved where).
Having in-virtual-document directives would become problematic if these transforms become finer grained (e.g. per token).
Could you expand on this one?
@krassowski I think perhaps I'm over-thinking this. I was conceiving of "transforms" that are not simple transclusions, and add tokens to the document. On reflection, this isn't actually something any logical use of LSP should support.
I used this strategy and originally thought it was a good idea, but it does not work with complex refactor/reformatting operations (because as mentioned above the language server is free to rearrange the inlined document in any way it wants without providing a map of which lines moved where).
Right, so you want to keep some in-document anchors for which there will already be some structural stability (i.e. comments in most (all?) languages). The nice thing about doing this in a "Notebook adaptor" backend is that this is all just implementation details :)
What are you trying to do?
The Julynter experiment #378 demonstrated how notebook-specific IDE features could work. The ideas would include:
How is it done today, and what are the limits of current practice?
What is new in your approach and why do you think it will be successful?
In order to make the language servers (optionally) cell-aware, I propose we embrace the jupytext percent format: # %% for code cells and # %% [markdown] for markdown cells, as shown below. We could store the cell execution number in the metadata (key="value") part. We would allow the user to disable this feature. By embracing the percent format we could achieve some convergence with the Microsoft and PyCharm notebook extensions, assuming they would like this idea; the former is important if we want to get this into the LSP standard. As the comment syntax differs between languages, we should probably closely collaborate with jupytext (@mwouts), as they already handle this problem for a number of languages.
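For concreteness, a Python notebook rendered this way might look like the following, where the execution_count metadata key is the proposed extension, not something jupytext emits today:

# %% [markdown]
# ## Load the data

# %% execution_count="3"
import pandas as pd
df = pd.read_csv("data.csv")

# %% execution_count="4"
df.describe()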
Having the right format at the language server we would then:
On the frontend we would need to:
What are the risks?
It would be a shame if someone else developed a completely different standard without reaching out to us, leaving us with more fragmentation. Let's try to keep developers of other tools in the loop once the idea matures a bit, and announce it on the appropriate standards forums.
How long will it take?
I have no spare resources to allocate to this myself, as it is low priority compared to my research. It would largely remain in the planning phase until the next holidays, when I will possibly have more time to implement it, unless someone else who is interested picks it up earlier (contributions welcome).