Notebook support for `TextDocument`

dhruvmanila commented 11 months ago

Hello!

Thank you for adding support for Notebook related features in pygls!

I'm currently working on adding support for the same in ruff-lsp but currently facing a problem in how it's represented.

Context

Ruff has builtin support for linting and formatting Jupyter Notebooks. Now, especially for linting, we extract out all the Python source code, concatenate them, store some metadata (not relevant here) and pass the entire source code to our linter engine. Once we get the diagnostics, we map the location back to the Notebook as per the stored metadata and display it to the user.

The reason we require the concatenated source code from every cell instead of linting each cell independently is because of the context. If an import is defined in one cell and it's used in the other cell, we need to make sure that Ruff doesn't trigger an unused-import violation. This would only work if Ruff knows that the import is defined in another cell.

This means that Ruff expects the file representing the notebook in JSON format as per the specification. We pass in the file content through the Ruff LSP.

Problem

I hope the above context is enough to understand the problem but if not please do ask any relevant questions.

There are two documents now - TextDocument (pygls internal representation) and NotebookDocument (coming directly from lsprotocol.types module). The TextDocument is maintained by pygls and the source code is updated to reflect the one in the editor irrespective of the file content. This is important for performing linting on every keystroke as the file maybe not be saved yet.

But, with a NotebookDocument there's no internal representation. Now, as linting requires the JSON document, we can use the TextDocument representation using the notebook URI to do the same. This would mean that the TextDocument.source field points to the JSON string of the file. The problem comes with synchronization. The TextDocument won't be synced because it's only used to sync a Document and not a NotebookDocument. The way this could be temporarily solved is by re-creating the TextDocument on every change of the notebook:

@LSP_SERVER.feature(NOTEBOOK_DOCUMENT_DID_SAVE)
async def did_save_notebook(params: DidSaveNotebookDocumentParams) -> None:
    LSP_SERVER.workspace.remove_text_document(notebook_uri)
    text_document = LSP_SERVER.workspace.get_text_document(notebook_uri)
    ...

@LSP_SERVER.feature(NOTEBOOK_DOCUMENT_DID_CHANGE)
async def did_change_notebook(params: DidChangeNotebookDocumentParams) -> None:
    LSP_SERVER.workspace.remove_text_document(notebook_uri)
    text_document = LSP_SERVER.workspace.get_text_document(notebook_uri)
    ...

But, this would only work for DidSave event and not the DidChange event. The reason being that we would be working with the raw file content.

Possible solution

One solution could be to maintain the dictionary representing the JSON content of the notebook in the TextDocument which then should be synced as per the change events. The source property would then return the JSON string of the notebook which is always in synced with the editor. Or, it could just return the dictionary copy (to avoid accidental mutation by the user) and let the user decide if they want it as string or not.

alcarney commented 11 months ago

This means that Ruff expects the file representing the notebook in JSON format as per the specification.

Which specification is that sorry... I'm assuming nbformat?

Unfortunately, unless you convince the client to treat a notebook as a regular JSON file, I don't think it's possible to reliably access the raw JSON representation of a notebook - (someone please correct me if I'm wrong!). While treating a notebook as a regular TextDocument you can access the file itself, it will only work for file:// URIs - and you've already discovered this will not be in sync with the client.

Instead, the client sends us combination of TextDocuments (one for each cell) and a parent NotebookDocument which carries the metadata and overall structure (see spec), which hopefully is enough to reconstruct something that resembles the original JSON representation of the notebook.

Here I have a toy example that displays the reconstructed notebook JSON on hover

@server.feature(types.TEXT_DOCUMENT_HOVER)
def hover(ls: LanguageServer, params: types.HoverParams):
    cell_uri = params.text_document.uri
    notebook_doc = ls.workspace.get_notebook_document(cell_uri=cell_uri)
    if notebook_doc is None:
        return

    cells = []
    for notebook_cell in notebook_doc.cells:
        cell_document = ls.workspace.get_text_document(notebook_cell.document)
        cells.append(
            dict(
                cell_type=notebook_cell.kind.name.lower(),
                metadata=notebook_cell.metadata,
                source=cell_document.source,
            )
        )

    notebook_json = dict(
        metadata=notebook_doc.metadata,
        cells=cells,
    )

    return types.Hover(contents=f"```\n{json.dumps(notebook_json, indent=2)}\n```")

Which results in

Does that help you at all?

The current implementation is very much a 1:1 mapping of the LSP spec. I'm sure once we realise what server authors need pygls will grow some quality of life APIs to make some of this a bit easier :)

dhruvmanila commented 11 months ago

Which specification is that sorry... I'm assuming nbformat?

Yes, sorry for not mentioning that.

Does that help you at all?

I think this could help. Ruff mainly only requires the source code of each cells. But I see one problem - we use the nbformat and nbformat_minor to verify the version and maintain schema consistency. Actually, now that I think of it, it doesn't matter in this context as we aren't going to write the JSON back. We only require the diagnostics information (for the linter) or the formatted source code (for the formatter).

Thanks for spending time on this and providing feedback. I'll try this out today but feel free to close this issue :)

dhruvmanila commented 11 months ago

The idea of the server constructing the JSON string works. Thanks!

alcarney commented 11 months ago

No worries!

openlawlibrary / pygls