Implement code overrides matching via tokens parsing

krassowski commented 4 years ago

What are you trying to do?

Add token-based code overrides. Code overrides make the existing language servers work with non-standard syntax which kernels add on top of the underlying language. For example:

IPython (a Jupyter kernel for Python) implements so-called magics, which are prefixed by various special characters that would otherwise be invalid syntax in pure Python. An example IPython magic is %ls (variable names cannot start with % in Python, so IPython uses this prefix to mark some of its magics). The IPython interpreter modifies the Abstract Syntax Tree, executing a special action when such a symbol is found: it calls ipython.run_line_magic('ls', '') where ipython represents the global IPython instance. This call returns a value and can have side effects. It is possible to replace this magic with valid Python via the proposed code override mechanism, as the valid pure Python equivalent would be:

from IPython import get_ipython
get_ipython().run_line_magic('ls', '')

The reverse code replacement translates the above back to %ls. This is needed to make the LSP functions which modify the document work (e.g. rename/reformat/quick fix).
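To make the round trip concrete, here is a minimal regular-expression sketch of a forward and a reverse replacement for simple line magics such as %ls. The function names to_python and to_magic are hypothetical, the import line is left out for brevity, and arguments containing quotes are not handled; it only illustrates the idea rather than the actual implementation.

```python
import re

# Matches a whole line such as "%ls" or "%ls -la".
LINE_MAGIC = re.compile(r"^%(?P<name>\w+)[ \t]*(?P<args>.*)$")
# Matches the Python equivalent generated by to_python() below.
REPLACED = re.compile(
    r"^get_ipython\(\)\.run_line_magic\('(?P<name>\w+)', '(?P<args>.*)'\)$"
)


def to_python(line: str) -> str:
    """Forward replacement: turn a line magic into valid Python."""
    m = LINE_MAGIC.match(line)
    if m is None:
        return line
    return f"get_ipython().run_line_magic('{m['name']}', '{m['args']}')"


def to_magic(line: str) -> str:
    """Reverse replacement: restore the original magic syntax."""
    m = REPLACED.match(line)
    if m is None:
        return line
    return f"%{m['name']} {m['args']}".rstrip()


print(to_python("%ls"))            # get_ipython().run_line_magic('ls', '')
print(to_magic(to_python("%ls")))  # %ls
```

Lines that are not recognised are returned unchanged in both directions, so ordinary Python passes through untouched.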

Some magics take variables as arguments. For example, %store data would save the value of data variable. A customised code replacement for this IPython magic would call a function using this variable, e.g.:

from IPython import get_ipython
get_ipython().run_line_magic('store', (data))  # mypy: ignore type check

This has several advantages:

if the user renames the data variable elsewhere in the code using LSP, the reference in the magic will be renamed too
the linter will see that the variable is referenced at least once and will not warn about an unused variable
if the variable is not defined, a warning will be shown
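These advantages follow from data appearing as a real name in the replaced code rather than inside a string. A small check with Python's standard ast module illustrates the difference; the helper referenced_names is a hypothetical name used only for this sketch.

```python
import ast


def referenced_names(code: str) -> list:
    """Return the sorted identifiers that appear as names in the code."""
    tree = ast.parse(code)
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})


# The argument is passed as a string: static analysis never sees `data`.
print(referenced_names("get_ipython().run_line_magic('store', 'data')"))
# ['get_ipython']

# The argument is passed as a variable: `data` is a visible reference.
print(referenced_names("get_ipython().run_line_magic('store', (data))"))
# ['data', 'get_ipython']
```

Any tool that works on the syntax tree (rename, unused-variable checks, undefined-name checks) can only act on the second form.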

How is it done today?

We only support regular expression-based code overrides. This was initially a "band-aid" to make LSP work with IPython, but we are hitting the limitations of regular expressions.

What are the limitations of using regular expressions?

screening code blocks with multiple complex regular expressions is expensive
regular expressions cannot distinguish code from escaped text such as string literals (a fundamental limitation of the grammar: they have no memory of the surrounding context)
- an example of code in which the magic cannot be correctly substituted: x['a = !ls'] = !ls (possibly more apparent on: x['\'a = !ls\''] = !ls)
  - explanation: the string fragment which does NOT contain a magic would be incorrectly substituted by regular expression:
    - we cannot just disallow matching ' and " as x['a'] = !ls is a valid expression with magic
    - we cannot start parsing from = as ! and % magics have to start from a new line or an assignment; for example, x == !ls is not a valid magic
  - every magic in a docstring (a multi-line comment in Python) is currently prone to incorrect substitution, as regular expressions are not context-aware (the context being the docstring)
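The string-literal problem above can be reproduced in a few lines. The sketch below compares a naive regular-expression substitution with a version that first separates string literals from the rest of the code. The function names are hypothetical, the "tokenizer" is deliberately crude (single-line string literals only, so docstrings and the x == !ls case are not handled), and get_ipython().getoutput(...) is used as the Python equivalent of a shell assignment.

```python
import re

STRING_LITERAL = re.compile(
    r"'(?:\\.|[^'\\])*'"   # single-quoted literal, allowing backslash escapes
    r'|"(?:\\.|[^"\\])*"'  # double-quoted literal, allowing backslash escapes
)
SHELL_ASSIGNMENT = re.compile(r"=\s*!(\w+)")
REPLACEMENT = r"= get_ipython().getoutput('\1')"


def naive_override(code: str) -> str:
    """Substitute every '= !cmd', even inside string literals."""
    return SHELL_ASSIGNMENT.sub(REPLACEMENT, code)


def token_aware_override(code: str) -> str:
    """Substitute '= !cmd' only outside of string literals."""
    pieces = []
    position = 0
    for literal in STRING_LITERAL.finditer(code):
        pieces.append(naive_override(code[position:literal.start()]))
        pieces.append(literal.group(0))  # keep string contents untouched
        position = literal.end()
    pieces.append(naive_override(code[position:]))
    return "".join(pieces)


code = "x['a = !ls'] = !ls"
print(naive_override(code))
# x['a = get_ipython().getoutput('ls')'] = get_ipython().getoutput('ls')
print(token_aware_override(code))
# x['a = !ls'] = get_ipython().getoutput('ls')
```

The naive result is no longer valid Python, while the token-aware result leaves the dictionary key intact.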

The proposal

Each kernel should be able to provide a JSON (file or message) with the default code overrides, which would be defined in terms of tokens. With a good token parser we can hugely reduce the computational complexity and work with syntax which has to be parsed as a contextual grammar.
The users will still be able to install plugins providing extra code overrides for the libraries which define magics with additional side effects (such as assignment to variables or references to variables in the %store magic; for examples see rpy2 magics or ipython-sql magics).

Why should it be declared by the kernel? Should not the LSP servers work harder to adapt to the language variation?

the latter could impede adoption of LSP, as writing your own LSP server is a challenge.
extending existing servers could be feasible in principle, but not in practice. For example, the Python servers would now need to install IPython to parse the code (this is ok); then for pyls, which is jedi, black, mypy, pyflakes, flake8 etc. bundled together, each of the underlying tools would need to be adapted to work with IPython, which is a huge task.
- adding a new tool to pyls would mean it had to be adapted for IPython
- adding a new magic to IPython would trigger a need for updates in dozens of packages
there is a ton of custom user-defined magics which would benefit greatly from custom overrides; these would not be possible to be handled upstream, as most magics are not known about by the kernel developers, and even less so by the static analysis tool developers.

Now, I believe that magics are fun. Magics are what makes IPython and other kernels easier to work with in the interactive setting. I would like to enable kernel developers, especially developers of kernels for languages which are not so well established (I do not expect to see huge changes in IPython magics), to add, change and refine magics without the fear that the cool language features will stop working for them.

Design notes

Before reading further, you may want to have a quick look at the current regular-expression-based implementation:

Design considerations:

it could be as simple as three JS callback functions: find_matches, replace, reverse_replace
we would prefer something which can be serialized to JSON and provided by the kernel easily
we would want to offer DocumentStart and DocumentEnd tokens, so that we can prepend the document with necessary code. Regular expressions cannot do that as ^ is either beginning of the document OR of the line. This would allow, for example:
- in IPython import the things which are available as IPython built-ins, e.g. get_ipython() function
- in C prepend with int main() { and append return 0; }
we need to have a better way of tracking the substitutions; currently the reverse transformation limits what we can remember about the form of the original expression. For example:
- two rpy2 magic calls, one with the -o argument and one with --output, will lead to the same result and will always be reversed to -o.
- similarly implementing %Rpull would conflict with the above
- int? and ?int are evaluated to the same code, so cannot be distinguished during reverse substitution
- therefore we want to have some kind of argument embedding in the comments; a simple "let's embed the whole thing in the comment" is not a solution, as we do want the code to be dynamic, variables referenced in magics to be properly renamed etc. A single system which can be used by every override would greatly simplify writing overrides. It could be encoded in comments with JSON (maybe collapsed to a single line) to be language agnostic. The actual responsibility of generating the comment would be left to the code override provider, as what is considered a valid comment differs between languages.
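As an illustration of the comment-embedding idea, the following sketch stores details of the original expression as single-line JSON in a trailing comment and reads them back for the reverse substitution. The marker override-args:, the function names and the form key are all hypothetical; comment suffixes and character escaping are left out to keep the sketch short.

```python
import json
from typing import Optional, Tuple

MARKER = "override-args:"  # hypothetical marker, not part of any existing format


def embed_arguments(code: str, original: dict, comment_prefix: str = "#") -> str:
    """Append the original form of the expression as one-line JSON in a comment."""
    payload = json.dumps(original, separators=(",", ":"), sort_keys=True)
    return f"{code}  {comment_prefix} {MARKER}{payload}"


def extract_arguments(line: str, comment_prefix: str = "#") -> Tuple[str, Optional[dict]]:
    """Split a line into the code and the embedded details (None if absent)."""
    code, separator, payload = line.rpartition(f"  {comment_prefix} {MARKER}")
    if not separator:
        return line, None
    return code, json.loads(payload)


# int? and ?int produce the same code; the comment remembers which form was used.
python_code = "get_ipython().run_line_magic('pinfo', 'int')"
line = embed_arguments(python_code, {"form": "suffix"})
print(line)
# get_ipython().run_line_magic('pinfo', 'int')  # override-args:{"form":"suffix"}
print(extract_arguments(line))
# ("get_ipython().run_line_magic('pinfo', 'int')", {'form': 'suffix'})
```

Because only the form of the expression goes into the comment, the code part stays live: a rename applied by the language server still changes the variables referenced there.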

Draft of the interface (to be converted to JSON schema once finished):

interface IArgumentStorageOptions {
  commentPrefix: string;
  commentSuffix?: string;
  /** Which characters should be escaped. */
  charactersToEscape: string[];
  /** Character(s) to be prepended before characters to be escaped */
  escapeCharacter: string;
}

/** Name of a token type as reported by the tokenizer, e.g. 'operator' or 'variable'. */
type ITokenType = string;

interface IToken {
  type: ITokenType;
  value: string;
}

interface ITokenMatcher {
  type: ITokenType;
  pattern?: string;
  /** If true, this is what should be extracted from the match (compare with regexp capturing groups) */
  capture?: boolean;
}

interface ITokenGroupMatcher {
  tokens: ITokenMatcher[];
  /** how many repetitions of the argument should be supported; can be Infinity to support any number */
  repeats: number;
}

interface IArgument {
  id: string;
  match: (ITokenMatcher | ITokenGroupMatcher)[];
}

interface ITokenizer {
  // TODO
}

interface IArgumentMatch {
  id: string;
  /** how to join values if multiple are captured under id */
  join?: string;
}

interface ITokenCodeOverride {
  tokenizer: ITokenizer;
  argumentStorage: IArgumentStorageOptions;
  match: (IToken | IArgument)[];
  replace: (string | IArgumentMatch)[];
  reverse: ITokenCodeOverride;
}

Example IPython magic for %store:

{
    tokenizer: python,
    argumentStorage: {
       commentPrefix: '#'
    },
    match: [
      {type: 'operator', pattern: '%'} as ITokenMatcher,
      {type: 'variable', pattern: 'store'} as ITokenMatcher,
      {
         id: 'store-argument',
         match: [ 
            {type: 'variable', capture: true} as ITokenMatcher,
            {
                tokens: [
                    {type: 'separator'} as ITokenMatcher,
                    {type: 'variable', capture: true} as ITokenMatcher
                ],
                repeats: Infinity
            } as ITokenGroupMatcher
         ]
       } as IArgument
   ],
   replace: [
      "get_ipython().run_line_magic('store', (",
      {id: 'store-argument', join: ', '} as IArgumentMatch,
      "))"
   ],
   reverse: {} // TODO
}
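To show how such a definition could be evaluated, here is a minimal Python sketch that applies the %store override to an already tokenized line. It rests on several assumptions that the draft does not settle: tokens are (type, value) pairs, pattern is compared as an exact string, whitespace between arguments arrives as a 'separator' token, and the matching logic for this one override is hard-coded rather than driven by the JSON.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (type, value); a hypothetical token representation


@dataclass
class TokenMatcher:
    type: str
    pattern: Optional[str] = None
    capture: bool = False

    def matches(self, token: Token) -> bool:
        token_type, value = token
        return token_type == self.type and self.pattern in (None, value)


def match_sequence(matchers: List[TokenMatcher], tokens: List[Token], start: int):
    """Match the matchers one-to-one from `start`; return (end, captured) or None."""
    captured = []
    position = start
    for matcher in matchers:
        if position >= len(tokens) or not matcher.matches(tokens[position]):
            return None
        if matcher.capture:
            captured.append(tokens[position][1])
        position += 1
    return position, captured


def override_store(tokens: List[Token]) -> Optional[str]:
    """Apply the %store override; return the replacement or None if no match."""
    head = [TokenMatcher("operator", "%"), TokenMatcher("variable", "store")]
    first = [TokenMatcher("variable", capture=True)]
    repeated = [TokenMatcher("separator"), TokenMatcher("variable", capture=True)]

    result = match_sequence(head + first, tokens, 0)
    if result is None:
        return None
    position, names = result
    # Equivalent of `repeats: Infinity`: consume the group as long as it matches.
    while (more := match_sequence(repeated, tokens, position)) is not None:
        position, extra = more
        names.extend(extra)
    return "get_ipython().run_line_magic('store', (" + ", ".join(names) + "))"


tokens = [
    ("operator", "%"), ("variable", "store"),
    ("variable", "data"), ("separator", " "), ("variable", "other"),
]
print(override_store(tokens))
# get_ipython().run_line_magic('store', (data, other))
```

A generic matcher would walk the match array from the JSON instead of the three hard-coded lists, but the capture and join steps would stay the same.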

Questions to consider

should we try to build our own ITokenizer, or should we:
- re-use tokenizers from CodeMirror? It is probably the easiest solution as it gives us an implementation for most languages at hand. The con is that other frontends would need to play along, but they need to use a tokenizer anyway, so why not stick to this one? Ideally there would be a package/standard which is about the tokens only and focuses on compliance with the language specification rather than on visuals/editing usability as the CodeMirror modes/tokenizers do...
- use ANTLR (they got a testimonial from Guido van Rossum among others ;)), which is a more dedicated tool for grammar specification and parsing; I have not looked in depth, but it might be giving just a syntax tree rather than linear tokens, which could be overkill for what we are trying to achieve here.
ideally the tokenizer would need to be specified once and then only referenced by id.
it would be nice of us to make a solution which could be adopted by other front-ends should they wish to, to prevent fragmentation of the ecosystem.

How to make this work for kernel developers?

Testing overrides would need to be made easy for the kernel developers.

we could host a website on GitHub pages with a playground where the developer provides text and JSON with ITokenCodeOverride and overrides are presented.
- we could define an ITestCase interface and accept an ITestCase[] on the static website to give developers a hint on how it all should work
for their use in CI we could make a conda package with a test runner accepting ITokenCodeOverride and ITestCase[]
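A test runner of this kind could be very small. The sketch below assumes a hypothetical shape for a test case (a name, the source text and the expected result) and treats the override as a plain function from text to text; neither is defined by the proposal so far.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OverrideTestCase:
    """Hypothetical stand-in for ITestCase: one input and its expected output."""
    name: str
    source: str
    expected: str


def run_test_cases(
    override: Callable[[str], str], cases: List[OverrideTestCase]
) -> List[str]:
    """Run every case and return a list of human-readable failure messages."""
    failures = []
    for case in cases:
        actual = override(case.source)
        if actual != case.expected:
            failures.append(
                f"{case.name}: expected {case.expected!r}, got {actual!r}"
            )
    return failures


def toy_override(source: str) -> str:
    """Toy override used only to exercise the runner."""
    return source.replace("%ls", "get_ipython().run_line_magic('ls', '')")


cases = [
    OverrideTestCase("line magic", "%ls", "get_ipython().run_line_magic('ls', '')"),
    OverrideTestCase("plain code", "x = 1", "x = 1"),
]
print(run_test_cases(toy_override, cases))  # []
```

In CI the runner would exit with a non-zero status when the returned list is not empty.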

Will this work?

Yes, this will technically work
Not sure - will kernel developers in the Jupyter community be willing to go this route, or would they prefer that an LSP server is implemented for each kernel / existing language servers are amended for each kernel?

blois commented 4 years ago

Are there languages other than Python that have cell magics? This almost feels like an implementation detail of the LSP that it should be supporting 'Notebook Python' but the underlying implementation only supports Python.

Could cell magics be implemented via an extension to the existing Python LSP engines?

will kernel developers in the Jupyter community be willing to go this route, or would they prefer that an LSP server is implemented for each kernel / existing language servers are amended for each kernel?

I'm of the (possibly unpopular) opinion that magics introduce significant unexpected tooling overhead. This includes LSP, syntax highlighting, linting, formatting and graduating code to files. Languages should think very hard before adding them.

krassowski commented 4 years ago

Thank you for chiming in @blois. I respect your point of view, and agree that if there were no magics, the overhead in maintaining the tooling would not be as high.

However, as a happy user of IPython I could not imagine it without magics; I use them heavily every day and I believe they are a huge contributor to IPython's success compared to some other kernels that explicitly decided not to support magics (excluding kernels of languages which do not need magics as the syntax of the language is very flexible in itself).

Could cell magics be implemented via an extension to the existing Python LSP engines?

No, I believe that implementing magics in LSP servers would not be realistically possible. In addition to the reasons listed in Why should it be declared by the kernel? Should not the LSP servers work harder to adapt to the language variation? (above):

there are dozens of LSP servers for each language. A Python user might like pyls today as it is open source, but tomorrow want to switch to python-language-server from Microsoft, and the day after that to pylance (also from MS). The last one is not open source, but probably the most advanced one. How would you implement magics in a closed-source server?
there are dozens of kernels which use similar base magics for different languages; the LSP server can be written in these languages and I just cannot imagine how one would go about implementing magics in each of these servers to cater to specialised kernels with a user base in the thousands rather than millions. In my view these kernels are a vital part of the Jupyter ecosystem, which is open to diversity and to the niche needs of specialised areas.
- the whole point of having Metakernel, or now xeus was to make creating kernels easy for anyone
- the whole point of LSP is to solve n:m problem (n editors need to adapt to support m languages)

Are there languages other than Python that have cell magics

Kernels that support magics in general:

IPython (obviously): line magics, cell magics, shell expressions (bang!), pinfo (? and ??), last output saved in _ (etc)
IScala supports magics
xeus-cling (C++ kernel) supports both cell and line magics
xeus-python supports both line and cell magics
IPolyglot is built around cell magics; magics are essential to the existence of this kernel
Perl 6/Raku kernel supports magics (#% and %% syntax)
CoCalc and embedded SageMath kernels extensively support magics, but renamed them to "modes" to make it easier to understand for users (I don't think it is easier for non-native speakers but that's a fair choice)
IPyStata is built around cell magics; magics are essential to the existence of this kernel
All kernels based on Metakernel support magics:
- Basic set of line and cell magics for all kernels.
  - Python magic for accessing python interpreter.
  - Run kernels in parallel.
  - Shell magics.
  - Classroom management magics.
- Tab completion for magics and file paths.
- Help for magics using ? or Shift+Tab.
- Plot magic for setting default plot behavior.
- this includes: matlab_kernel, octave_kernel, calysto_processing kernel and dozens more.
@SylvainCorlay proposed that xeus could make some magics (e.g. %%timeit) available to all xeus-based kernels (https://github.com/jupyter-xeus/xeus-python/issues/63#issuecomment-556023719)

Kernels that do not support magics:

IJulia - tolerates but does not support (i.e. IPython magics are essentially help commands to make things easier to users switching over from IPython)
IRKernel due to a design decision does not support magics, even though users repeatedly asked for magics support
IElixir only supports magical namespace population (which can be addressed by token parsing as we can introduce 'cell-end' token with value specifying cell number); this is akin to IPython _ variables.
IRuby does not support and did not plan on supporting magics as of September 2019
IHaskell supports ihaskell-directives, which are effectively line magic equivalents; they do not support defining custom magics, cell magics or shell expressions (as those are kind-of easy to get with command-qq), but they are open to considering cell magics support in the future.

I did not look specifically for cell magics - why are you focusing on cell magics only?

blois commented 4 years ago

I agree on the usefulness of magics, really trying to figure out a way to break the problem down some.

As for moving support into the LSP: I wonder if it could be done with one LSP server wrapping another and doing the translation at that level. Some of the implementation would be tricky, but I'm not sure it's much different from what needs to be done at the editor level. It sounds like VSCode may be trying something along these lines for their notebook support? I see no mention of supporting magics there though.

jupyter-lsp / jupyterlab-lsp