Implement parser for Sage

embray commented 4 years ago

The Sage interpreter has its own language that is close to Python but slightly extends it, including various transformations of how literals (particularly numeric literals) are to be interpreted. There is also the issue of automatic variables, which traditionally were implemented only by the Sage Notebook REPL rather than in Sage itself; see e.g. #21959.

The Sage preparser, which converts Sage code to valid Python code, is an unwieldy series of regular expressions and other ad-hoc string parsing and transformation code. While it has been fairly stable over the years, it is difficult to maintain, especially when new syntax is added to Python (see #28974 and the difficulty of adding support for f-strings).

Defining a formal grammar for Sage (again, as an extension of Python's grammar) and using a real lexer/parser in the Sage interpreter would make the code much simpler and easier to understand, make it easier to respond to syntax changes in Python, and make it easier to evolve the Sage language as a real language that is a superset of Python. The parser would convert code to ASTs, which can then be transformed into ASTs acceptable to Python's bytecode compiler.

This might be made easier by using the new PEG parser introduced in Python 3.9 (https://www.python.org/dev/peps/pep-0617/). This does not necessarily mean making Sage dependent on Python 3.9, though, as we can generate our own parser using an existing third-party parser generator, or using the one that was written for Python and generates a C parser: https://github.com/python/cpython/tree/master/Tools/peg_generator

(If using this, one would also want to add a simple extension module providing an interface to the new parser, so it can be easily used by the Sage interpreter; there is example code for such an extension in the peg_generator package as well.)

If it turns out that extending the parser generator used by Python is infeasible, I believe Guido was also inspired by the TatSu parser generator, the current version of which requires Python 3.8, though earlier versions support Python 3.6+.

This would be a major task for anyone who wants to take it on, though it would be an interesting project and I think highly valuable.

Summary of new syntax that would have to be supported by a Sage parser (distilled from https://doc.sagemath.org/html/en/reference/repl/sage/repl/preparse.html):

"raw" literals: numeric literals followed by r or R denoting that they should be interpreted as the Python built-in types instead of Sage types
generator syntax:
- `g.0` is equivalent to `g.gen(0)` (if `g` does not have a `.gen` method this results in a `TypeError` at runtime, or something)
- `g.<g1, g2, g3> = G()` is equivalent to `g = G(names=['g1', 'g2', 'g3']); g1, g2, g3 = g.gens()` (again, this should also include some runtime type check that `G` is a Parent with generators)
implicit multiplication: `a b c` in an expression is equivalent to `a * b * c` (a feature that can be enabled or disabled, so the parser needs a flag for whether or not to accept this)
- this also needs to support `NUMBER '' term` meaning `<number> * <term>`; e.g. `3exp(x)` -> `3 * exp(x)`; this somewhat modifies the rules for terms in an expression, since a term beginning with a number follows different rules than a term not beginning with a number
method calls are allowed directly on numeric literals (just method calls or attribute lookups as well?)
symbolic function definitions like `f(x, y) = x + y^2`
ellipsis notation (need to expand on what these mean and their exact syntax), like:
- `[1, 2, .., n]`
- `for y in (f(x) .. L[10])`
- `[1..5]`
Backslash operator `\\` (it is treated as equivalent to multiplication in the order of operations, but has different semantics)

Anything I'm missing?
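To illustrate what "transformed into ASTs acceptable by Python's bytecode compiler" could look like for the generator syntax, here is a minimal sketch using only the standard `ast` module. It assumes a custom parser has already split `g.<g1, g2, g3> = G()` into a target name, a list of generator names, and the right-hand side; the function `desugar_generators` and the stand-in class `FakeParent` are hypothetical names for illustration, and the output follows the equivalence written in the list above, which may differ from what the current preparser emits.

```python
import ast


def desugar_generators(target, names, call_source):
    """Build the module AST for `target.<names...> = call_source`."""
    call = ast.parse(call_source, mode="eval").body
    if not isinstance(call, ast.Call):
        raise SyntaxError("right-hand side must be a call such as G()")

    # G(...) -> G(..., names=['g1', 'g2', 'g3'])
    name_list = ast.List(elts=[ast.Constant(value=n) for n in names], ctx=ast.Load())
    call.keywords.append(ast.keyword(arg="names", value=name_list))

    # g = G(..., names=[...])
    assign_parent = ast.Assign(
        targets=[ast.Name(id=target, ctx=ast.Store())], value=call
    )

    # g1, g2, g3 = g.gens()
    gens_call = ast.Call(
        func=ast.Attribute(
            value=ast.Name(id=target, ctx=ast.Load()), attr="gens", ctx=ast.Load()
        ),
        args=[],
        keywords=[],
    )
    unpack = ast.Assign(
        targets=[
            ast.Tuple(
                elts=[ast.Name(id=n, ctx=ast.Store()) for n in names],
                ctx=ast.Store(),
            )
        ],
        value=gens_call,
    )

    module = ast.Module(body=[assign_parent, unpack], type_ignores=[])
    return ast.fix_missing_locations(module)


class FakeParent:
    """Hypothetical stand-in for a Sage parent with generators."""

    def __init__(self, names):
        self._names = list(names)

    def gens(self):
        return tuple("gen:" + n for n in self._names)


# Compile and run the generated AST directly; no source string is rebuilt.
ns = {"G": FakeParent}
tree = desugar_generators("g", ["g1", "g2", "g3"], "G()")
exec(compile(tree, "<sage>", "exec"), ns)
```

The sketch leaves out the runtime type check mentioned in the list; one option would be to emit an extra call to a helper that verifies the object has generators before unpacking.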

Already valid syntax in Python but with different semantics in Sage:

`^` means exponentiation by default
numerical literals are Sage types (`Integer`, `RealNumber`, `ComplexNumber`, etc.)
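Because these two cases are already valid Python syntax, they can be prototyped as a pure AST-to-AST pass with the standard `ast` module. The following is a rough sketch, not a proposed implementation: the class `SageSemantics` and the function `sage_semantics` are hypothetical names, and `int` and `float` stand in for `Integer` and `RealNumber` so it runs without Sage.

```python
import ast


class SageSemantics(ast.NodeTransformer):
    """Rewrite a plain-Python AST to follow the two Sage conventions above."""

    def __init__(self, source):
        self.source = source  # kept so float literals can be recovered as text

    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite both operands first
        if isinstance(node.op, ast.BitXor):
            node.op = ast.Pow()  # `^` becomes exponentiation
        return node

    def visit_Constant(self, node):
        value = node.value
        if isinstance(value, bool):  # bool subclasses int; leave True/False alone
            return node
        if isinstance(value, int):
            return self._wrap("Integer", node, node)
        if isinstance(value, float):
            # Pass the literal's source text so no precision is lost to float.
            text = ast.get_source_segment(self.source, node)
            return self._wrap("RealNumber", ast.Constant(value=text), node)
        return node  # strings, complex literals, None, ... are left untouched

    @staticmethod
    def _wrap(name, arg, original):
        call = ast.Call(
            func=ast.Name(id=name, ctx=ast.Load()), args=[arg], keywords=[]
        )
        return ast.copy_location(call, original)


def sage_semantics(source):
    """Parse valid Python source and return a module AST with Sage semantics."""
    tree = SageSemantics(source).visit(ast.parse(source))
    return ast.fix_missing_locations(tree)


# Stand-ins so the sketch runs without Sage installed.
namespace = {"Integer": int, "RealNumber": float}
exec(compile(sage_semantics("x = 2^4\nz = 2.2*5\n"), "<sage>", "exec"), namespace)
```

One caveat shows why a real grammar is still needed: Python parses `^` with the precedence and associativity of XOR, so `2*3^2` reaches this pass as `(2*3)^2`, and swapping the operator afterwards yields `(2*3)**2`. A Sage grammar would have to parse `^` at the precedence level of `**` in the first place.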

Related:

- #30501: Define a Sage syntax highlighting

CC: @mwageringel @slel

Component: user interface

Keywords: parser, syntax

Issue created by migration from https://trac.sagemath.org/ticket/30760

embray commented 4 years ago

comment:1

Another useful resource is Guido's blog series on implementing a PEG parser for Python: https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60

embray commented 4 years ago

comment:3

Python's built-in tokenizer (whether the C version or the plain Python version) is not so easy to extend as I'd hoped either. Would be nice to have an explicit listing somewhere of exactly what would be needed to support Sage.

embray commented 4 years ago

Description changed:

--- 
+++ 
@@ -11,3 +11,38 @@
 If it turns out extending the parser generator used by Python is infeasible, I believe Guido was also inspired by the [TatSu](https://pypi.org/project/TatSu/) parser generator; the current version of which requires Python 3.8, though earlier versions are Python 3.6+.

 This would be a major task for anyone who want to take it on, though it would be an interesting project and I think highly valuable.
+
+---
+
+Summary of new syntax that would have to be supported by a Sage parser (distilled from https://doc.sagemath.org/html/en/reference/repl/sage/repl/preparse.html):
+
+* "raw" literals: numeric literals followed by `r` or `R` denoting that they should be interpreted as the Python built-in types instead of Sage types
+
+* generator syntax:
+
+  * `g.0` is equivalent to `g.gen(0)`  (if `g` does not have `.gen` method this results in a `TypeError` at runtime, or something)
+  * `g.<g1, g2, g3>` = G() is equivalent to "g = G(names=['g1', 'g2', 'g3]); g1, g2, g3 = g.gens()`  (again this should also include some runtime type check that G is a Parent with generators)
+
+* implicit multiplication: `a b c` in an expression is equivalent to `a * b * c` (a feature that can be enabled or disabled, so this needs to be a flag in the parser whether or not to accept this)
+
+  * this also needs to support `NUMBER '' term` meaning `<number> * <term>`; e.g. `3exp(x)` -> `3 * exp(x)`; this modifies somewhat the rules for terms in an expression since a term beginning with a number has different rules for a term not beginning with numbers
+
+* method calls are allowed directly on numeric literals (just method calls or attribute lookups as well?)
+
+* symbolic function definitions like `f(x, y) = x + y^2`
+
+* ellipsis notation like (need to expand on what these mean and their exact syntax):
+  * `[1, 2, .., n]`
+  * `for y in (f(x) .. L[10])`
+  * `[1..5]`
+
+* Backslash operator `\\` (it is treated equivalent to multiplication in the order of operations, but has different semantics)
+
+Anything I'm missing?
+
+Already valid syntax in Python but with different semantics in Sage:
+
+* `^` means exponentiation by default
+* numerical literals are Sage types (`Integer`, `RealNumber`, `ComplexNumber`, etc.)
+
+

mwageringel commented 4 years ago

comment:5

Replying to @embray:

> Python's built-in tokenizer (whether the C version or the plain Python version) is not so easy to extend as I'd hoped either.

For tokenization, it might also be worth taking a look at parso. Its tokenizer is adapted from the Python version, but has some improvements. In particular, it supports tokenizing f-strings (which Python handles in an ad-hoc manner in a later phase).

slel commented 4 years ago

Changed keywords from none to parser, syntax

slel commented 4 years ago

Description changed:

--- 
+++ 
@@ -45,4 +45,6 @@
 * `^` means exponentiation by default
 * numerical literals are Sage types (`Integer`, `RealNumber`, `ComplexNumber`, etc.)

+Related:

+- #30501: Define a Sage syntax highlighting

lnay commented 1 year ago

I've been experimenting here with using tree-sitter APIs to create a SageMath-to-Python translator. It involves forking an existing tree-sitter grammar for Python (in particular, f-strings are dealt with already).

Little demo (currently just attempts to deal with numeric literals):

```
x = 2^4
z = 2.2*5
print(20r)
print(20r+1.2r)
print(f"{20r}")
print(f"{1/3}")
```

translates to

```python
# This file was generated by python script using tree-sitter APIs
from sage.all_cmdline import *

x = Integer(2)^Integer(4)
z = RealNumber('2.2')*Integer(5)
print(20)
print(20+1.2)
print(f"{20}")
print(f"{Integer(1)/Integer(3)}")
```

The README on the repo has a bit more about why I think this is a good idea.

I'd be interested in thoughts on whether this could be the direction to take for this issue.

If not, I'll likely still pursue this for the syntax highlighter in neovim.
