MatthieuDartiailh commented 2 years ago

This is a heavy WIP aiming at replacing the ply based parser by one based on pegen (i.e. the PEG parser generator used by CPython itself).

By doing this I hope to:

- support new Python syntax (e.g. `match`) seamlessly and with little effort
- have a clearer definition of the enaml syntax
- improve error reporting with proper line and offset tracking

Pegen has a 0.1 release, to which I contributed, but it will likely need some extra work around error reporting (the Python parser it generates does not yet mimic the C one used by CPython).

Also note that this will require dropping support for Python 3.7.

sccolbert commented 2 years ago

How's the performance compared to the ply parser? (I'm just curious.)

On Sat, Jan 22, 2022 at 15:19 Matthieu Dartiailh @.***> wrote:


You can view, comment on, or merge this pull request online at:

https://github.com/nucleic/enaml/pull/474 Commit Summary

b2acab0 https://github.com/nucleic/enaml/pull/474/commits/b2acab052290083f6af6470222ddd2a2feecc3f2 core: make enaml ast inherit from regular Python ast

e80dbb4 https://github.com/nucleic/enaml/pull/474/commits/e80dbb4ee4c5516dde791d0a69e7807de7b89544 core: wip on using pegen based parser

bbbf3d5 https://github.com/nucleic/enaml/pull/474/commits/bbbf3d56567c6886db10cc3279b380165c37b71b core: wip on using pegen in templates

9dc1c76 https://github.com/nucleic/enaml/pull/474/commits/9dc1c767438492891a3d27089c5ce9d9bff253c7 core: wip on using pegen in templates

917939d https://github.com/nucleic/enaml/pull/474/commits/917939d515100d78a0fa02587eb5ec5e82437791 core: update the grammar to be able to generate the parser

d5b353a https://github.com/nucleic/enaml/pull/474/commits/d5b353a71d0016cd5fd982975ba72a4e3b34d7f4 core: remove old parsers

fb3d3f0 https://github.com/nucleic/enaml/pull/474/commits/fb3d3f06bac94e881dd89f89ee5b98c6083155ce setup: update dependencies

3ebede8 https://github.com/nucleic/enaml/pull/474/commits/3ebede8ddc2e9afd5a1c280c7e3bc7b732b24e62 tests: wip on updating tests

cb08dca https://github.com/nucleic/enaml/pull/474/commits/cb08dca7a82a46aaed317a58bcd1897f3b550adb enaml: make parse backward compatible and introduce an optimized parse_file

cb2c9c0 https://github.com/nucleic/enaml/pull/474/commits/cb2c9c083a7644032235f653858b8c220cbba427 tests: ensure the parser is always up to date when running tests

a0331e3 https://github.com/nucleic/enaml/pull/474/commits/a0331e38c185d67ee3efa2a41fa5218115a5d7c4 core: wip fixing the new parser

File Changes

(23 files https://github.com/nucleic/enaml/pull/474/files)

M enaml/core/enaml_ast.py https://github.com/nucleic/enaml/pull/474/files#diff-1ea20743fa8bcac88e9f040aa531c2bae62c138b1263412744e323dcbf54a9b1 (70)

M enaml/core/import_hooks.py https://github.com/nucleic/enaml/pull/474/files#diff-994bb651f5f30527e7c579841c488858d288041c39bf5c9bed421ca632d3e15a (4)

M enaml/core/parser/__init__.py https://github.com/nucleic/enaml/pull/474/files#diff-d8084e70bc68e2df2b2913b0e09a2fdf1d843a42727a5daf8139e93be2ba5274 (116)

A enaml/core/parser/base_enaml_parser.py https://github.com/nucleic/enaml/pull/474/files#diff-de22c04619fc1e260f03a4682ce1b2b751704942e119934d8b961825ef25ce45 (269)

D enaml/core/parser/base_lexer.py https://github.com/nucleic/enaml/pull/474/files#diff-5a72400ae005719c4740e92d0ff396f5b22ee0cab8ff951b62e2e4dc371a5de9 (792)

D enaml/core/parser/base_parser.py https://github.com/nucleic/enaml/pull/474/files#diff-4f2d1f73ee236445c07b63d58f0f9d0a618a3cae4a5d1d4f4130bfe3bb2642e9 (3356)

A enaml/core/parser/base_python_parser.py https://github.com/nucleic/enaml/pull/474/files#diff-139e128d642805e0bacf94a10c814a6814d4e4ff73ec33be2342440419eb4d3f (344)

A enaml/core/parser/enaml.gram https://github.com/nucleic/enaml/pull/474/files#diff-f5bc5ee6df8f0f4b8245ffc3a1b2f0b850d8d473aa71d38ec6bff8601827f2de (1683)

A enaml/core/parser/enaml_parser.py https://github.com/nucleic/enaml/pull/474/files#diff-53a4049170af8a3369b330d29ce74af3f64b6d17c165310c437e1615251fcad1 (10286)

A enaml/core/parser/generate_enaml_parser.py https://github.com/nucleic/enaml/pull/474/files#diff-88dc4ebab3574d98c3299cc55cf1a3f5853ee2922782e63b615de93e1ae20717 (24)

D enaml/core/parser/lexer3.py https://github.com/nucleic/enaml/pull/474/files#diff-70ebd1f510ac77784d8e0ba3e814cee78ca4c726e3de41bc8b783b85543ea8ec (142)

D enaml/core/parser/parse_tab/__init__.py https://github.com/nucleic/enaml/pull/474/files#diff-3c5b787475a02b92bf3134d806fb5160b2119979082e643c05bd3f861b49b6cc (1)

D enaml/core/parser/parser3.py https://github.com/nucleic/enaml/pull/474/files#diff-90288132fa304cb79df34c5eac9f6949756ad37a874098bffd3a95415181085c (413)

D enaml/core/parser/parser34.py https://github.com/nucleic/enaml/pull/474/files#diff-9f15a5330c8b1c551743b031504edb8d4479ce79b078cabe285ee49975254f0a (41)

D enaml/core/parser/parser35.py https://github.com/nucleic/enaml/pull/474/files#diff-2df006b45cb3d5c08faba2cc23d1c6078bc9a4ba92b1a2804cef69a9d8e4097f (196)

D enaml/core/parser/parser36.py https://github.com/nucleic/enaml/pull/474/files#diff-c2b4fc4cba743c9e93a113f882ed6f5afd081977743c31e52e37ec421425cf11 (134)

D enaml/core/parser/parser38.py https://github.com/nucleic/enaml/pull/474/files#diff-49978de08e2cbc4528eec27700882b0c4876011653b5dc574b33ced22ddba1a5 (76)

D enaml/core/parser/parser39.py https://github.com/nucleic/enaml/pull/474/files#diff-85c70094a3cd116630e2e49ebb91f042c149ba8eb9046878dc555bc130af1daf (53)

M enaml/runner.py https://github.com/nucleic/enaml/pull/474/files#diff-cda7a40258c0a5649d2322ac21d787141aab56fbf59f69bf2ba51c2134837880 (17)

M setup.py https://github.com/nucleic/enaml/pull/474/files#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7 (44)

M tests/conftest.py https://github.com/nucleic/enaml/pull/474/files#diff-e52e4ddd58b7ef887ab03c04116e676f6280b824ab7469d5d3080e5cba4f2128 (18)

M tests/core/parser/conftest.py https://github.com/nucleic/enaml/pull/474/files#diff-61b67aa4be83ee12462bf18778d3a2f5acfa1e87f3e586d86f4eb2f1ef6021ab (16)

M tests/core/test_zipimporter.py https://github.com/nucleic/enaml/pull/474/files#diff-ec1e8e05b38b4080e086d13389d1ee7337438d78e2d1411b0880a7a4af1c62fd (8)

Patch Links:

https://github.com/nucleic/enaml/pull/474.patch

https://github.com/nucleic/enaml/pull/474.diff


MatthieuDartiailh commented 2 years ago

I did not measure it yet but I plan to do so once I get the tests to pass.

frmdstryr commented 1 year ago

Is anything blocking this? I've been working on getting KDevelop to support enaml and this parser sets the col_offsets.

Edit: After rebasing it fails on the zipimporter tests and two syntax error tests expecting a specific line number.

frmdstryr commented 1 year ago

I ran a quick benchmark using IPython's timeit on one of my projects: 9 files loaded in memory, about 4k LOC in total.

The existing parser is about 4-5 times faster. Both runs were on Python 3.10.

```python
from enaml.core.parser import parse

# Read the files into a dict mapping filename -> source,
# then benchmark with:
%timeit [parse(s, f) for f, s in files.items()]
```

Existing parser (current master)

296 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
302 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
290 ms ± 1.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pegen parser

1.31 s ± 5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.36 s ± 8.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.36 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
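For anyone who wants to reproduce this kind of measurement outside IPython, here is a minimal sketch using the standard `timeit` module. The `bench` helper and the sample `files` dict are hypothetical, and `ast.parse` is only a stand-in so the snippet runs without enaml installed; to compare the two branches one would pass `enaml.core.parser.parse` instead.

```python
import ast
import statistics
import timeit


def bench(parse, files, repeat=7, number=1):
    """Return (mean, stdev) in seconds for parsing every file once.

    `parse` is any callable taking (source, filename), which matches
    how `parse(s, f)` is called in the benchmark above.
    """
    def run():
        for filename, source in files.items():
            parse(source, filename)

    # timeit.repeat returns the total time of `number` calls, per repeat.
    timings = [t / number for t in timeit.repeat(run, repeat=repeat, number=number)]
    return statistics.mean(timings), statistics.stdev(timings)


# Hypothetical stand-in inputs: plain Python sources parsed with ast.parse.
files = {
    "a.py": "x = 1\n" * 100,
    "b.py": "def f():\n    return 2\n",
}
mean, stdev = bench(ast.parse, files)
print(f"{mean * 1e3:.3f} ms ± {stdev * 1e3:.3f} ms per loop")
```

Running the same script once on master and once on the pegen branch, with the same `files`, should give numbers comparable to the `%timeit` output.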

MatthieuDartiailh commented 1 year ago

So this is definitely worse than I had hoped for. Could you try to measure the cost of lexing and of parsing separately? Currently we use the builtin tokenize, but I have changes (yet to be merged into pegen) that would allow using a different tokenizer, and it would be great to know if we can win anything on that front.

I think that a pegen based solution in pure Python will always be slower due to how the parser works, but I hope we can make the change more tolerable. It also means I should look again into the option of precompiling enaml files during install.
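To illustrate the precompilation idea in general terms, here is a minimal sketch of a "compile once, reuse while fresh" cache built only on the standard library. It is not enaml's caching mechanism: `load_code`, the cache file name, and the use of plain Python source instead of enaml source are all assumptions made for the example.

```python
import marshal
import os
import tempfile


def load_code(path, cache_path):
    """Return (code, from_cache) for the source file at `path`.

    The compiled code object is stored at `cache_path` and reused as long
    as the cache file is at least as recent as the source file.
    """
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(path)):
        with open(cache_path, "rb") as f:
            return marshal.load(f), True

    # Cache miss: pay the parse/compile cost once and store the result.
    with open(path, encoding="utf-8") as f:
        code = compile(f.read(), path, "exec")
    with open(cache_path, "wb") as f:
        marshal.dump(code, f)
    return code, False


# Demo with a throwaway file: the first call compiles, the second hits the cache.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")
    cache = os.path.join(tmp, "example.cache")
    with open(src, "w", encoding="utf-8") as f:
        f.write("answer = 6 * 7\n")

    _, first_from_cache = load_code(src, cache)
    code, second_from_cache = load_code(src, cache)

    namespace = {}
    exec(code, namespace)
```

Doing the first step at install time would move the parsing cost out of the first import, which is where a slower parser hurts most.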

frmdstryr commented 1 year ago

I'm not sure if I'm doing it correctly but:

Existing parser

```python
from enaml.core.parser import _parser

def lex(source, filename):
    # Run only the ply-based lexer and drain its token stream.
    lexer = _parser.lexer(filename)
    lexer.input(source)
    return list(lexer.token_stream)

%timeit [lex(s, f) for f, s in files.items()]
```

101 ms ± 3.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using tokenize, as in the pegen branch

```python
import io
import tokenize
from pegen.tokenizer import Tokenizer

def lex(source):
    # Feed the stdlib token generator into pegen's Tokenizer wrapper.
    tok_stream = tokenize.generate_tokens(io.StringIO(source).readline)
    tokenizer = Tokenizer(tok_stream)
    try:
        # Drain the stream; it raises StopIteration once exhausted.
        while tokenizer.getnext():
            pass
    except StopIteration:
        pass
    return tokenizer._tokens

%timeit [lex(s) for f, s in files.items()]
```

57.7 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So the lexing part seems to be faster.

MatthieuDartiailh commented 1 year ago

It does not look obviously wrong to me. So once I am done with 3.11 base support I won't lack things to do when it comes to the parser.

frmdstryr commented 1 year ago

Well, with the pegen parser, KDevelop can now at least highlight some of the variables properly. :) The current parser does not set the col offsets for any of the Python expressions, so everything highlights out of whack.

I would like to get it to somehow check all the attributes and understand the scoping but the ast does not really map directly to python and confuses it.
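As an illustration of why the column offsets matter for highlighting, here is a small sketch using the standard `ast` module on plain Python source (not the enaml AST). The `name_positions` helper is hypothetical; it shows the kind of position data an editor needs in order to map each name back to a span of text.

```python
import ast

source = "total = width + height\n"
tree = ast.parse(source)


def name_positions(tree):
    """Collect (identifier, line, start column, end column) for every Name node."""
    return [
        (node.id, node.lineno, node.col_offset, node.end_col_offset)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
    ]


# Sort by start column so the names come out in source order.
positions = sorted(name_positions(tree), key=lambda p: p[2])

line = source.splitlines()[0]
for name, lineno, start, end in positions:
    # Slicing the source with the offsets recovers exactly the identifier text.
    print(f"line {lineno}, cols {start}-{end}: {line[start:end]!r}")
```

Without correct offsets, the slice in the loop lands on the wrong text, which is what misplaced highlighting looks like from the editor's side.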

MatthieuDartiailh commented 1 year ago

I am honestly impressed you get that much to work. Improving tooling around enaml files is something I would like to pursue but I have no idea when.

MatthieuDartiailh commented 1 year ago

Another thing to check would be to regenerate the parser with the main branch of pegen, which includes better handling of error checks and some optimizations around unused variables. I wonder how much it influences performance.

frmdstryr commented 1 year ago

I got KDevelop to also find inherited and parent attributes now.

sccolbert commented 1 year ago

That's pretty cool!

This all has me wondering what a next-gen Enaml would look like if we owned the whole stack all the way down to rendering.

On Mon, Nov 7, 2022 at 1:17 PM frmdstryr @.***> wrote:

I got KDevelop to also find inherited and parent attributes now.

[image: image] https://user-images.githubusercontent.com/380158/200393848-d7f7978a-956d-4158-a279-2052e942fe17.png


MatthieuDartiailh commented 1 year ago

One project that reminds me of enaml (even though I did not dive deep enough) is https://slint-ui.com/releases/0.3.1/docs/rust/slint/docs/langref/index.html

MatthieuDartiailh commented 1 year ago

@frmdstryr could you detail a bit how you managed to reach that point? I am wondering if something similar could be done for VS Code.

frmdstryr commented 1 year ago

The code is in https://github.com/frmdstryr/kdev-python/tree/enaml-support . Their parser originally used the c-api but since that was removed in 3.10 I changed it to use the python api and was able to just add the enaml stuff on top.

Edit: If you try it, you'll need the pegen branch or it will crash as the col_offset is missing. It should work on windows but I haven't tried.

codecov-commenter commented 1 year ago

Codecov Report

Merging #474 (bb00f9d) into main (bcfb9b7) will decrease coverage by 1.37%. The diff coverage is 84.23%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #474      +/-   ##
==========================================
- Coverage   74.59%   73.21%   -1.38%
==========================================
  Files         302      296       -6
  Lines       22944    25810    +2866
  Branches     2951     3648     +697
==========================================
+ Hits        17114    18896    +1782
- Misses       4936     5829     +893
- Partials      894     1085     +191
```

MatthieuDartiailh commented 1 year ago

@frmdstryr I will try to merge this by the end of the week so as to unlock work on 3.11 support. I will open issues for the remaining pain points (speed, compilation to enamlc at install time, etc) since my current bandwidth won't allow me to get to them in a reasonable time.

frmdstryr commented 1 year ago

Thanks Matthieu!

MatthieuDartiailh commented 1 year ago

With the last fix all exopy tests pass and since our tests pass too I will consider this good enough to merge.

nucleic / enaml

Pegen based new parser #474
