Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to format it correctly in the future); instead, create a gist, which is easier to work with.
> Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to format it correctly in the future); instead, create a gist, which is easier to work with.
You got to it before I could even edit the thing; I don't use GitHub unless necessary, so I make mistakes lol.

I cannot offer a smaller example, as I think the problem resides in the size of the grammar paired with the size of the input text, and due to the nature of my recursive execution, every grammar rule must have its respective function. Only exactly what is absolutely necessary to demonstrate the issue is provided.

I suppose you could remove the whole execution section, but I leave it in to demonstrate that it is in fact the tokenisation process that is the problem; the parsing of the resulting JSON tree is indeed very fast, so there's little reason for it to be SO slow...

However, feel free to delete the ASTexecutor and VARIABLE classes and the respective call at the bottom if you don't want to see that; then you will truly have the bare minimum.

I have edited the post and added more context.

I have removed the ASTexecutor and VARIABLE classes; now it is minimal, though it cannot be verified anymore.
> You got to it before I could even edit the thing; I don't use GitHub unless necessary, so I make mistakes lol.
Tip for the future: You can preview your messages before posting. Editing messages afterwards is a bit less optimal, since I at least always receive an email first. Having lots of poorly formatted code in the email, which also hides the button to go to the GitHub message behind a wall of text, is annoying for everyone else.
You are using `dynamic_complete`. That is by far the slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need it for a language you want to design. Its best use is for when you have messy input or something like that.

And yes, Earley has a time complexity of O(n^3) for highly ambiguous grammars. But unless you are trying to parse a natural language, you don't need that. Try using `parser="lalr"` and `strict=True` and fix the problems that come up one by one.
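For illustration, that switch is just two constructor options; a minimal sketch with a stand-in toy grammar (not the grammar from this issue), and note that as far as I know `strict=True` wants the optional `interegular` package installed:

```python
from lark import Lark

# Stand-in toy grammar, not the grammar from this issue.
GRAMMAR = r"""
start: CNAME ":=" NUMBER
%import common.CNAME
%import common.NUMBER
%ignore /[ \t]+/
"""

# strict=True makes Lark report grammar problems (e.g. colliding
# terminals) up front instead of resolving them silently.
parser = Lark(GRAMMAR, parser="lalr", strict=True)
print(parser.parse("var := 5").pretty())
```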
> You are using `dynamic_complete`. That is by far the slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need it for a language you want to design. Its best use is for when you have messy input or something like that.
Running the same thing with `lexer='basic'` creates an error:

```
Error: + at line 2, column 28.
print(round(strlen("reee") + 500+ (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500)) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500))) " is indeed a number!")
^
####################################
None
####################################
Expected one of: {'BOOLEAN', 'STRING', 'LSQB', 'NUMBER', '_WS', 'CNAME', 'LBRACE', 'LPAR'}.
Previous tokens: None
```
And editing the line to something simpler:

```
Error: := at line 2, column 5.
var := 5+5+5
^
####################################
None
####################################
Expected one of: {'LSQB', 'LBRACE', '_WS', 'NUMBER', 'STRING', 'CNAME', 'BOOLEAN', 'LPAR'}.
Previous tokens: None
```
Relevant lines in the grammar:

```
start: statements
statements: (statement _NL)* statement
statement: _WS* _statement? _WS* LINE_COMMENT?
_statement: MULTI_LINE_COMMENT
          | COMMAND
          | statement_loop
          | statement_if
          | statement_function_definition
          | statement_return
          | statement_assignment
          | expression
          | statement_exp_inc_dec
          | statement_inc_dec

statement_assignment: (variable|expression_array) _WS* OPERATOR_ASSIGN _WS* expression
variable: CNAME
expression_array: CNAME "[" array_list "]"

OPERATOR_ASSIGN: ":="
_WS: /[ \t]/

expression: _WS* expression_ternary _WS*
// continue to the expression tree
```
The default is `dynamic`, not `basic`. It seems like the `basic` lexer has some other bug that I am currently trying to figure out.
Trying `dynamic`, it completes.

Testing `dynamic_complete` on around 700 lines of `var := 5+5+5`:

```
assigning grammar
tokenizing
--- 31.171510457992554 seconds ---
```

Testing `dynamic` on around 700 lines of `var := 5+5+5`:

```
assigning grammar
tokenizing
--- 31.176007747650146 seconds ---
```

Zero change. This didn't seem right, so as a sanity check I ran the tests again and ensured I am indeed editing and running the correct file etc.

`dynamic_complete`:

```
tokenizing
--- 32.824604749679565 seconds ---
```

`dynamic` test 1:

```
tokenizing
--- 25.929264545440674 seconds ---
```

`dynamic` test 2:

```
tokenizing
--- 28.417376279830933 seconds ---
```

`dynamic` test 3:

```
tokenizing
--- 27.651228427886963 seconds ---
```

So about a 5% increase in speed (sometimes).
Aha, `OPERATOR_CONCAT` is the problem for `lexer='basic'`. Having space be a valid operator is always going to cause problems for a fast tokenization scheme.
> Aha, `OPERATOR_CONCAT` is the problem for `lexer='basic'`. Having space be a valid operator is always going to cause problems for a fast tokenization scheme.
That sounds about in line with what I had to deal with. One of the best features of AHK (and it's quite unconventional) is its concat-with-space: `print(var " is stored inside the variable")` is valid, with no `.` or `+` concat, and everything is a string; floats, strings, ints are all kept in strings, so it's very convenient to use, just throw the data at it and it will handle it.

However, space concat is by far the hardest thing to get working, which is ultimately what made me use Earley over LALR and (probably) `dynamic_complete` over `basic` (though `dynamic` seems fine too). I'm unsure if LALR is even possible for this syntax.

To be clear, nobody has tokenised AHK like this before; all the AHK source codes (_B, _L, v2, _H, IronAHK, etc.) operate on the syntax by hammering away at it with raw string funcs (frankly this is insanity to me), so I wish to tokenise it "as closely to original syntax" as I can, though I'm willing to make concessions for some form of standardised I/O...

I hope you can see what limitations this brings.
Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use `expr expr` as whitespace concat? It might still cause a bit of ambiguity, but I believe it would be much easier on Earley.

However, I would guess most of the slowdown is actually coming from `_WS: /[ \t]/`, which would be much, much better as `_WS: /[ \t]/+`.
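For illustration, a minimal sketch of that suggestion with toy rule names (not the issue's grammar): whitespace is ignored globally, and concatenation is plain adjacency.

```python
from lark import Lark

# Toy sketch: ignore whitespace entirely and let two adjacent
# expressions mean concatenation.
GRAMMAR = r"""
?expr: atom+                       // "a b" parses as a concat of atoms
?atom: CNAME | NUMBER | "(" expr ")"
%import common.CNAME
%import common.NUMBER
%ignore /[ \t]+/
"""

parser = Lark(GRAMMAR, start="expr")  # Earley is the default parser
print(parser.parse('var (5)').pretty())
```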
> Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use `expr expr` as whitespace concat? It might still cause a bit of ambiguity, but I believe it would be much easier on Earley.
I think I tried this, but I ran into problems where:

```
print( var (5+5)) ; would resolve to
print( var(10) )
```

"function var not found" (or vice versa, based on Python hash seed or something), the vice versa being:

```
print(var (10)) ; correctly
```

Never been gaslit by a terminal before, but lark pulled that off for a full 2 days somehow.
> However, I would guess most of the slowdown is actually coming from `_WS: /[ \t]/`, which would be much, much better as `_WS: /[ \t]/+`.
Test with that whitespace, and `dynamic`:

```
tokenizing
--- 26.913031339645386 seconds ---
```

Comparable.
You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.
> test with that whitespace

What did you test?

Did you change all the `WS*` into `WS`? I think that is causing the slowdown.
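As a rough illustration of the lookaround idea (an untested guess; the terminal and rule names here are hypothetical, not from this grammar):

```python
from lark import Lark

# Hypothetical: whitespace only becomes a CONCAT token when it sits
# between two non-space characters; the lookbehind/lookahead keep it
# from matching leading or trailing padding.
GRAMMAR = r"""
?expr: expr CONCAT atom
     | atom
atom: CNAME | NUMBER
CONCAT: /(?<=\S)[ \t]+(?=\S)/
%import common.CNAME
%import common.NUMBER
"""

parser = Lark(GRAMMAR, start="expr")  # Earley + dynamic lexer
print(parser.parse("var 5").pretty())
```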
> You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.

I'm unsure how to do this, but I will read up on it.

> test with that whitespace
>
> What did you test? I'd appreciate it if you used full sentences.

The whitespace you provided. I didn't think it needed more clarification given there were two options, one of which was current, but there you go: I tested with the `_WS: /[ \t]/+` change that you suggested.

> Did you change all the `WS*` into `WS`? I think that is causing the slowdown.
```
An unexpected error occurred: No terminal matches '
' in the current parser context, at line 1 col 1

^
Expected one of:
	* _WS
```

Now we run into newline issues, which is frustrating; I've had this newline problem before as well.

```
An unexpected error occurred: No terminal matches 'v' in the current parser context, at line 1 col 1

var := 5+5+5
^
Expected one of:
	* _WS
```

Removing the first blank newline of the code, it shows more inherent problems with changing `_WS*` to `_WS`.
That's because it should be sometimes `_WS` and sometimes `_WS?`. You'll have to do it yourself based on the context. But that should help the parser.
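For illustration, roughly what that contextual change could look like, cut down to a runnable toy using names from the grammar excerpt above (where `_WS` goes versus `_WS?` is a per-rule judgment call):

```python
from lark import Lark

# Toy cut-down of the excerpt: whitespace around ":=" is merely
# optional (_WS?), while the space acting as the concat operator
# stays mandatory (_WS). Note _WS now matches a whole run.
GRAMMAR = r"""
statement: variable _WS? OPERATOR_ASSIGN _WS? expression
expression: value (_WS value)*       // mandatory space = concat
value: NUMBER | STRING
variable: CNAME
OPERATOR_ASSIGN: ":="
_WS: /[ \t]/+
%import common.CNAME
%import common.NUMBER
%import common.ESCAPED_STRING -> STRING
"""

parser = Lark(GRAMMAR, start="statement")
print(parser.parse('var := 5 "five"').pretty())
```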
So we establish streamlining the grammar helps, and it does, marginally; I estimate we could see a 10-20% drop in runtime with full optimisation on that front, and a further 10x speedup switching to LALR, but I'm not that clever. But remember my options are:

I have 32 cores available, meaning we could very easily gain a 3200% increase in speed with parallelisation; this has not been discussed once yet. Remember, we're not looking for elegance here, this is nothing fancy, it just needs to be somewhat functional, but 2 minutes to tokenise 700 lines is frankly ridiculous. And as you know, there's not a problem in the world that can't be solved by throwing more compute at it, so how would this be achieved?

The code can be as slow as it wants to be, but it's using only 5% of my CPU; with parallelisation, any improvements would also be multiplied by core count, so I think this should be the priority currently.

I read that it is "trivial" to parallelise Earley parsing, but I'm unsure how, short of using an Earley parser to split the code and launching the split snippets in their own comprehensive parsers or something; there is little to no documentation on parallelising lark.
The easiest option to get it to run faster is to use PyPy instead of CPython.

I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?

First of all: tokenization is only a small part of parsing, and since switching from `dynamic_complete` to `dynamic` changed barely anything, it isn't a relevant factor. AFAIK, parsing is way harder to parallelize.

Second, we are talking about Python. Python is really hard to make work in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exist in lark, and I am not even sure that is possible to a useful degree.
> The easiest option to get it to run faster is to use PyPy instead of CPython.

I tried this; it won't use lark for some reason, it just keeps saying "module not found".

> I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?

I read it here: https://news.ycombinator.com/item?id=28259458

> LL or LR is almost certainly faster then Earley, since in general more restricted = faster. Earley's algorithm should be able to be trivially parallelized

> First of all: tokenization is only a small part of parsing, and since switching from `dynamic_complete` to `dynamic` changed barely anything, it isn't a relevant factor. AFAIK, parsing is way harder to parallelize.

I was hoping there was some magic way to get it working, but it appears it is simply just as hard as it seems, so I guess parallelisation is out the window, besides tokenising separate files at the same time. However, this person was doing something with multithreading: https://github.com/lark-parser/lark/issues/493
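For what it's worth, per-file parallelism needs nothing from lark itself; a hedged sketch with `multiprocessing` (the grammar path and input files are hypothetical, and it assumes the returned trees pickle cleanly):

```python
from multiprocessing import Pool
from lark import Lark

GRAMMAR_PATH = "mylang.lark"   # hypothetical grammar file
parser = None

def init_worker():
    # Each worker process builds its own parser once, rather than
    # pickling a parser across process boundaries.
    global parser
    parser = Lark.open(GRAMMAR_PATH, parser="earley")

def parse_file(path):
    with open(path) as f:
        return path, parser.parse(f.read())

if __name__ == "__main__":
    files = ["a.ahk", "b.ahk", "c.ahk"]   # hypothetical inputs
    with Pool(initializer=init_worker) as pool:
        for path, _tree in pool.imap_unordered(parse_file, files):
            print(path, "parsed")
```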
> Second, we are talking about Python. Python is really hard to make work in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exist in lark, and I am not even sure that is possible to a useful degree.

So how would you propose further improving this? Can you see a way this can work in LALR, as that would gain an instant 10x speed increase (judging by online benches)?
@Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.
Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.
> @Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.

There's no need to get defensive. You also said earlier that I should "use full sentences" but edited it out... If you are getting upset because I simply stated the facts of the matter, then that is not my problem, but I thank you for your input regardless... however anemic said input was.

> Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.

Again with the sarcasm; there is no need to get upset about this... I'm not continuing this reply, it's clear you have run out of constructive things to add and are now turning me into some strawman to attack rather than accepting you don't know the solution.

Has anyone else something constructive to add, without a generous dollop of attitude on top? I feel this guy has derailed the thread a little.
> If you are getting upset because I simply stated the facts

You concisely stated "it doesn't work", which is almost surely caused by you doing something wrong. But apparently you are unwilling to get help with that, so...

> I feel this guy has derailed the thread a little

"That guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing, being confused why Earley is so slow, and more or less demanding magical fixes. The answer is "use PyPy" and "use LALR instead". If the friend you claim to know has a suggestion for how to massively improve the speed of Earley, they are welcome to make it, preferably in the form of a PR.
> You concisely stated "it doesn't work", which is almost surely caused by you doing something wrong.

I pass the Python file to the exe as an argument... what else is there to do? It was my understanding that it is "plug n play".

> But apparently you are unwilling to get help with that, so...

Where did I say this?

> "That guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing, being confused why Earley is so slow, and more or less demanding magical fixes.

Dev or no dev, an attitude problem is an attitude problem; there's a nicer way to say everything than treating everyone as if they're stupid. I have been nothing but cordial and polite. Any animosity is their own, and only their own; I have taken no part in it whatsoever. In short: if they are upset, that's their problem.

I do not demand magical fixes; I was simply asking for suggestions and places to learn. I got a couple, while being treated like I'm stupid, then this final "I've lost interest". If I may give some advice in return: frankly, I think you lost interest in computers a long time ago.

> The answer is "use PyPy" and "use LALR instead". If the friend you claim to know has a suggestion for how to massively improve the speed of Earley, they are welcome to make it, preferably in the form of a PR.

Got it, neither of you know what's wrong. Closing thread.
> Got it, neither of you know what's wrong.
I know exactly what is wrong. You make mistake after mistake, but you're too strong-headed to admit you might be wrong.
I lost my patience because you kept blaming the tool for your own mistakes and shortcomings, you made crazy claims like "it is "trivial" to parallelise earley parsing" and then ignored me when I challenged you.
Sorry, but you're not in a position to give advice.
Best of luck with your project.
TLDR below.

This is my first time making a language; I have no formal education on the matter, so beware, I use layman's terms for all of this, but I've been toying with this idea for around 20 years at this point.

I am making an Earley parser for "AHK-like" syntax. It is not intended to be fully AHK compatible, however, but this will establish a lot of context at least.

It works great, but I found my tests were getting slower and slower. They only took about half a second to complete, but that was noticeable for the small script I was running, so I decided to run a 600-line script and it took a full 2 minutes 30 seconds to finish tokenising alone.

I have read around and asked friends, one of whom has written their own custom Earley parser, and they said there's no way it should be this slow.

I have reluctantly included the entire file I'm working with. It basically tokenises the whole file, converts the resulting tree to JSON, then parses the JSON, the idea being that tokenisation and execution can be separated, and the resulting JSON tree can be parsed by anything that supports JSON (assuming the code itself works within the limitations of the hardware), which is pretty cool.
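Presumably the tree-to-JSON step looks something like this (a sketch for illustration, not the code from the attached file):

```python
import json
from lark import Token

def tree_to_data(node):
    """Recursively turn a lark Tree/Token into plain JSON-friendly data."""
    if isinstance(node, Token):
        return {node.type: str(node)}   # terminal leaf: {TERMINAL: value}
    return {node.data: [tree_to_data(child) for child in node.children]}

# e.g. json.dumps(tree_to_data(parser.parse(source)), indent=2)
```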
It takes this tree and sends it into a recursive function within a class. It first iterates the whole tree finding all the function definitions and setting those up, then starts again from the top and recurses the tree, throwing each node's key to a function named after that key, doing work where required, dereferencing things and unpacking expressions as it goes by simply throwing the relevant parts of the tree back into the executor, and returning the result back up once it hits a TERMINAL. This much is plenty fast; the slow part is absolutely the tokenisation process. It can iterate the whole tree and execute many things, with many if statements for debug prints along the way, far faster than the tokenisation process, which is killing my project.
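That dispatch-by-key walk sounds roughly like the following (a hedged sketch of the pattern; the class and method names are invented, not the removed ASTexecutor):

```python
class Executor:
    """Walk the JSON tree, sending each node's key to a method of the
    same name; unhandled keys are recursed into, and terminal values
    bubble back up as plain results."""

    def run(self, node):
        if isinstance(node, dict):
            key, children = next(iter(node.items()))  # single-key nodes
            handler = getattr(self, key, None)
            return handler(children) if handler else self.run(children)
        if isinstance(node, list):
            result = None
            for child in node:
                result = self.run(child)
            return result
        return node  # a TERMINAL value

    def statement_assignment(self, children):
        # Hypothetical handler: evaluate sub-expressions by throwing
        # them back into the executor.
        return [self.run(child) for child in children]
```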
However, already with the limited syntax I'm finding the tokenisation very slow. I think the easiest increase would be to move to LALR, but I like how versatile Earley is, and I'm unsure if it's even possible to express AHK-like syntax in LALR due to its very unconventional nature.

Maybe I could do something funky like have one thread tokenising, and as each "statement" of the tree in "statements" is complete it pushes these to the main thread's queue. That way, as line 1 is executing, lines 2 and 3 are complete, then it execs line 2 while 5-6-7 and so on are in progress, but I think it will always be "waiting" on the tokenisation, as running the JSON is stupid fast.

I've heard tokenisation can be parallelised, and I do have 32 cores at my disposal (this language is intended for my personal use only), but I'm unsure how to achieve this.

Through some testing, with all but one line commented out, it still takes a while, so it seems the "volume" of the text given is part of the problem. Maybe the lookahead includes the entire rest of the file, so perhaps if it were to tokenise line-by-line, or only relevant blocks of code, it would be fast (a per-line sketch follows the timings below):

13 sec with all but one line commented, 20+ sec with all lines valid (way longer, about 2 mins), 1 sec with only one line.
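A hedged sketch of that line-by-line idea (it assumes the grammar lives in a file and exposes a `statement` rule that can serve as a start symbol; both are assumptions here):

```python
from lark import Lark

# Hypothetical: open the existing grammar but start at a single
# statement, then feed the source one line at a time, so Earley
# never sees the whole file at once.
line_parser = Lark.open("mylang.lark", start="statement", parser="earley")

def parse_per_line(source):
    trees = []
    for line in source.splitlines():
        if line.strip():                    # skip blank lines
            trees.append(line_parser.parse(line))
    return trees
```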
I thought perhaps if it were executing tokens as it went it would be faster, but this would make outputting and ingesting JSON trees a problem, and I can't see how it would be any faster than the total 2 min 30 sec to tokenise a large script; it would just be slowly executing things as it went.

It also seems Earley might take longer if the grammar is more ambiguous, but I'm unsure how to make my language more specific; I think it is quite specific.

TLDR

So my options are:

At this point, I have no idea what to do, or how to implement any of these changes. I considered the project dead, which is a huge shame, but I figured I'd post here and maybe get some insight/motivation to continue.

Per request below, I have removed the ASTexecutor and VARIABLE classes so the code will only report the tree; you will have to take my word that iterating the tree takes a fraction of a second.