Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to format it correctly in the future); instead, create a gist, which is easier to work with.
> Please create a smaller example script that showcases the actual thing you think could be faster, and don't include it as pasted text (even then, please make sure to format it correctly in the future); instead, create a gist, which is easier to work with.
You got to it before I could even edit the thing; I don't use GitHub unless necessary, so I make mistakes lol.

I cannot offer a smaller example, as I think the problem resides in the size of the grammar paired with the size of the input text, and due to the nature of my recursive execution, every grammar rule must have its respective function. Only exactly what is absolutely necessary to demonstrate the issue is provided.

I suppose you could remove the whole execution section, but I leave it in to demonstrate that it is in fact the tokenisation process that is the problem; the parsing of the resulting JSON tree is indeed very fast, so there's little reason for it to be SO slow...

However, feel free to delete the ASTexecutor and VARIABLE classes and the respective call at the bottom if you don't want to see that; then you will truly have the bare minimum.

I have edited the post and added more context.

I have removed the ASTexecutor and VARIABLE classes; now it is minimal, though it cannot be verified anymore.
> You got to it before I could even edit the thing; I don't use GitHub unless necessary, so I make mistakes lol.
Tip for the future: You can preview your messages before posting. Editing messages afterwards is a bit less optimal, since I at least always receive an email first. Having lots of poorly formatted code in the email, which also hides the button to go to the GitHub message behind a wall of text, is annoying for everyone else.
You are using `dynamic_complete`. That is by far the slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need it for a language you want to design. Its best use is for when you have messy input or something like that.

And yes, Earley has a time complexity of O(n^3) for highly ambiguous grammars. But unless you are trying to parse a natural language, you don't need that. Try using `parser="lalr"` and `strict=True` and fix the problems that come up one by one.
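For illustration, that switch is just two constructor options; a minimal sketch with a stand-in toy grammar (not the grammar from this issue), and note that as far as I know `strict=True` wants the optional `interegular` package installed:

```python
from lark import Lark

# Stand-in toy grammar, not the grammar from this issue.
GRAMMAR = r"""
start: CNAME ":=" NUMBER
%import common.CNAME
%import common.NUMBER
%ignore /[ \t]+/
"""

# strict=True makes Lark report grammar problems (e.g. colliding
# terminals) up front instead of resolving them silently.
parser = Lark(GRAMMAR, parser="lalr", strict=True)
print(parser.parse("var := 5").pretty())
```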
> You are using `dynamic_complete`. That is by far the slowest possible option you can use. Why are you using it, i.e. what concrete problems did you run into that made you use it? You should almost never need it for a language you want to design. Its best use is for when you have messy input or something like that.
Running the same thing with `lexer='basic'` creates an error:

```
Error: + at line 2, column 28.
print(round(strlen("reee") + 500+ (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + (5*(1/((1/((5*(1/((5*(1/((5*(1/((1/((5*(1/((5*2)+6)-1))+6)-1) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500)) + ((1/((5*(1/((5*2)+6)-1))+6)-1)*2)+6)-1))+6)-1))+6)-1) + ((1/((5*(1/((5*(1/((5*2)+6)-1))+6)-1))+6)-1)*2)+6)-1))+6)-1) + (500 * 500))) " is indeed a number!")
^
####################################
None
####################################
Expected one of: {'BOOLEAN', 'STRING', 'LSQB', 'NUMBER', '_WS', 'CNAME', 'LBRACE', 'LPAR'}.
Previous tokens: None
```
And editing the line to something simpler:

```
Error: := at line 2, column 5.
var := 5+5+5
^
####################################
None
####################################
Expected one of: {'LSQB', 'LBRACE', '_WS', 'NUMBER', 'STRING', 'CNAME', 'BOOLEAN', 'LPAR'}.
Previous tokens: None
```
Relevant lines in the grammar:

```
start: statements
statements: (statement _NL)* statement
statement: _WS* _statement? _WS* LINE_COMMENT?
_statement: MULTI_LINE_COMMENT
          | COMMAND
          | statement_loop
          | statement_if
          | statement_function_definition
          | statement_return
          | statement_assignment
          | expression
          | statement_exp_inc_dec
          | statement_inc_dec

statement_assignment: (variable|expression_array) _WS* OPERATOR_ASSIGN _WS* expression
variable: CNAME
expression_array: CNAME "[" array_list "]"

OPERATOR_ASSIGN: ":="
_WS: /[ \t]/

expression: _WS* expression_ternary _WS*
// continue to the expression tree
```
The default is `dynamic`, not `basic`. It seems like the `basic` lexer has some other bug that I am currently trying to figure out.
Trying `dynamic`, it completes.

Testing `dynamic_complete` on around 700 lines of `var := 5+5+5`:

```
assigning grammar
tokenizing
--- 31.171510457992554 seconds ---
```

Testing `dynamic` on around 700 lines of `var := 5+5+5`:

```
assigning grammar
tokenizing
--- 31.176007747650146 seconds ---
```

Zero change. This didn't seem right, so as a sanity check I ran the tests again and ensured I am indeed editing and running the correct file etc.

`dynamic_complete`:

```
tokenizing
--- 32.824604749679565 seconds ---
```

`dynamic` test 1:

```
tokenizing
--- 25.929264545440674 seconds ---
```

`dynamic` test 2:

```
tokenizing
--- 28.417376279830933 seconds ---
```

`dynamic` test 3:

```
tokenizing
--- 27.651228427886963 seconds ---
```

So about a 5% increase in speed (sometimes).
Aha, `OPERATOR_CONCAT` is the problem for `lexer='basic'`. Having space be a valid operator is always going to cause problems for a fast tokenization scheme.
> Aha, `OPERATOR_CONCAT` is the problem for `lexer='basic'`. Having space be a valid operator is always going to cause problems for a fast tokenization scheme.
That sounds about in line with what I had to deal with. One of the best features of AHK (and it's quite unconventional) is its concat-with-space: `print(var " is stored inside the variable")` is valid, with no `.` or `+` concat, and everything is a string; floats, strings, ints are all kept in strings, so it's very convenient to use, just throw the data at it and it will handle it.

However, space concat is by far the hardest thing to get working, which is ultimately what made me use Earley over LALR and (probably) `dynamic_complete` over `basic` (though `dynamic` seems fine too). I'm unsure if LALR is even possible for this syntax.

To be clear, nobody has tokenised AHK like this before; all the AHK source codes (_B, _L, v2, _H, IronAHK, etc.) operate on the syntax by hammering away at it with raw string funcs (frankly this is insanity to me), so I wish to tokenise it "as closely to original syntax" as I can, though I'm willing to make concessions for some form of standardised I/O...

I hope you can see what limitations this brings.
Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use `expr expr` as whitespace concat? It might still cause a bit of ambiguity, but I believe it would be much easier on Earley.

However, I would guess most of the slowdown is actually coming from `_WS: /[ \t]/`, which would be much, much better as `_WS: /[ \t]/+`.
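For illustration, a minimal sketch of that suggestion with toy rule names (not the issue's grammar): whitespace is ignored globally, and concatenation is plain adjacency.

```python
from lark import Lark

# Toy sketch: ignore whitespace entirely and let two adjacent
# expressions mean concatenation.
GRAMMAR = r"""
?expr: atom+                       // "a b" parses as a concat of atoms
?atom: CNAME | NUMBER | "(" expr ")"
%import common.CNAME
%import common.NUMBER
%ignore /[ \t]+/
"""

parser = Lark(GRAMMAR, start="expr")  # Earley is the default parser
print(parser.parse('var (5)').pretty())
```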
> Only a drive-by comment, but why do you need the space to be explicit? Why not just ignore whitespace, and use `expr expr` as whitespace concat? It might still cause a bit of ambiguity, but I believe it would be much easier on Earley.
I think I tried this, but I ran into problems where:

```
print( var (5+5)) ; would resolve to
print( var(10) )
```

"function var not found" (or vice versa, based on Python hash seed or something), the vice versa being:

```
print(var (10)) ; correctly
```

Never been gaslit by a terminal before, but lark pulled that off for a full 2 days somehow.
> However, I would guess most of the slowdown is actually coming from `_WS: /[ \t]/`, which would be much, much better as `_WS: /[ \t]/+`.
Test with that whitespace, and `dynamic`:

```
tokenizing
--- 26.913031339645386 seconds ---
```

Comparable.
You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.
> test with that whitespace

What did you test?

Did you change all the `WS*` into `WS`? I think that is causing the slowdown.
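As a rough illustration of the lookaround idea (an untested guess; the terminal and rule names here are hypothetical, not from this grammar):

```python
from lark import Lark

# Hypothetical: whitespace only becomes a CONCAT token when it sits
# between two non-space characters; the lookbehind/lookahead keep it
# from matching leading or trailing padding.
GRAMMAR = r"""
?expr: expr CONCAT atom
     | atom
atom: CNAME | NUMBER
CONCAT: /(?<=\S)[ \t]+(?=\S)/
%import common.CNAME
%import common.NUMBER
"""

parser = Lark(GRAMMAR, start="expr")  # Earley + dynamic lexer
print(parser.parse("var 5").pretty())
```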
> You can use regexps with lookaheads (and I think also lookbehinds) to control for whitespace. It would be much more efficient.

I'm unsure how to do this, but I will read up on it.

> test with that whitespace
>
> What did you test? I'd appreciate it if you used full sentences.

The whitespace you provided. I didn't think it needed more clarification given there were two options, one of which was current, but there you go: I tested with the `_WS: /[ \t]/+` change that you suggested.

> Did you change all the `WS*` into `WS`? I think that is causing the slowdown.
```
An unexpected error occurred: No terminal matches '
' in the current parser context, at line 1 col 1

^
Expected one of:
	* _WS
```

Now we run into newline issues, which is frustrating; I've had this newline problem before as well.

```
An unexpected error occurred: No terminal matches 'v' in the current parser context, at line 1 col 1

var := 5+5+5
^
Expected one of:
	* _WS
```

Removing the first blank newline of the code, it shows more inherent problems with changing `_WS*` to `_WS`.
That's because it should be sometimes `_WS` and sometimes `_WS?`. You'll have to do it yourself based on the context. But that should help the parser.
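For illustration, roughly what that contextual change could look like, cut down to a runnable toy using names from the grammar excerpt above (where `_WS` goes versus `_WS?` is a per-rule judgment call):

```python
from lark import Lark

# Toy cut-down of the excerpt: whitespace around ":=" is merely
# optional (_WS?), while the space acting as the concat operator
# stays mandatory (_WS). Note _WS now matches a whole run.
GRAMMAR = r"""
statement: variable _WS? OPERATOR_ASSIGN _WS? expression
expression: value (_WS value)*       // mandatory space = concat
value: NUMBER | STRING
variable: CNAME
OPERATOR_ASSIGN: ":="
_WS: /[ \t]/+
%import common.CNAME
%import common.NUMBER
%import common.ESCAPED_STRING -> STRING
"""

parser = Lark(GRAMMAR, start="statement")
print(parser.parse('var := 5 "five"').pretty())
```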
So we establish streamlining the grammar helps, and it does, marginally; I estimate we could see a 10-20% drop in runtime with full optimisation on that front, and a further 10x speedup switching to LALR, but I'm not that clever. But remember my options are:

I have 32 cores available, meaning we could very easily gain a 3200% increase in speed with parallelisation; this has not been discussed once yet. Remember, we're not looking for elegance here, this is nothing fancy, it just needs to be somewhat functional, but 2 minutes to tokenise 700 lines is frankly ridiculous. And as you know, there's not a problem in the world that can't be solved by throwing more compute at it, so how would this be achieved?

The code can be as slow as it wants to be, but it's using only 5% of my CPU; with parallelisation, any improvements would also be multiplied by core count, so I think this should be the priority currently.

I read that it is "trivial" to parallelise Earley parsing, but I'm unsure how, short of using an Earley parser to split the code and launching the split snippets in their own comprehensive parsers or something; there is little to no documentation on parallelising lark.
The easiest option to get it to run faster is to use PyPy instead of CPython.

I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?

First of all: tokenization is only a small part of parsing, and since switching from `dynamic_complete` to `dynamic` changed barely anything, it isn't a relevant factor. AFAIK, parsing is way harder to parallelize.

Second, we are talking about Python. Python is really hard to make work in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exist in lark, and I am not even sure that is possible to a useful degree.
> The easiest option to get it to run faster is to use PyPy instead of CPython.

I tried this; it won't use lark for some reason, it just keeps saying "module not found".

> I have no idea where you read it's "trivial" to parallelize parsing. How many "parallel" parsers do you know?

I read it here: https://news.ycombinator.com/item?id=28259458

> LL or LR is almost certainly faster then Earley, since in general more restricted = faster. Earley's algorithm should be able to be trivially parallelized

> First of all: tokenization is only a small part of parsing, and since switching from `dynamic_complete` to `dynamic` changed barely anything, it isn't a relevant factor. AFAIK, parsing is way harder to parallelize.

I was hoping there was some magic way to get it working, but it appears it is simply just as hard as it seems, so I guess parallelisation is out the window, besides tokenising separate files at the same time. However, this person was doing something with multithreading: https://github.com/lark-parser/lark/issues/493
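For what it's worth, per-file parallelism needs nothing from lark itself; a hedged sketch with `multiprocessing` (the grammar path and input files are hypothetical, and it assumes the returned trees pickle cleanly):

```python
from multiprocessing import Pool
from lark import Lark

GRAMMAR_PATH = "mylang.lark"   # hypothetical grammar file
parser = None

def init_worker():
    # Each worker process builds its own parser once, rather than
    # pickling a parser across process boundaries.
    global parser
    parser = Lark.open(GRAMMAR_PATH, parser="earley")

def parse_file(path):
    with open(path) as f:
        return path, parser.parse(f.read())

if __name__ == "__main__":
    files = ["a.ahk", "b.ahk", "c.ahk"]   # hypothetical inputs
    with Pool(initializer=init_worker) as pool:
        for path, _tree in pool.imap_unordered(parse_file, files):
            print(path, "parsed")
```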
> Second, we are talking about Python. Python is really hard to make work in parallel, especially for such a linear problem. Using more cores is not an option unless you want to create quite a large chunk of code that currently does not exist in lark, and I am not even sure that is possible to a useful degree.

So how would you propose further improving this? Can you see a way this can work in LALR, as that would gain an instant 10x speed increase (judging by online benches)?
@Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.
Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.
> @Bugz000 Do you realize every commit to Lark is also tested with PyPy? But no, you tried it, and it "doesn't work". Well, you lost my interest.

There's no need to get defensive. You also said earlier that I should "use full sentences" but edited it out... If you are getting upset because I simply stated the facts of the matter, then that is not my problem, but I thank you for your input regardless... however anemic said input was.

> Maybe it's possible to write it with LALR. Depends on the grammar. You seem to know all these parsing experts, I'm sure one of them can help you.

Again with the sarcasm; there is no need to get upset about this... I'm not continuing this reply, it's clear you have run out of constructive things to add and are now turning me into some strawman to attack rather than accepting you don't know the solution.

Has anyone else something constructive to add, without a generous dollop of attitude on top? I feel this guy has derailed the thread a little.
> If you are getting upset because I simply stated the facts

You concisely stated "it doesn't work", which is almost surely caused by you doing something wrong. But apparently you are unwilling to get help with that, so...

> I feel this guy has derailed the thread a little

"That guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing, being confused why Earley is so slow, and more or less demanding magical fixes. The answer is "use PyPy" and "use LALR instead". If the friend you claim to know has a suggestion for how to massively improve the speed of Earley, they are welcome to make it, preferably in the form of a PR.
> You concisely stated "it doesn't work", which is almost surely caused by you doing something wrong.

I pass the Python file to the exe as an argument... what else is there to do? It was my understanding that it is "plug n play".

> But apparently you are unwilling to get help with that, so...

Where did I say this?

> "That guy" is the primary dev of this project. We have a pattern of people not understanding enough about parsing, being confused why Earley is so slow, and more or less demanding magical fixes.

Dev or no dev, an attitude problem is an attitude problem; there's a nicer way to say everything than treating everyone as if they're stupid. I have been nothing but cordial and polite. Any animosity is their own, and only their own; I have taken no part in it whatsoever. In short: if they are upset, that's their problem.

I do not demand magical fixes; I was simply asking for suggestions and places to learn. I got a couple, while being treated like I'm stupid, then this final "I've lost interest". If I may give some advice in return: frankly, I think you lost interest in computers a long time ago.

> The answer is "use PyPy" and "use LALR instead". If the friend you claim to know has a suggestion for how to massively improve the speed of Earley, they are welcome to make it, preferably in the form of a PR.

Got it, neither of you know what's wrong. Closing thread.
> Got it, neither of you know what's wrong.
I know exactly what is wrong. You make mistake after mistake, but you're too strong-headed to admit you might be wrong.
I lost my patience because you kept blaming the tool for your own mistakes and shortcomings, you made crazy claims like "it is "trivial" to parallelise earley parsing" and then ignored me when I challenged you.
Sorry, but you're not in a position to give advice.
Best of luck with your project.
TLDR below.

This is my first time making a language; I have no formal education on the matter, so beware, I use layman's terms for all of this, but I've been toying with this idea for around 20 years at this point.

I am making an Earley parser for "AHK-like" syntax. It is not intended to be fully AHK compatible, however, but this will establish a lot of context at least.

It works great, but I found my tests were getting slower and slower. They only took about half a second to complete, but that was noticeable for the small script I was running, so I decided to run a 600-line script and it took a full 2 minutes 30 seconds to finish tokenising alone.

I have read around and asked friends, one of whom has written their own custom Earley parser, and they said there's no way it should be this slow.

I have reluctantly included the entire file I'm working with. It basically tokenises the whole file, converts the resulting tree to JSON, then parses the JSON, the idea being that tokenisation and execution can be separated, and the resulting JSON tree can be parsed by anything that supports JSON (assuming the code itself works within the limitations of the hardware), which is pretty cool.
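Presumably the tree-to-JSON step looks something like this (a sketch for illustration, not the code from the attached file):

```python
import json
from lark import Token

def tree_to_data(node):
    """Recursively turn a lark Tree/Token into plain JSON-friendly data."""
    if isinstance(node, Token):
        return {node.type: str(node)}   # terminal leaf: {TERMINAL: value}
    return {node.data: [tree_to_data(child) for child in node.children]}

# e.g. json.dumps(tree_to_data(parser.parse(source)), indent=2)
```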
It takes this tree and sends it into a recursive function within a class. It first iterates the whole tree finding all the function definitions and setting those up, then starts again from the top and recurses the tree, throwing each node's key to a function named after that key, doing work where required, dereferencing things and unpacking expressions as it goes by simply throwing the relevant parts of the tree back into the executor, and returning the result back up once it hits a TERMINAL. This much is plenty fast; the slow part is absolutely the tokenisation process. It can iterate the whole tree and execute many things, with many if statements for debug prints along the way, far faster than the tokenisation process, which is killing my project.
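That dispatch-by-key walk sounds roughly like the following (a hedged sketch of the pattern; the class and method names are invented, not the removed ASTexecutor):

```python
class Executor:
    """Walk the JSON tree, sending each node's key to a method of the
    same name; unhandled keys are recursed into, and terminal values
    bubble back up as plain results."""

    def run(self, node):
        if isinstance(node, dict):
            key, children = next(iter(node.items()))  # single-key nodes
            handler = getattr(self, key, None)
            return handler(children) if handler else self.run(children)
        if isinstance(node, list):
            result = None
            for child in node:
                result = self.run(child)
            return result
        return node  # a TERMINAL value

    def statement_assignment(self, children):
        # Hypothetical handler: evaluate sub-expressions by throwing
        # them back into the executor.
        return [self.run(child) for child in children]
```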
However, already with the limited syntax I'm finding the tokenisation very slow. I think the easiest increase would be to move to LALR, but I like how versatile Earley is, and I'm unsure if it's even possible to express AHK-like syntax in LALR due to its very unconventional nature.

Maybe I could do something funky like have one thread tokenising, and as each "statement" of the tree in "statements" is complete it pushes these to the main thread's queue. That way, as line 1 is executing, lines 2 and 3 are complete, then it execs line 2 while 5-6-7 and so on are in progress, but I think it will always be "waiting" on the tokenisation, as running the JSON is stupid fast.

I've heard tokenisation can be parallelised, and I do have 32 cores at my disposal (this language is intended for my personal use only), but I'm unsure how to achieve this.

Through some testing, with all but one line commented out, it still takes a while, so it seems the "volume" of the text given is part of the problem. Maybe the lookahead includes the entire rest of the file, so perhaps if it were to tokenise line-by-line, or only relevant blocks of code, it would be fast (a per-line sketch follows the timings below):

13 sec with all but one line commented, 20+ sec with all lines valid (way longer, about 2 mins), 1 sec with only one line.
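A hedged sketch of that line-by-line idea (it assumes the grammar lives in a file and exposes a `statement` rule that can serve as a start symbol; both are assumptions here):

```python
from lark import Lark

# Hypothetical: open the existing grammar but start at a single
# statement, then feed the source one line at a time, so Earley
# never sees the whole file at once.
line_parser = Lark.open("mylang.lark", start="statement", parser="earley")

def parse_per_line(source):
    trees = []
    for line in source.splitlines():
        if line.strip():                    # skip blank lines
            trees.append(line_parser.parse(line))
    return trees
```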
I thought perhaps if it were executing tokens as it went it would be faster, but this would make outputting and ingesting JSON trees a problem, and I can't see how it would be any faster than the total 2 min 30 sec to tokenise a large script; it would just be slowly executing things as it went.

It also seems Earley might take longer if the grammar is more ambiguous, but I'm unsure how to make my language more specific; I think it is quite specific.

TLDR

So my options are:

At this point, I have no idea what to do, or how to implement any of these changes. I considered the project dead, which is a huge shame, but I figured I'd post here and maybe get some insight/motivation to continue.

Per request below, I have removed the ASTexecutor and VARIABLE classes so the code will only report the tree; you will have to take my word that iterating the tree takes a fraction of a second.