Parsing - Githubissues

tmedwards commented 4 years ago

There should be exactly one core parsing engine for SugarCube—rather than umpteen ones scattered here and there—that may be reused for various parsing needs via separate grammars—i.e., core parsing, macro argument parsing, etc.

The parser must be performant. Parsing times should either stay roughly the same or noticeably improve versus the current crop of parsers from v2.

Minor, domain specific parsers may remain where sensible.

greyelf commented 4 years ago

Would you consider making the parsing engine a module so that other story format developers could use it as well.

tmedwards commented 4 years ago

It will definitely be a module. How tightly coupled it will be to other parts of SugarCube remains to be seen. I'm also unsure who'd want to use it regardless, but it will be open source, so….

I haven't even decided exactly what kind of parser it's going to be yet. I mean, preferably it would be something similar to the character-based lexer (src/markup/lexer.js) + actions style SugarCube currently uses to parse macro arguments and square bracket markup, but I'm afraid that won't be performant enough.

I know for a fact that SugarCube's macro argument parsing took a performance hit when I switched from the single regular expression based parser to the lexer based one. Whether that's because the latter is just slower or because the former wasn't doing a proper job, thus they're not really comparable, I can't say.

This is going to require some testing.

ChapelR commented 4 years ago

I think having a relatively stand-alone module would be neat and allow the easier development of certain code analysis tools. That said, it's not like someone couldn't extract it if they really wanted to, so it'd probably depend on how easy that winds up being.

Uzume commented 4 years ago

I haven't even decided exactly what kind of parser it's going to be yet. I mean, preferably it would be something similar to the character-based lexer (src/markup/lexer.js) + actions style SugarCube currently uses to parse macro arguments and square bracket markup, but I'm afraid that won't be performant enough.

I know for a fact that SugarCube's macro argument parsing took a performance hit when I switched from the single regular expression based parser to the lexer based one. Whether that's because the latter is just slower or because the former wasn't doing a proper job, thus they're not really comparable, I can't say.

Sadly, it seems like code history was lost when SugarCube moved from Bitbucket hg as I cannot find the commit where the parser changed to support from a single regular expression to the current lexer. The initial commit seems to be based on v2.30.0 from December 25, 2019 and I find no history before that. I am sure that is at least in part due to spitting the repository into separate v1 and v2 repositories.

You should definitely consider using existing developed parsing tools. SugarCube has some interesting parsing needs. Since its currently implementation seems to employ a regular expression based lexer, the grammar is likely to be (at least mostly) regular in nature, however, with SugarCube macros and templates the language definition is not static in nature. Due to the dynamic grammar requirements that limits some of the possible tools, e.g., parser generators are normally targeted towards using a static language definitions to generate static parsers. A parser combiner library seems appropriate.

I am a fan tools that support of PEG over CFG and regular grammars but that might not be much of an issue for SugarCube if a fast consolidated parsing solution can be found or developed for its needs.

There is a good if somewhat dated (2017) blog entry on JavaScript parsing tools at: https://tomassetti.me/parsing-in-javascript/

I recommend looking at:

https://github.com/cdiggins/myna-parser recursive descent parsing and there are examples for markdown and HTML which is not too far from SugarCube "wiki" markup
https://github.com/jneen/parsimmon monadic LL parsing with an API based on the renown Haskell "parsec" library

They each seem fairly performant based upon the JSON parsing benchmarks (designed to be run yourself in your browser) at: https://sap.github.io/chevrotain/performance/

LL parsing has more historic research, but recursive descent parsing has made a come back with PEG and performance innovations. For those reasons I tend to favor recursive descent parsing and given a choice of which of the above to invest time on I would choose myna but I am sure either could prove to be a good solution.

The other option would be to try tweeking your existing code, benchmarking changes to see how each affects performance. Perhaps you can find what is slowing down the current implementation vs. the the earlier one.

tmedwards / sugarcube-3

Parsing #4