ccleve opened this issue 2 years ago
Could you provide an example of the sort of transformations you're looking to perform? Without further information, it sounds like you need a pest grammar that defines what you expect to see in all runtime cases, and then apply your runtime logic at the stage where you consume the parse result into an AST. Potentially this is an easier way of thinking about it than first having a lexer and secondarily a parser.
Sure. This is for a search engine.
The first step is to tokenize the text before indexing it. A simple way to do it is with a whitespace tokenizer, but there are much more complex and involved ways.
The second step is to expose the tokens to the developer at run time for arbitrary transformations. For example, you may want to strip apostrophes: "O'Brien" -> "obrien". For some terms, you don't strip: "don't". For French, you do a completely different transform: "L'something" -> "le something". We'll need the flexibility to have the developer supply an arbitrary transformation function: transform(list_of_tokens) -> list_of_tokens.
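For illustration only, a minimal sketch of what such a developer-supplied transform could look like in Rust; the Token type and the special-casing of "don't" are made up for the example:

```rust
// Hypothetical token type; a real one would carry offsets, metadata, etc.
#[derive(Debug, Clone)]
struct Token {
    text: String,
}

// Developer-supplied transform: tokens in, tokens out.
// Strips apostrophes ("O'Brien" -> "obrien") but leaves "don't" intact.
fn transform(tokens: Vec<Token>) -> Vec<Token> {
    tokens
        .into_iter()
        .map(|t| {
            if t.text.eq_ignore_ascii_case("don't") {
                Token { text: t.text.to_lowercase() }
            } else {
                Token { text: t.text.replace('\'', "").to_lowercase() }
            }
        })
        .collect()
}
```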
Then, if this text is actually for a query, we build an AST. "word AND (word OR word)" needs to be assembled into an AST for execution.
In the past I did it with JFlex for lexing and JavaCC for parsing. I later did it again using Antlr for both.
It would be very cool if Pest could accept a list of tokens for the parser stage.
So from my understanding you'd have an initial basic grammar which essentially just strips whitespace, as you say, and perhaps a rule called "token". This parses your input into a super basic AST which is just a Vec<Token>. Your developers then write a Vec<Token> -> Vec<Token> transform. The naïve approach would then be to serialise this and have a second grammar for your query building.
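Something like the following is a rough sketch of that first-stage grammar, assuming a recent pest/pest_derive with the grammar_inline attribute; the rule names are only illustrative:

```rust
use pest::Parser;
use pest_derive::Parser;

// Sketch of a first-stage grammar whose only job is whitespace tokenization.
#[derive(Parser)]
#[grammar_inline = r#"
WHITESPACE = _{ " " | "\t" | "\r" | "\n" }
token      = @{ (!WHITESPACE ~ ANY)+ }
tokens     =  { SOI ~ token* ~ EOI }
"#]
struct BasicTokenizer;

// Produce the Vec<Token> (here just Vec<String>) that developers transform.
fn tokenize(input: &str) -> Vec<String> {
    BasicTokenizer::parse(Rule::tokens, input)
        .expect("tokenization failed")
        .flatten()
        .filter(|pair| pair.as_rule() == Rule::token)
        .map(|pair| pair.as_str().to_string())
        .collect()
}
```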
I think if you wanted to avoid this serialisation you would need to know about your developers' manipulations at compile time, but I wouldn't expect the performance cost to be noticeable either way.
What might be a nicer approach could be for your developers to write a pest grammar for each thing they care about: for example, you could have a grammar for French, and another for English etc etc. Then do the same approach as above but this way your developers have a bit of an easier time manipulating the Token stream. Does that make sense?
The external developers will have the option to write transformers in other languages like Python or JavaScript. They can use Rust if they need speed, other languages if they don't. So it won't be possible to require them to use Pest.
Some other parsers handle this by accepting input other than &str. Here's a chart: https://github.com/kevinmehall/rust-peg#comparison-with-similar-parser-generators
The peg crate takes anything. The nom crate works over any input type that implements its input traits. Lalrpop takes only &str, but it has a way to plug in an external lexer.
Yes, I could serialize the tokens to single characters with one char per token type, make a &str, parse that, and then post-process to retrieve the original token, but I'd like to avoid that.
So you have some string input; this gets pre-processed by an arbitrary language before coming to Rust for your query parser. You'd then need to pass this over to Rust (FFI or via some IO), and from that stage the input is essentially a string from Rust's perspective.
I think then what you're saying (correct me if I'm wrong) is that you don't want to lose the information your developers have created, as they'll have grouped things together into tokens. For example, using your previous example, a developer might have taken the input l'hôpital and translated it into two tokens, le and hôpital.
If you define to them that the output of their function must be tokens separated by whitespace would that work? In the example above you'd tell the developer to output "le hôpital".
If you needed more information than that about each token you could flesh this out a bit instead of it just being whitespace separated, and then provide a target spec for your developers.
The transformer does more than just process text. It attaches metadata to tokens as well. Yes, there is an ffi interface that sends tokens back and forth as structs.
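For illustration, a hedged sketch of what a token struct crossing that FFI boundary might look like; the field names and the transform_tokens entry point are hypothetical, not the actual interface:

```rust
use std::os::raw::c_char;

// Hypothetical #[repr(C)] token so it can be passed across the FFI boundary
// to transformers written in Python, JavaScript, Rust, etc.
#[repr(C)]
pub struct FfiToken {
    pub text: *const c_char, // UTF-8, nul-terminated
    pub start: u32,          // byte offset in the original input
    pub end: u32,
    pub flags: u32,          // metadata bits attached by the transformer
}

// Hypothetical entry point: the host hands over a token array and receives
// a (possibly different-length) transformed array back.
extern "C" {
    fn transform_tokens(
        tokens: *const FfiToken,
        len: usize,
        out_len: *mut usize,
    ) -> *mut FfiToken;
}
```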
For this to work, Pest would need to accept not just characters, but tokens that implement some kind of trait, just like nom, peg, lalrpop, bison, antlr and javacc. They all do it. It's a pretty common use-case.
Yeah, so with the current pest setup you'd need to have your metadata in the stream, defined by your pest grammar. I agree this forces a serialise + deserialise round trip which shouldn't really be necessary.
I think the way to work through this would be to replace the argument of parse with an impl trait and work through the compiler errors. It would be backwards compatible too, in theory, because we could implement it for &str.
Might be worth benchmarking it in your use case to see how much time it actually costs you and whether it matters, but I agree it would be a good improvement.
Any thoughts on this?
Long time!
Honestly I have not really been using pest much (hence the long wait) but I thought I'd make a dumb start to generate some ideas. Please see my fork here.
I created a trait ParseInput with the idea being to use &impl ParseInput in place of &str.
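Roughly the shape of the idea, as a sketch of the direction only (not the actual contents of the fork; the method names here are invented):

```rust
// Sketch only: the operations the parser state might need from its input.
pub trait ParseInput {
    fn input_len(&self) -> usize;
    // A real design would probably need an associated slice type here
    // rather than returning &str.
    fn slice(&self, start: usize, end: usize) -> &str;
    fn match_string(&self, pos: usize, string: &str) -> bool;
}

// Implemented for str so existing &str callers keep working.
impl ParseInput for str {
    fn input_len(&self) -> usize {
        self.len()
    }
    fn slice(&self, start: usize, end: usize) -> &str {
        &self[start..end]
    }
    fn match_string(&self, pos: usize, string: &str) -> bool {
        self[pos..].starts_with(string)
    }
}

// A token stream could then implement the same trait, and the parser entry
// point would take `&impl ParseInput` (or a generic `I: ParseInput`)
// instead of `&str`.
```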
I've only spent about an hour on it, and so far I'm just churning through compiler errors; no actual decision making yet. I can keep chipping away at it when I get the chance. Feel free to take a look, yourself or anyone else.
However, it would be good to get the perspective of someone who actually works on pest rather than just me hitting things with a hammer...
My hunch is that separating scanner and parser would require substantial changes in the current codebase (which is scannerless, with operations on &str intermingled in the parser state, etc.). Perhaps it may be easier to try that in a fresh codebase step by step, e.g. the pest3 overhaul that was left unfinished:
https://github.com/pest-parser/pest3/blob/master/todos
https://github.com/pest-parser/pest/issues/333
My idea was that whilst it would be a lot of changes, they would be compiler driven and hence shouldn't be too complex.
However, it is true that there is no guarantee that an implementation of the ParseInput trait for @ccleve's token stream would be any faster than serialising to a string and passing that into Pest. I think it's good practice to accept more generic arguments just from an API design perspective, because you define exactly what you need, but it wouldn't necessarily solve this problem.
If indeed it wasn't significantly faster, then the design would have to be reworked, which would be significant work. I personally don't use Pest enough to justify continuing, but @ccleve, you may wish to, as you have a use-case.
Just wanted to leave a note here - this is something I would desire too, but for a different reason. I'm making a programming language and would eventually want to add support for compile-time macros. I feel like separating out the lexer/scanner and parser would be useful for this, as token/lexeme-based macros would be easier to write than syntax-tree-based macros. There would probably need to be a way to re-compute the syntax tree after macros have been executed, though. I could solve this issue by converting the pest Pairs into some custom data type, but meh.
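For what it's worth, a rough sketch of that conversion, flattening pest's output into a token/lexeme list that macros could rewrite; the Lexeme type is made up, while the Pairs/Pair calls are pest's existing API:

```rust
use pest::iterators::Pairs;
use pest::RuleType;

// Hypothetical lexeme type for token-level macro expansion.
#[derive(Debug, Clone)]
struct Lexeme<R> {
    rule: R,
    text: String,
    start: usize,
    end: usize,
}

// Flatten a pest parse result into a plain token stream that macros can
// rewrite before the syntax tree is rebuilt.
fn to_lexemes<R: RuleType>(pairs: Pairs<'_, R>) -> Vec<Lexeme<R>> {
    pairs
        .flatten()
        // Keep only leaf pairs, i.e. pairs with no inner children.
        .filter(|p| p.clone().into_inner().next().is_none())
        .map(|p| {
            let span = p.as_span();
            Lexeme {
                rule: p.as_rule(),
                text: p.as_str().to_string(),
                start: span.start(),
                end: span.end(),
            }
        })
        .collect()
}
```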
Just wanted to bump this issue up. I saw this post on Hacker News today:
https://tratt.net/laurie/blog/2023/why_split_lexing_and_parsing_into_two_separate_phases.html
See also the discussion:
In my app I need to tokenize some text, apply some transformations on it, and then generate an AST. The types of transformations are known only at run time.
In a previous iteration of this app I had a separate lexer and parser, and applied the transformations in the middle. It worked well.
It's not clear how to do this in Pest. How do I supply the output of a custom tokenizer to Pest to generate the ASTs?