osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Allow initializing lexers with a character iterator #41

Closed osa1 closed 2 years ago

osa1 commented 2 years ago

Currently, in the main branch and in the versions released on crates.io, generated lexers need to be initialized with a &str argument for the input.

It's also useful to initialize lexers with an Iterator<Item = char> + Clone argument, as I do in the from_iter branch (used in my unannounced text editor project). The problem is that without a &'input str we can't return a &'input str for the current match.

Instead what we can return in match_ is the byte and character bounds of the current match. The user can then extract a string from the input if they have it as a &str.

Another alternative could be to maintain a String buffer in generated lexers, and return a slice to it (possibly with character and byte bounds of the match). The function will look like fn match_(&self) -> (&str, ... char and byte bounds ...). Users can then clone the string if they need to store it, or ignore it.
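To make the buffer alternative concrete, here is a minimal sketch. It is not lexgen's generated code: BufferedScanner, next_word and the whitespace-splitting rule are hypothetical stand-ins for a generated lexer, and only the shape of match_ (a slice of an internal String plus byte bounds) follows the description above.

```rust
/// Hypothetical stand-in for a generated lexer that owns a String buffer.
struct BufferedScanner<I: Iterator<Item = char> + Clone> {
    iter: I,
    /// Text of the current match, rebuilt on every call to `next_word`.
    buf: String,
    byte_start: usize,
    byte_end: usize,
}

impl<I: Iterator<Item = char> + Clone> BufferedScanner<I> {
    fn new(iter: I) -> Self {
        BufferedScanner { iter, buf: String::new(), byte_start: 0, byte_end: 0 }
    }

    /// Scans the next whitespace-separated word into the buffer.
    /// Returns false when the input is exhausted.
    fn next_word(&mut self) -> bool {
        self.buf.clear();
        self.byte_start = self.byte_end;
        loop {
            // Cloning the iterator gives one character of lookahead.
            let mut lookahead = self.iter.clone();
            match lookahead.next() {
                // Skip whitespace before the match starts.
                Some(c) if c.is_whitespace() && self.buf.is_empty() => {
                    self.iter = lookahead;
                    self.byte_end += c.len_utf8();
                    self.byte_start = self.byte_end;
                }
                // Extend the current match.
                Some(c) if !c.is_whitespace() => {
                    self.iter = lookahead;
                    self.byte_end += c.len_utf8();
                    self.buf.push(c);
                }
                // Whitespace after a word, or end of input.
                _ => return !self.buf.is_empty(),
            }
        }
    }

    /// Slice of the internal buffer plus the byte bounds of the match.
    fn match_(&self) -> (&str, usize, usize) {
        (self.buf.as_str(), self.byte_start, self.byte_end)
    }
}
```

The returned slice borrows from the scanner, so it is only valid until the next call to next_word. A caller that needs to keep the text would clone it with to_owned().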

I think the first alternative (don't return a slice or string, just byte and char bounds) may be good enough. If the user has the input as a string, they can use the byte indices returned by match_ to get the slice of the match. So it's possible to recover the current behavior using a lexer that takes Iterator<Item = char> + Clone as the input stream.
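As a rough illustration of the bounds-only alternative, the sketch below tracks byte and character offsets while consuming an Iterator<Item = char> + Clone. MatchBounds, WordScanner and next_match are hypothetical names for illustration, not part of lexgen, and the "lexer" only splits on whitespace.

```rust
/// Hypothetical location type: byte and character offsets of a match.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MatchBounds {
    byte_start: usize,
    byte_end: usize,
    char_start: usize,
    char_end: usize,
}

/// Toy scanner over any cloneable character iterator.
struct WordScanner<I: Iterator<Item = char> + Clone> {
    iter: I,
    byte_idx: usize,
    char_idx: usize,
}

impl<I: Iterator<Item = char> + Clone> WordScanner<I> {
    fn new(iter: I) -> Self {
        WordScanner { iter, byte_idx: 0, char_idx: 0 }
    }

    /// Records that `c` was consumed, updating both offsets.
    fn advance(&mut self, c: char) {
        self.byte_idx += c.len_utf8();
        self.char_idx += 1;
    }

    /// Returns the bounds of the next run of non-whitespace characters.
    fn next_match(&mut self) -> Option<MatchBounds> {
        // Skip leading whitespace, using a clone for lookahead.
        loop {
            let mut peek = self.iter.clone();
            match peek.next() {
                Some(c) if c.is_whitespace() => {
                    self.advance(c);
                    self.iter = peek;
                }
                Some(_) => break,
                None => return None,
            }
        }
        let (byte_start, char_start) = (self.byte_idx, self.char_idx);
        // Consume the word itself.
        loop {
            let mut peek = self.iter.clone();
            match peek.next() {
                Some(c) if !c.is_whitespace() => {
                    self.advance(c);
                    self.iter = peek;
                }
                _ => break,
            }
        }
        Some(MatchBounds {
            byte_start,
            byte_end: self.byte_idx,
            char_start,
            char_end: self.char_idx,
        })
    }
}
```

A user who does hold the input as a &str can build the scanner from input.chars() and recover the matched text with &input[m.byte_start..m.byte_end]. The byte bounds always fall on character boundaries because they are accumulated with char::len_utf8.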

osa1 commented 2 years ago

I just realized that we already have the method match_loc for getting the start and end locations (byte and char indices) of a match.

I think we could avoid breaking backwards compatibility by adding an opt-in codegen option to the macro for generating the lexer with an Iterator<Item = char> + Clone argument. Changes needed: