Efficient parsing of UTF-8

vrtbl / passerine

A small extensible programming language designed for concise expression with little code.

https://passerine.io

MIT License

1.05k stars 38 forks source link

Efficient parsing of UTF-8 #41

Open Plecra opened 3 years ago

Plecra commented 3 years ago

https://github.com/vrtbl/passerine/blob/3796465ae108bba8b3a66f5c3ab71709a76b3fc8/src/compiler/lex.rs#L152-L158

UTF-8's ability to represent ASCII in-place allows lexers for ASCII characters to be implemented extra-efficiently by skipping decoding of utf-8 bytes. It's fine to simply scan directly through the bytes (and a crate like https://docs.rs/memchr can vectorize this) for characters like ' ' and '\t'

slightknack commented 3 years ago

Right now, the lexer is very inefficient, as it doesn't use a state machine. I think we should refactor it; this should be a viable optimization.

As for pulling a crate like https://docs.rs/memchr, Passerine's core depends only on Rust's standard library and has no external dependencies. For this reason, if we were to do vectorization, I suggest we extract the core concept of the library to vectorize as needed.

Thanks for pointing this out! much appreciated :D

slightknack commented 2 years ago

@Plecra I recently completely redid the lexer in big-refactor. Would you mind taking a look at it and letting me know what style/performance improvements can be made? I suspect there's something with the way we're using peekable iterators that could be improved:

https://github.com/vrtbl/passerine/blob/f8fabd0e364aaeaf1bbd882e2f69f81c4ad26f8e/src/compiler/lex.rs#L288-L387

Plecra commented 2 years ago

It's awfully late, but I just saw this in my email, and I like it 😆. The token representation and parsing logic all flows very nicely. If parsing ever becomes a bottleneck, passerine could return to the once/chain/peekable pattern, since rustc will chug on it a bit.