maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.85k stars 115 forks source link

Fix `Lexer::clone` leak and UB + tests #390

Closed Jakobeha closed 5 months ago

Jakobeha commented 5 months ago

Lexer::clone shouldn't clone the inner ManuallyDrop, because doing so clones the inner value, which is moved out in Lexer::next.

This causes use-after-free if the lexer is cloned after the last-returned token is dropped, especially if the token contains an overridden implementation of Clone (such as Rc) that tries to read the dropped data.

It causes a memory leak if the token contains a heap-allocated value, because cloning makes a new allocation. This allocation is in the ManuallyDrop and it's guaranteed to be overridden before the call to ManuallyDrop::take, so it's never freed.

Another thing: https://github.com/maciejhirsz/logos/issues/263 (make Lexer implement Copy) probably should be added (referencing here because it looks like the issue has been forgotten).

jeertmans commented 5 months ago

Hello @Jakobeha, thank you for this very comprehensive analysis!

But, if I understand well, then cloning Lexer becomes virtually useless?

Jakobeha commented 5 months ago

Cloning can be used as a kind of lookahead: the cloned lexer will return the exact same tokens that would be returned by the original lexer without advancing the original (provided the lexing functions are pure). One can also clone_from to cloned lexer on the original to "commit" the peeked tokens.

I'm not sure if it's the most efficient way to implement a lookahead (I don't know the performance vs storing an array of spans and tokens and then making next into a conditional), and source being cloned in this method is redundant, but I'm sure in most cases it's fast enough to not matter.

The same thing (creating a "lookahead" lexer) can also be done using Lexer::with_extras(original.remainder(), original.extras), or to keep the same span for lexed tokens,

let mut clone = Lexer::with_extras(original.source(), original.extras);
clone.bump(original.span().end);

Though it's less convenient, especially if the lexer is wrapped in another data structure, because then Clone on the outer structure has to be manually implemented.


Related, I also thought about Copy on Lexer, and personally I don't think it's a good idea, because it makes it easy to unintentionally copy the lexer instead of mutably borrowing it. Clone doesn't have this issue because it's explicit.

maciejhirsz commented 5 months ago

This is a pretty good catch!

FWIW whenever the mythical rewrite happens that turns all the internal gotos from functions to match branches over an enum in a loop, the token field can be replaced by just a variable on stack inside next.