Replaced the `String` in `Token` with `&'a str`

sharkdp / numbat

A statically typed programming language for scientific computations with first class support for physical dimensions and units

https://numbat.dev

Apache License 2.0

1.26k stars 53 forks source link

Replaced the `String` in `Token` with `&'a str` #559

Closed rben01 closed 2 months ago

rben01 commented 2 months ago

Unfortunately due to borrow checker limitations, this required moving input fields out of both Parser and Tokenizer, as with the immutable borrow in place, there is no way to tell Rust that a mutable borrow won't touch the input. The underlying issue is that returning a Token<'_> that borrows from &self really trips up the borrow checker in a way that a non-borrowing Token doesn't. So, the input is now provided as a shared ref argument to all the methods that used to refer to &self.input (now either a &str or a &[Token]) And there were a lot... a-lot-a-lot... But now Token doesn't carry an owned String Also shrunk Tokenizer by doing all math in terms of byte offsets into its input (using existing SourceCodePosition fields) instead of storing a separate Vec<char> with char indices

Also made ForeignFunction.name a &'static str instead of a String.

No new tests, but all existing tests pass

sharkdp commented 2 months ago

Really nice! As you saw in the TODO comment, I have been meaning to do this eventually.

I think this will also enable a lot of downstream optimizations if we push this even further (with things referring to the original source instead of cloning stuff).

Also shrunk Tokenizer by doing all math in terms of byte offsets into its input (using existing SourceCodePosition fields) instead of storing a separate Vec<char> with char indices

Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like SourceCodePosition. And only if an error is shown to the user, then you do the actual work of computing line/position from the byte offset. This makes Spans much smaller (currently 32 byte). And Spans are everywhere.

So, the input is now provided as a shared ref argument to all the methods that used to refer to &self.input (now either a &str or a &[Token])

Yeah, it's not great. It's maybe not too bad either, so I will just merge your PR as is. Thank you very much for this valuable contribution!

rben01 commented 1 month ago

Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like SourceCodePosition. And only if an error is shown to the user, then you do the actual work of computing line/position from the byte offset. This makes Spans much smaller (currently 32 byte). And Spans are everywhere.

Right, sounds like you could store just the byte offset in a Span, and separately store the cumsum of line lengths. Then to find where a byte offset goes, binary search to find the largest line length cumsum less than the byte offset — that's your line number. Position in the line is byte offset minus that cumsum.

This doesn't sound that bad, just annoying. Worth me giving it a shot? I suppose the questions is where to store the cumsum of line lengths. Probably in the same place the lines themselves are stored? (I don't actually know where that is, but obviously it exists because it's used to print error messages.)

rben01 commented 1 month ago

Hey, it looks like only the removal of ctx.dimension_registry().clone() was merged, not the whole PR. Was this intentional? (Or am I misunderstanding GitHub’s UI? GitHub says my branch is 12 commits ahead of master.)

sharkdp commented 1 month ago

Your branch is properly integrated (check out the commit history of this repo). I rebased your branch on top of master instead of creating a merge commit. The rebase creates new commits which are not identical (and have a different hash) compared to your local commits. This is probably why you see "N commits ahead of..".