rben01 closed this 2 months ago
Really nice! As you saw in the TODO
comment, I have been meaning to do this eventually.
I think this will also enable a lot of downstream optimizations if we push this even further (with things referring to the original source instead of cloning stuff).
Also shrunk Tokenizer by doing all math in terms of byte offsets into its input (using existing SourceCodePosition fields) instead of storing a separate Vec<char> with char indices
Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like SourceCodePosition. And only if an error is shown to the user do you do the actual work of computing line/position from the byte offset. This makes Spans much smaller (currently 32 bytes). And Spans are everywhere.
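For illustration, a minimal sketch of the size difference, assuming a hypothetical layout (the real Span in this repo may have different fields):

```rust
// Hypothetical layouts for illustration; the real types may differ.
// Storing resolved line/column pairs at both ends (64-bit target):
struct SourceCodePosition {
    line: usize,   // 8 bytes
    column: usize, // 8 bytes
}

struct Span {
    start: SourceCodePosition,
    end: SourceCodePosition,
} // 32 bytes

// Byte-offset-only alternative: line/column are computed lazily,
// and only when an error is actually rendered.
struct ByteSpan {
    start: u32, // byte offset of the first byte
    end: u32,   // byte offset one past the last byte
} // 8 bytes
```

Since Spans are everywhere, the 4x size reduction adds up.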
So, the input is now provided as a shared ref argument to all the methods that used to refer to &self.input (now either a &str or a &[Token])
Yeah, it's not great. It's maybe not too bad either, so I will just merge your PR as is. Thank you very much for this valuable contribution!
Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like SourceCodePosition. And only if an error is shown to the user do you do the actual work of computing line/position from the byte offset. This makes Spans much smaller (currently 32 bytes). And Spans are everywhere.
Right, sounds like you could store just the byte offset in a Span, and separately store the cumsum of line lengths. Then to find where a byte offset goes, binary search for the largest line-length cumsum less than or equal to the byte offset; that gives your line number. The position in the line is the byte offset minus that cumsum.
This doesn't sound that bad, just annoying. Worth me giving it a shot? I suppose the question is where to store the cumsum of line lengths. Probably in the same place the lines themselves are stored? (I don't actually know where that is, but obviously it exists because it's used to print error messages.)
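A minimal sketch of that scheme (function names are hypothetical, not from this codebase). Note the line-start offsets are exactly the cumulative line lengths, with a leading zero for the first line:

```rust
/// Byte offset at which each line begins: built once per source
/// buffer and stored alongside it.
fn line_starts(source: &str) -> Vec<usize> {
    std::iter::once(0)
        .chain(source.match_indices('\n').map(|(i, _)| i + 1))
        .collect()
}

/// Resolve a byte offset to a 0-based (line, column-in-bytes) pair:
/// binary search for the last line start <= offset, then subtract.
fn resolve(line_starts: &[usize], offset: usize) -> (usize, usize) {
    // partition_point returns the index of the first start > offset,
    // so the containing line is the one just before that index.
    let line = line_starts.partition_point(|&start| start <= offset) - 1;
    (line, offset - line_starts[line])
}

fn main() {
    let source = "let x = 1\nlet y = 2\n";
    let starts = line_starts(source);
    assert_eq!(resolve(&starts, 0), (0, 0));  // start of line 0
    assert_eq!(resolve(&starts, 14), (1, 4)); // the 'y' on line 1
}
```

Building the table is a single O(n) pass per source buffer, and each lookup is O(log lines), paid only on the error path.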
Hey, it looks like only the removal of ctx.dimension_registry().clone() was merged, not the whole PR. Was this intentional? (Or am I misunderstanding GitHub's UI? GitHub says my branch is 12 commits ahead of master.)
Your branch is properly integrated (check out the commit history of this repo). I rebased your branch on top of master instead of creating a merge commit. The rebase creates new commits which are not identical to your local commits (and have different hashes). This is probably why you see "N commits ahead of…".
Unfortunately, due to borrow checker limitations, this required moving the input fields out of both Parser and Tokenizer, as with the immutable borrow in place, there is no way to tell Rust that a mutable borrow won't touch the input. The underlying issue is that returning a Token<'_> that borrows from &self really trips up the borrow checker in a way that a non-borrowing Token doesn't. So, the input is now provided as a shared ref argument to all the methods that used to refer to &self.input (now either a &str or a &[Token]). And there were a lot... a-lot-a-lot... But now Token doesn't carry an owned String.
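A minimal, self-contained sketch of the problem and the workaround, using simplified hypothetical types rather than the actual Parser/Tokenizer code:

```rust
struct Token<'a> {
    lexeme: &'a str,
}

// With the input stored in the struct, the returned Token borrows
// from `self`, so `self` stays mutably borrowed while the token lives:
#[allow(dead_code)]
struct OwningTokenizer {
    input: String,
    pos: usize,
}

#[allow(dead_code)]
impl OwningTokenizer {
    fn next_token(&mut self) -> Token<'_> {
        let lexeme = &self.input[self.pos..=self.pos];
        self.pos += 1;
        Token { lexeme }
    }
}

// With the input threaded through as an argument, the Token's
// lifetime is tied to the input instead of the tokenizer, so the
// borrow checker keeps handing out `&mut self` without complaint:
struct Tokenizer {
    pos: usize,
}

impl Tokenizer {
    fn next_token<'src>(&mut self, input: &'src str) -> Token<'src> {
        let lexeme = &input[self.pos..=self.pos];
        self.pos += 1;
        Token { lexeme }
    }
}

fn main() {
    let mut tok = Tokenizer { pos: 0 };
    let a = tok.next_token("ab");
    let b = tok.next_token("ab"); // fine: `a` does not borrow `tok`
    println!("{} {}", a.lexeme, b.lexeme);

    // The equivalent code with OwningTokenizer is rejected:
    //   let a = tok.next_token();
    //   let b = tok.next_token(); // error[E0499]: cannot borrow `tok`
    //   println!("{}", a.lexeme); //   as mutable more than once
}
```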
Also shrunk Tokenizer by doing all math in terms of byte offsets into its input (using existing SourceCodePosition fields) instead of storing a separate Vec<char> with char indices.
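A sketch of the technique with hypothetical free functions (not the actual Tokenizer methods): the cursor is a plain byte offset, and the str API handles UTF-8 directly.

```rust
fn peek(input: &str, offset: usize) -> Option<char> {
    input[offset..].chars().next()
}

fn advance(input: &str, offset: &mut usize) -> Option<char> {
    let c = peek(input, *offset)?;
    *offset += c.len_utf8(); // a char occupies 1 to 4 bytes
    Some(c)
}

fn main() {
    let input = "π = 3.14";
    let mut offset = 0;
    let start = offset;
    advance(input, &mut offset); // consumes 'π' (2 bytes in UTF-8)
    // Slicing by byte offset is O(1) and needs no Vec<char> or
    // char-index bookkeeping:
    assert_eq!(&input[start..offset], "π");
    assert_eq!(offset, 2);
}
```

This also avoids materializing a second copy of the input up front just to get indexable characters.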
Also made ForeignFunction.name a &'static str instead of a String.

No new tests, but all existing tests pass.
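For built-in functions whose names are known at compile time, the &'static str variant points at the string literal baked into the binary and avoids a heap allocation per function. A sketch (only the name field is from the PR; the rest is made up for illustration):

```rust
struct ForeignFunction {
    name: &'static str, // was: String
    arity: usize,       // hypothetical extra field
}

// String literals are &'static str, so this allocates nothing and
// can even be a const:
const ABS: ForeignFunction = ForeignFunction { name: "abs", arity: 1 };
```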