Open ratmice opened 2 years ago
FWIW, it looks like this would require reimplementing `LRNonStreamingLexerDef`; I'm not yet sure how much that entails exactly. The reason is primarily that we can't reimplement `LRNonStreamingLexerDef::lexer` while `Rule::re` is private. I think, however, that I probably would have needed to reimplement that eventually anyway, in order to have access to spans for `Rule`'s `name` and `re_str` fields of `LRNonStreamingLexerDef`, referencing positions in the `l_file`.
@ltratt any chance you could have a look at the branch https://github.com/ratmice/nimbleparse_lsp/tree/issue9 before I go off into the weeds and start reimplementing `LRNonStreamingLexerDef`? I ask because `rope_lexer`, `feature/functionname`, and `re_str` don't seem terribly controversial.

Question: besides the above questions, there is a simple question about whether `span_lines_str` looks ok. I interpreted the docs to mean the span contains line numbers rather than byte indices, but the docs and their warnings about byte indices seem misleading.
I'm not sure I've got enough context to understand the question. It might be relevant to note that the reason for the unwieldy `LRNonStreamingLexerDef` name is that one day I imagined lrpar supporting an `LRStreamingLexerDef` (and possibly other variants), which would be more annoying to use but more flexible for those who need it. Is that the sort of thing you're working towards here?
> Question: besides the above questions, there is a simple question about whether `span_lines_str` looks ok, I interpreted the docs to mean the span is containing line numbers rather than a byte index, but the docs and warnings about byte indices therein seem misleading.
`Span`s are always defined in terms of bytes, not characters or lines -- there's no other way to do efficient string things with utf-8. Note that `line_col` returns (line, char) indexes but its input `Span` must still be in terms of bytes.
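A small illustration of why spans have to be byte-based in UTF-8 text: byte and character indices diverge as soon as the input contains a multi-byte code point, so a span of character offsets would be ambiguous (and expensive to resolve).

```rust
fn main() {
    let s = "héllo";
    // 'é' is 2 bytes in UTF-8, so byte and char indices diverge.
    assert_eq!(s.len(), 6); // byte length
    assert_eq!(s.chars().count(), 5); // char count
    // A byte span 0..3 covers "hé"; slicing at byte 2 would panic
    // because offset 2 falls inside the 'é' code point.
    assert_eq!(&s[0..3], "hé");
    println!("byte len = {}, char count = {}", s.len(), s.chars().count());
}
```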
Sorry, these were mostly notes to self until I ran into the need for a private API; I'll try to explain more clearly:
`Rope` is a tree-like data structure for strings: instead of storing a string in one contiguous buffer, it stores slices of the string in a tree. In addition to its slice, each node carries some metadata, such as its offset/length in bytes and the number of lines in the slice. This allows for basically O(log n) editing, and a very quick clone: cloning copies just the parent node, which holds a pointer to the rest of the tree, while mutation replaces the edited node with a new one.
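The structure described above can be sketched in a few lines. This is a toy model, not the actual rope crate's API: leaves hold text, internal nodes cache the byte length of their subtree, and cloning is cheap because subtrees are shared via `Rc`:

```rust
use std::rc::Rc;

// Toy rope: leaves hold string slices; internal nodes cache metadata
// (here just total byte length) so queries don't walk the whole tree.
#[derive(Clone)]
enum Node {
    Leaf(Rc<str>),
    Concat { left: Rc<Node>, right: Rc<Node>, len_bytes: usize },
}

impl Node {
    fn leaf(s: &str) -> Node {
        Node::Leaf(Rc::from(s))
    }
    fn concat(l: Node, r: Node) -> Node {
        let len_bytes = l.len_bytes() + r.len_bytes();
        Node::Concat { left: Rc::new(l), right: Rc::new(r), len_bytes }
    }
    fn len_bytes(&self) -> usize {
        match self {
            Node::Leaf(s) => s.len(),
            Node::Concat { len_bytes, .. } => *len_bytes,
        }
    }
    fn to_string(&self) -> String {
        match self {
            Node::Leaf(s) => s.to_string(),
            Node::Concat { left, right, .. } => left.to_string() + &right.to_string(),
        }
    }
}

fn main() {
    let rope = Node::concat(Node::leaf("abc"), Node::leaf("def"));
    let cheap_clone = rope.clone(); // copies two Rc pointers, not the text
    assert_eq!(rope.len_bytes(), 6);
    assert_eq!(cheap_clone.to_string(), "abcdef");
}
```

A real rope also tracks line counts per node and rebalances, but the sharing shown here is what makes clone-on-edit cheap.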
In that way, regarding `Streaming`/`NonStreaming`, a rope is no different from a string, so it would also qualify as a `NonStreamingLexer`. I don't actually want to reimplement `LRNonStreamingLexerDef`; I was just going to reuse the existing `LRNonStreamingLexerDef`, and tried to implement the `LRNonStreamingLexer` and `Lexer` traits for `Rope` (this is `LRNonStreamingRopeLexer`). But the constructor for `LRNonStreamingLexer` is `LRNonStreamingLexerDef::lexer`, which has the `rule.re` regex field private. I can perhaps just create the `re` directly from the `re_str` (I need to put fresh eyes on that to see whether it could be done efficiently without recreating regexes multiple times).
Some background: in the LSP we store data internally using `Rope`. The editor sends us 'incremental' changes to the text, just the edited text and its position. After that, we call `parse_file`, which currently converts the rope using `to_string`, making a clone of the entire string. So this was aimed at lexing directly on the rope, bypassing that initial clone, but still non-streaming.
> `Span`s are always defined in terms of bytes, not characters or lines -- there's no other way to do efficient string things with utf-8. Note that `line_col` returns (line, char) indexes but its input `Span` must still be in terms of bytes.
Ahh, I get it now: it doesn't interpret the span as lines, but pads the beginning and end of the returned string out to line boundaries. I believe I totally missed the parenthetical in the docs, which is very clear. Apologies.
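That padding behaviour can be sketched as follows. This is a hypothetical std-only helper illustrating the semantics, not grmtools' actual implementation: the input span stays in bytes, and only the returned string is widened to line boundaries.

```rust
// Given a byte span, extend it backwards to just after the previous
// '\n' and forwards to the next one, returning the covered lines.
fn span_lines_str(input: &str, start: usize, end: usize) -> &str {
    let line_start = input[..start].rfind('\n').map(|i| i + 1).unwrap_or(0);
    let line_end = input[end..].find('\n').map(|i| end + i).unwrap_or(input.len());
    &input[line_start..line_end]
}

fn main() {
    let src = "first line\nsecond line\nthird line";
    // The span covers only the bytes of "cond" (byte offsets 13..17)...
    assert_eq!(&src[13..17], "cond");
    // ...but the returned string is padded out to the whole line.
    assert_eq!(span_lines_str(src, 13, 17), "second line");
}
```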
Edit: removed old comment, it doesn't seem relevant anymore.
I've kind of given up for the time being on doing this without modifying the lrlex library, e.g. by deriving traits from lrlex from within nimbleparse_lsp. It doesn't seem possible to use a custom `Lexer` impl without stabilizing a bunch of things between `LRNonStreamingLexerDef` and the impl of `Lexer` that don't want to be stabilized.
I did, though, produce a nice small branch of lrlex with a rope feature: https://github.com/ratmice/grmtools/tree/rope_lexer https://github.com/ratmice/nimbleparse_lsp/tree/issue9-take2
I need to do performance comparisons and sanity testing, but so far it appears to work.
Edit: it is definitely not correct sometimes (hitting an unwrap panic), but it's a start.
I fixed the implementation; unfortunately it is a serious performance regression. Consider a list of chunks for the string "abcdef": `["abc", "def"]`. Because of the way the regex engine works, it ends up copying `a..`, `b..`, `c..`, so we actually end up doing more copying to rebuild strings from the chunks to pass to the regex. So at this point we are better off converting to a string and doing a single copy up front.
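The underlying problem can be shown in miniature: a regex engine that only matches on a contiguous `&str` cannot see a token straddling a chunk boundary, so the chunks must be flattened into one allocation first (which is exactly the copy being avoided).

```rust
fn main() {
    // The rope stores "abcdef" as two chunks.
    let chunks: Vec<&str> = vec!["abc", "def"];
    // No single chunk contains text spanning the boundary ("cd")...
    assert!(!chunks.iter().any(|c| c.contains("cd")));
    // ...so to hand a contiguous haystack to the regex engine we must
    // copy everything into one String, a full O(n) flatten.
    let flattened: String = chunks.concat();
    assert_eq!(flattened, "abcdef");
    assert!(flattened.contains("cd"));
}
```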
There are some comments linking to some bug reports, which have links to others' work on regex over ropes. I'm not aware of anything stable, though I'll have to look through it more.
So after a working but performance-regressing implementation, I have what I consider to be 3 options:
`next().next().prev() == next()` sense. I'd need to move to basically a `LendingIterator` over slices, where `next()` and `prev()` are bijective, perhaps exact-size.

A few quick thoughts -- these might be irrelevant or obvious but just in case:
So I had a thought about this one which is probably worth trying, at least if only to tell us how much improvement lexing ropes might bring without pulling in a bunch of regex changes.
If the version of rope we used ended slices on a word boundary, e.g. between a word and the whitespace after it, you should be able to lex over slices, because the vast majority of regular expressions would no longer span two slices.
It probably entails building a rope with this property, but that seems a lot simpler than building a regex engine which can continue across slices :shrug: Edit 2: incremental regular expressions.
Edit: I guess the difficulty here may be finding a good boundary when performing an edit operation; it could be that you end up splitting an edit across slices to enforce the slice-boundary property, in a way which makes you do an extra split. Another difficulty is strings, and things such as comments or blobs which may contain whitespace...
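The word-boundary chunking idea above can be sketched as a simple splitter. This is a hypothetical helper (assuming no token's regex needs to match across whitespace): chunks end just after a run of whitespace, so each chunk begins at a token start and a per-chunk regex scan never splits a typical token.

```rust
// Split `s` into chunks at whitespace->non-whitespace transitions, so
// every chunk boundary falls after a word plus its trailing whitespace.
fn split_at_word_boundaries(s: &str) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut prev_was_ws = false;
    for (i, c) in s.char_indices() {
        if prev_was_ws && !c.is_whitespace() {
            chunks.push(&s[start..i]);
            start = i;
        }
        prev_was_ws = c.is_whitespace();
    }
    if start < s.len() {
        chunks.push(&s[start..]);
    }
    chunks
}

fn main() {
    let chunks = split_at_word_boundaries("let x = 1;");
    // No token ("let", "x", "=", "1;") straddles two chunks.
    assert_eq!(chunks, vec!["let ", "x ", "= ", "1;"]);
    // Concatenating the chunks loses nothing.
    assert_eq!(chunks.concat(), "let x = 1;");
}
```

As the discussion below notes, this breaks down for tokens that legitimately contain whitespace (string literals, comments) and for whitespace-sensitive languages.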
> If the version of rope we used ended slices on a word boundary, e.g. between a word and the whitespace after it

If you want to support whitespace-sensitive languages (e.g. Python or Haskell) then this might not work. Parsing is... fun ;)
Indeed, it is just an optimization for a particular configuration of lexer regexes; otherwise we'd fall back to the full string copy like we currently use. But I don't think whitespace-sensitive languages work currently anyway, because generally you'll require actions to increment counters.
Unfortunately I think C++-style comments probably run afoul of this particular configuration of regexes, at least if we want both forward and backward scans to work. Anyhow, I think it is perhaps more difficult than worthwhile to try to detect this dynamically from any given set of regexes. But I do think it would apply to quite a few languages.
I've read through the issue again and I think I might understand it slightly better now (but I might be wrong!).
I think what you're hitting is a semi-implemented aspect of lrpar. The ultimate aim is that you should be able to parse a file by implementing just `Lexer::iter`, which doesn't care how you store things underneath (this is very deliberate). However, at the moment, `lrpar::parser` requires you to implement the much bulkier `NonStreamingLexer` trait. The reason for this is that I only realised late in the day that `Lexer::iter` could/should be separated out.
So it might be interesting to work out how to do parsing with just the `Lexer` trait. If we're lucky it might be a very simple change, but I honestly don't remember off-hand!
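The separation being described might look something like the following minimal sketch. The trait and type names here are hypothetical, modelled loosely on lrpar's but not its real API: parsing proper only needs a stream of lexemes, while span-to-string lookups need the whole input resident.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Lexeme {
    tok_id: u32,
    start: usize,
    len: usize,
}

// The only thing parsing itself needs: an iterator of lexemes. It
// doesn't care whether the input is a String, a Rope, or a socket.
trait Lexer {
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Lexeme> + 'a>;
}

// Extra capabilities (e.g. turning spans back into source text) that
// require the whole input to be in memory.
trait NonStreamingLexer: Lexer {
    fn span_str(&self, start: usize, len: usize) -> &str;
}

struct StrLexer<'input> {
    s: &'input str,
    lexemes: Vec<Lexeme>,
}

impl<'input> Lexer for StrLexer<'input> {
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Lexeme> + 'a> {
        Box::new(self.lexemes.iter().copied())
    }
}

impl<'input> NonStreamingLexer for StrLexer<'input> {
    fn span_str(&self, start: usize, len: usize) -> &str {
        &self.s[start..start + len]
    }
}

// A "parser" needing only `Lexer` accepts any storage strategy.
fn count_tokens(lexer: &dyn Lexer) -> usize {
    lexer.iter().count()
}

fn main() {
    let lexer = StrLexer {
        s: "ab cd",
        lexemes: vec![
            Lexeme { tok_id: 0, start: 0, len: 2 },
            Lexeme { tok_id: 0, start: 3, len: 2 },
        ],
    };
    assert_eq!(count_tokens(&lexer), 2);
    assert_eq!(lexer.span_str(3, 2), "cd");
}
```

A rope-backed lexer would implement `Lexer` the same way; only code needing `span_str` would be forced onto the non-streaming trait.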
Yeah, this issue is going to be tough, because it is really hitting unfinished work on both the regex front and the lexer front. It is probably worth tabling the idea entirely within the scope of nimbleparse_lsp, even though string copies represent the bulk of its profiling time.
So it's probably worth trying, e.g., in an ad-hoc lexer with an ad-hoc regex crate specifically for that purpose, and only then, once I've got something working, thinking about how we could integrate it into upstream crates like grmtools and regex?
> So it's probably worth trying e.g. in an ad-hoc lexer with an ad-hoc regex crate specifically for that purpose
That does seem like a sensible plan. I will have a look to see if I can easily update lrpar to do "the right thing", but it might exceed my time budget.
I've had a look into this and, sort of amusingly, `NonStreamingLexer` will cause us some problems because it's both non-streaming and gives users an `'input` lifetime that they can use in action code: in a sense those are two orthogonal things squashed into one. I need to think about if/how these could/should be separated out.
Currently when we go to lex, we make a `&str` from the rope, and then use the usual `Lexer::new`. Especially when working with large files, it'd be better to impl `Lexer`, or perhaps more appropriately `NonStreamingLexer`, for `Rope` directly, since cloning it is cheap. It is likely also that instead of being implemented for `Rope`, these need to be implemented for `Chars`, or in the case of `NonStreamingLexer`, using `Lines`. Generally this just affects what we need to newtype because of orphan rules. I'm guessing it is actually `Chars`.
I'm not really worried about copies of the lex/yacc files themselves, e.g. `LRNonStreamingLexerDef`, or about producing that from a rope (or a `YaccGrammar` either); the traits don't seem to support that anyway.