ratmice / nimbleparse_lsp

lsp server for grmtools lex & yacc

impl Lexer for rope #9

Open ratmice opened 2 years ago

ratmice commented 2 years ago

Currently, when we go to lex, we make a &str from the rope and then use the usual Lexer::new. Especially when working with large files, it would be better to impl Lexer (or, perhaps more appropriately, NonStreamingLexer) for Rope directly, since cloning a Rope is cheap.

It is also likely that instead of being implemented for Rope, these need to be implemented for Chars, or in the case of NonStreamingLexer, using Lines. Generally this just affects what we need to newtype because of the orphan rules; I'm guessing it is actually Chars.
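
For illustration, a minimal sketch of the newtype shape the orphan rules force on us, assuming the ropey crate; the wrapper name and method here are made up and not part of any existing API:

```rust
// Rough sketch only: because both Rope (from the ropey crate) and the
// lexer traits are foreign to this crate, the orphan rules mean we have
// to wrap one of them in a newtype and implement the traits on that.
use ropey::Rope;

/// Hypothetical wrapper type; the real thing would also carry whatever
/// the lexer needs (lexemes found so far, newline offsets, etc.).
pub struct RopeInput(pub Rope);

impl RopeInput {
    pub fn new(rope: Rope) -> Self {
        // Cloning a Rope is cheap (it copies only the root node), so
        // callers can hand us a clone of the editor buffer.
        RopeInput(rope)
    }
}

// impl Lexer for RopeInput { ... }
// impl NonStreamingLexer for RopeInput { ... }
```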

I'm not really worried about copies of the lex/yacc files themselves, e.g. LRNonStreamingLexerDef and producing that from a rope (or YaccGrammar either); the traits don't seem to support that anyway.

ratmice commented 2 years ago

FWIW, it looks like this would require reimplementing LRNonStreamingLexerDef. I'm not yet sure how much that entails exactly; the reason is primarily that we can't reimplement LRNonStreamingLexerDef::lexer while Rule::re is private.

I think, however, that I probably would have needed to reimplement that eventually anyway, in order to have access to spans for the name and re_str fields of LRNonStreamingLexerDef's Rules, referencing positions in the l_file.

ratmice commented 2 years ago

@ltratt any chance you could have a look at the branch https://github.com/ratmice/nimbleparse_lsp/tree/issue9 before I go off into the weeds and start reimplementing LRNonStreamingLexerDef,

because

  1. there would be a lot less duplication if this could be done in grmtools with a rope_lexer feature/function;
  2. adding Spans to name and re_str doesn't seem terribly controversial.

Question: besides the above, there is a simpler question about whether span_lines_str looks ok. I interpreted the docs to mean the span contains line numbers rather than byte indices, but the docs and the warnings about byte indices therein seem misleading to me.

ltratt commented 2 years ago

I'm not sure I've got enough context to understand the question. It might be relevant to note that the reason for the unwieldy LRNonStreamingLexerDef name is because one day I imagined lrpar supporting LRStreamingLexerDef (and possibly other variants) which would be more annoying to use but more flexible (for those who need it). Is that the sort of thing you're working towards here?

Question: besides the above, there is a simpler question about whether span_lines_str looks ok. I interpreted the docs to mean the span contains line numbers rather than byte indices, but the docs and the warnings about byte indices therein seem misleading to me.

Spans are always defined in terms of bytes, not characters or lines -- there's no other way to do efficient string things with utf-8. Note that line_col returns (line, char) indexes but its input Span must still be in terms of bytes.

ratmice commented 2 years ago

Sorry, these were mostly notes to self until I ran into the need for a private API; I'll try to explain more clearly:

Rope is a tree-like data structure for strings: instead of storing a string in a contiguous buffer, it stores slices of the string in a tree. In addition to its slice, each node has some metadata, like its offset/length in bytes and the number of lines in the slice. This allows for basically O(log n) editing and a very quick clone, by copying the parent node, which has just a pointer to the rest of the tree; mutation is done by replacing the edited node with a new node.
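
To make that concrete, a small sketch assuming the ropey crate (illustrative only, not code from the LSP):

```rust
use ropey::Rope;

fn main() {
    // Build a rope; internally the text is split across tree leaves,
    // each node caching byte/char/line counts for its subtree.
    let mut rope = Rope::from_str("fn main() {\n    println!(\"hello\");\n}\n");

    // Cloning only copies the root node, so snapshotting the buffer
    // before a parse is effectively free.
    let snapshot = rope.clone();

    // Edits are O(log n): only the affected leaf and its ancestors are
    // replaced; `snapshot` still sees the old text.
    rope.insert(rope.len_chars(), "// trailing comment\n");
    assert_ne!(rope.len_bytes(), snapshot.len_bytes());

    // The text is exposed as a sequence of &str chunks rather than one
    // contiguous buffer, which is exactly what makes handing it to a
    // &str-based lexer awkward.
    for chunk in snapshot.chunks() {
        print!("{chunk}");
    }
}
```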

In that way, regarding Streaming/NonStreaming, it is no different from a string, so it would qualify as a NonStreamingLexer too. I don't actually want to reimplement LRNonStreamingLexerDef; I was just going to reuse the existing LRNonStreamingLexerDef, and tried to implement the LRNonStreamingLexer and Lexer traits for Rope (this is LRNonStreamingRopeLexer), but the constructor for LRNonStreamingLexer is LRNonStreamingLexerDef::lexer, which needs the private rule.re regex field. I can perhaps just create the re directly from the re_str (I need to put fresh eyes on that to see whether it could be done efficiently without recreating regexes multiple times).

Some background: in the LSP we store data internally using Rope; the editor sends us 'incremental' changes to text, just the edited text and its position. After that, we call parse_file, which currently converts the rope using to_string, making a clone of the entire string.

So this was aimed at just lexing directly on the rope, bypassing that initial clone, but still non-streaming.
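
Roughly, the current path being described looks like the hedged sketch below; parse_file is the function named above, but the body here is indicative only, and the grmtools calls appear only as comments:

```rust
use ropey::Rope;

// Hedged sketch of the current behaviour: the whole rope is flattened
// into one contiguous String on every reparse, and only then handed to
// the lexer/parser.
fn parse_file(rope: &Rope /*, lexerdef, grammar, ... */) {
    // This copies the entire document, even after a one-character edit.
    let input: String = rope.to_string();

    // With grmtools this would then be roughly:
    //   let lexer = lexerdef.lexer(&input);
    //   ... run the parser over `lexer` ...
    // The goal of this issue is to lex the rope's chunks directly and
    // skip the `to_string` copy above.
    let _ = input;
}
```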

Spans are always defined in terms of bytes, not characters or lines -- there's no other way to do efficient string things with utf-8. Note that line_col returns (line, char) indexes but its input Span must still be in terms of bytes.

Ahh, I get it now: it doesn't interpret the span as lines, but pads the beginning and end of the returned string out to line boundaries. I believe I totally missed the parenthetical in the docs, which is very clear. Apologies.

ratmice commented 2 years ago

Edit: removed an old comment, it doesn't seem relevant anymore.

I've kind of given up for the time being on doing this without modifying the lrlex library (e.g. by deriving traits from lrlex from within nimbleparse_lsp). Using a custom Lexer impl doesn't seem possible without stabilising a bunch of things between LRNonStreamingLexerDef and the impl of Lexer that don't want to be stabilised.

I did, though, produce a nice small branch of lrlex with a rope feature: https://github.com/ratmice/grmtools/tree/rope_lexer https://github.com/ratmice/nimbleparse_lsp/tree/issue9-take2

I need to do performance comparisons and sanity testing, but so far it appears to work.

Edit: it is definitely not correct sometimes (hitting an unwrap panic), but it's a start. I fixed the implementation; unfortunately it is a serious performance regression. Consider a list of chunks for the string "abcdef", ["abc", "def"]: because of the way the regex works, it ends up copying a.., b.., c.., so we actually end up doing more copying to rebuild strings to pass to regex.

So at this point we are better off converting to strings and doing a single copy up front.
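
As a hedged illustration of the regression (the function below is made up for the example): the regex crate only matches against a contiguous &str, so a naive rope lexer ends up rebuilding an owned String of the remaining text for each candidate start position, which quickly costs more than the single up-front to_string copy.

```rust
// Illustrative only; assumes ASCII-ish input so byte indices fall on
// char boundaries.
fn naive_rope_lex(chunks: &[&str]) {
    let text_len: usize = chunks.iter().map(|c| c.len()).sum();
    for start in 0..text_len {
        // Rebuild the tail "start.." as one owned String so the regex
        // engine can see it contiguously.
        let mut tail = String::new();
        let mut skipped = 0usize;
        for chunk in chunks {
            if skipped + chunk.len() <= start {
                skipped += chunk.len();
            } else {
                let from = start.saturating_sub(skipped);
                tail.push_str(&chunk[from..]);
                skipped += chunk.len();
            }
        }
        // ... try the longest-match regexes against `tail` here ...
        // For "abcdef" stored as ["abc", "def"] this copies
        // "abcdef", "bcdef", "cdef", ...: more total copying than the
        // single up-front to_string it was meant to avoid.
        let _ = tail;
    }
}

fn main() {
    // "abcdef" stored as two chunks, as in the comment above.
    naive_rope_lex(&["abc", "def"]);
}
```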

There are some comments linking to some bug reports, which have links to others' work on regex over ropes. I'm not aware of anything stable, though; I'll have to look through it more.

ratmice commented 2 years ago

So, after a working but performance-regressing implementation, I have what I consider to be 3 options:

  1. try a regex based on regex derivatives; this would get us an LRStreamingLexer, I suppose, such as Matt Might describes.
  2. double down on this being not a streaming problem but an iteration problem in the regex crate. But since the iterator isn't equipped to iterate over borrowed slices, and then re-iterate over them in a next().next().prev() == next() sense, I'd need to move to basically a LendingIterator over slices, where next() and prev() are bijective, perhaps exact-size (see the sketch after this list).
  3. shelve it for the time being, since both options seem like quite a bit of work.
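
A rough, hypothetical sketch of the lending-iterator shape from option 2; the trait and type names are illustrative only, not an existing grmtools or regex API:

```rust
// Hypothetical: a lending iterator over borrowed slices that can also
// step backwards, so a regex engine could resume or re-scan across
// chunk boundaries without copying.
trait LendingSliceIter {
    /// Advance to the next slice; the borrow is tied to `&mut self`,
    /// which is what makes this "lending" rather than a plain Iterator.
    fn next(&mut self) -> Option<&str>;

    /// Step back to the previous slice; calling next() after prev()
    /// must yield the same slice again.
    fn prev(&mut self) -> Option<&str>;
}

/// A trivial impl over a fixed list of chunks, e.g. rope leaves.
struct ChunkCursor<'a> {
    chunks: &'a [&'a str],
    idx: usize,
}

impl<'a> LendingSliceIter for ChunkCursor<'a> {
    fn next(&mut self) -> Option<&str> {
        let c = *self.chunks.get(self.idx)?;
        self.idx += 1;
        Some(c)
    }

    fn prev(&mut self) -> Option<&str> {
        self.idx = self.idx.checked_sub(1)?;
        Some(self.chunks[self.idx])
    }
}

fn main() {
    let mut cur = ChunkCursor { chunks: &["abc", "def"], idx: 0 };
    assert_eq!(cur.next(), Some("abc"));
    assert_eq!(cur.next(), Some("def"));
    assert_eq!(cur.prev(), Some("def"));
    // Re-reading after stepping back yields the same slice again.
    assert_eq!(cur.next(), Some("def"));
}
```
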
ltratt commented 2 years ago

A few quick thoughts -- these might be irrelevant or obvious but just in case:

  1. IMHO it's easier to write a good parser than it is a good lexer: there's more weird variability in lexers than parsers. lrlex only works well for certain styles of languages. That's why the lexing API is inside lrpar, not lrlex: the expectation is that users will want to substitute very different lexers.
  2. Parsing mostly shouldn't care whether it's operating on a string or something merely string-like. It might be possible to generalise the APIs that lrpar exposes such that you can pass them either a string or a rope.
  3. The reason why cfgrammar and lrtable are separate from lrpar is because I think a) LR parsing isn't all parsing (cfgrammar was built to be extendable for non-LR parsers) b) lrpar won't cover all types of LR parsing well (lrtable, however, is hopefully good enough for nearly all types of LR parsing). It might be that lrpar is too restrictive for you, but that you can still make use of cfgrammar and lrtable as-is.
ratmice commented 2 years ago

So I had a thought about this one which is probably worth trying, if only to tell us how much improvement lexing ropes might bring without bringing in a bunch of regex changes.

If the version of rope we used ended slices on a word boundary, e.g. between a word and the whitespace after it, you should be able to lex over slices, because the vast majority of regular expressions would no longer span two slices.

It probably entails building a rope with this property, but that seems a lot simpler than building a regex engine which can continue across slices :shrug: Edit2: Incremental regular expressions.

Edit: I guess the difficulty here may be finding a good boundary when performing an edit operation; it could be that you end up splitting an edit across slices to enforce the slice-boundary property in a way which makes you do an extra split. Another difficulty is strings, and things such as comments or blobs, which may contain whitespace...
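
A hedged sketch of the boundary-finding part (an illustrative policy only, not code from the branch): when inserting edited text, choose the split point so each stored slice ends right after its last whitespace character; then, for lexers whose rules never match across whitespace, each slice could be lexed independently.

```rust
// Find the last whitespace character; everything up to and including
// it stays in the left slice, and the trailing partial word moves to
// the right slice (one possible policy among several).
fn split_on_word_boundary(s: &str) -> (&str, &str) {
    match s.char_indices().rev().find(|(_, c)| c.is_whitespace()) {
        Some((i, c)) => s.split_at(i + c.len_utf8()),
        // No whitespace at all: keep it whole, which in the real thing
        // would mean falling back to the full string copy.
        None => (s, ""),
    }
}

fn main() {
    let (left, right) = split_on_word_boundary("let x = 42; let y");
    assert_eq!(left, "let x = 42; let ");
    assert_eq!(right, "y");
}
```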

ltratt commented 2 years ago

If the version of rope we used ended slices on a word boundary, e.g. between a word and the whitespace after it

If you want to support whitespace sensitive languages (e.g. Python or Haskell) then this might not work. Parsing is... fun ;)

ratmice commented 2 years ago

Indeed, it is just an optimization for a particular configuration of the lexer regexes; outside that configuration we'd fall back to the full string copy like we currently use. But I don't think it works with whitespace-sensitive languages currently anyway, because generally you'd require lexer actions to increment counters.

Unfortunately, I think C++-style comments probably run afoul of this particular configuration of the regexes, at least if we want both forward and backward scanning to work. Anyhow, I think it is probably more difficult than worthwhile to try to detect this dynamically from any given set of regexes, but I do think it would apply to quite a few languages.

ltratt commented 2 years ago

I've read through the issue again and I think I might understand it slightly better now (but I might be wrong!).

I think what you're hitting is a semi-implemented aspect of lrpar. The ultimate aim is that you should be able to parse a file by implementing just Lexer::iter which doesn't care how you store things underneath (this is very deliberate). However, at the moment, lrpar::parser requires you to implement the much bulkier NonStreamingLexer trait. The reason for this is that I only realised late in the day that Lexer::iter could/should be separated out.

So it might be interesting to work out how to do parsing with just the Lexer trait. If we're lucky it might be a very simple change, but I honestly don't remember off-hand!
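
To make the distinction concrete, here is a rough, hypothetical sketch, not lrpar's actual trait definitions, of why an iter-only interface suits a rope better than NonStreamingLexer:

```rust
// Hypothetical, simplified shapes for illustration only.
struct Lexeme {
    tok_id: u32,
    start: usize, // byte offset into the input
    len: usize,
}

// An iter-only lexer just hands the parser a stream of lexemes, so a
// rope-backed impl can walk chunks internally and never needs the
// input as one contiguous &str.
trait IterOnlyLexer {
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Lexeme> + 'a>;
}

// By contrast, a NonStreamingLexer-style interface also has to hand
// back &'input str slices of the whole input (span_str and friends),
// which ties everything to a single contiguous buffer and one 'input
// lifetime.
```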

ratmice commented 2 years ago

Yeah, this issue is going to be tough, because it is really hitting unfinished work on both the regex front and the lexer front. It is probably worth tabling the idea entirely within the scope of nimbleparse_lsp, even though string copies represent the bulk of its profiling.

So it's probably worth trying this in, e.g., an ad-hoc lexer with an ad-hoc regex crate specifically for that purpose, and only then, once I've got something working, thinking about how we could integrate it into upstream crates like grmtools and regex?

ltratt commented 2 years ago

So it's probably worth trying this in, e.g., an ad-hoc lexer with an ad-hoc regex crate specifically for that purpose

That does seem like a sensible plan. I will have a look to see if I can easily update lrpar to do "the right thing", but it might exceed my time budget.

ltratt commented 2 years ago

I've had a look into this and, sort of amusingly, NonStreamingLexer will cause us some problems because it's both non-streaming and gives users an 'input lifetime that they can use in action code: in a sense those are two orthogonal things squashed into one. I need to think about if/how these could/should be separated out.