kueblc / LDT

Lightweight in browser syntax highlighting
https://kueblc.github.io/LDT/

Unable to lookahead/lookbehind across tokens #16

Open SuperCuber opened 3 years ago

SuperCuber commented 3 years ago

Say I want to make a parser for a Markdown file that can also contain tags/annotations like @foo(bar). I want to color the contents of bar, but only if there's a @foo before it, so I tried something like this:

        var parser = new Parser({
            whitespace: /\s+/,
            tagKey: /@\w+/,
            // lookbehind: only match (bar) when it is preceded by a @foo tag
            tagValue: /(?<=@\w+)\([^\)]+\)/,
            other: /\S/
        });

But this did not work. When I read the API for a parser, I realized that it probably first breaks the input into separate tokens and then applies the regexes to recognize the type of each token.

Is there a way to achieve what I want, or does the API not support this? Maybe you could add support for parsers that can do both .tokenize() and .identify() in one go, which would allow this?

SuperCuber commented 3 years ago

When I think about it, I'm not quite sure why tokenize and identify are different steps - don't you know the type of a token when you produce it? It seems strange to me; I might be missing something.

kueblc commented 3 years ago

Hi @SuperCuber, thanks for sharing your thoughts.

For performance reasons, the current parser does not support look behind or look ahead regular expressions.

You might be able to work around this using the CSS adjacent sibling selector, i.e.

.tagKey + .tagValue {
    color: red;
}

Let me know if this works for you and I can add this to the examples.

When I think about it, I'm not quite sure why tokenize and identify are different steps - don't you know the type of a token when you produce it? It seems strange to me; I might be missing something.

We construct a single RegExp from the individual rules, which allows for very fast tokenization, but does not tell us which source rule produced the token. We then run identify only on the tokens that changed. This was the most performant model explored when originally developing this library in ~2010, though I am certainly open to exploring other options.
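In pseudo-JS, the model looks roughly like this (a simplified sketch of the description above, not the actual LDT source):

    // Simplified sketch of the two-phase model described above (not LDT's
    // actual implementation): tokenize uses one combined RegExp, identify
    // re-tests each token against the individual rules after the fact.
    function makeParser(rules) {
        var names = Object.keys(rules)
        // fast tokenizer: alternation of all rules, but it cannot report
        // which alternative produced a given match
        var combined = new RegExp(names.map(function (n) {
            return '(?:' + rules[n].source + ')'
        }).join('|'), 'g')
        // anchored copies of each rule, used to classify a token later
        var anchored = names.map(function (n) {
            return new RegExp('^(?:' + rules[n].source + ')$')
        })
        return {
            tokenize: function (src) { return src.match(combined) || [] },
            identify: function (token) {
                for (var i = 0; i < names.length; i++)
                    if (anchored[i].test(token)) return names[i]
            }
        }
    }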

I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.

The focus of this library is to be lightweight and fast enough to do live highlighting on low end machines, so any enhancements that have the potential to impact file size or performance must be optional. We can include multiple parsing models and let the end developer choose one which fits the performance and capability requirements of their project.

SuperCuber commented 3 years ago

We can include multiple parsing models and let the end developer choose one which fits the performance and capability requirements of their project.

I think this is the way to go. The most straightforward way I see to do this is to add a method that does both steps (on Parser it would just call tokenize then identify, giving basically no performance loss plus backwards compatibility for custom parsers).

Then, if you choose to implement new parsers, you can do it in a new file, making the whole set of changes opt-in not only on the API level but also on the file size level.
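A rough sketch of that wrapper (the name parse and the return shape are made up here, not part of the current API; it assumes Parser exposes tokenize and identify on its prototype):

    // Hypothetical convenience method: chains the two existing steps, so
    // custom parsers that only implement tokenize/identify keep working.
    Parser.prototype.parse = function (source) {
        var identify = this.identify.bind(this)
        return this.tokenize(source).map(function (token) {
            return { token: token, type: identify(token) }
        })
    }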

kueblc commented 3 years ago

The most straightforward way I see to do this is to add a method that does both steps (on Parser it would just call tokenize then identify

I appreciate your input. I believe we will need to keep these separate so that we are not running identify in cases where tokens have not changed. Let me do some deeper thinking on this.

Did you try the workaround using CSS sibling selectors? Let me know if this is a viable strategy or if there are issues I am not seeing with this approach.

SuperCuber commented 3 years ago

Did you try the workaround using CSS sibling selectors?

It's going to be a bit fiddly. For example, if I wanted to color only the text inside the brackets, I'm not sure how I would do that, since I want it to continue until the closing bracket, but a regex of [^\)]+ would match stuff from the start of the whole text... And if I tried something like having a .word, that wouldn't help much either, since I can't say .tagKey + .leftParen + .word* in CSS.

Also, there isn't a way in CSS to look ahead (according to this).

SuperCuber commented 3 years ago

Let me do some deeper thinking on this

So did you come up with some ideas? I'm interested in this feature.

kueblc commented 3 years ago

Yes, I have a few ideas I'm experimenting with. It might take me a bit due to work and life; I'm pretty booked up for the next week or so.

One idea is to pass identify some sort of context, such as the preceding and following tokens, the position in the original string, or index in the token stream. I'd want to test these ideas for functionality and performance in most of the supported browsers.

Feel free to share any of your own ideas. I'll hopefully have something to roll out by the end of the month.

SuperCuber commented 3 years ago

preceding and following tokens

I assume that at least the following tokens have not been through .identify yet, so if you wanted to do that you'd need to re-implement identifying those tokens inside your identify function, which is weird...

the position in the original string

I don't think that helps much unless you expect the implementation of the function to then go to the original string and try breaking it into tokens on its own, then locate the tokens it wants to check based on the character's index? That seems really convoluted and inefficient to me

I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.

This seems interesting; can you elaborate on how you imagine it would be used (say I want to write my own grammar)? Maybe you can leverage an existing library for it, or even allow users to somehow hook up their library/implementation of choice...

kueblc commented 3 years ago

preceding and following tokens

I assume that at least the following tokens have not been through .identify yet, so if you wanted to do that you'd need to re-implement identifying those tokens inside your identify function, which is weird...

There may be a misunderstanding in how the process works. tokenize is responsible for breaking up the original string into an array of substrings. Then the (unidentified) substrings are individually passed to identify which determines the token type.

tokenize should already be able to handle look behind/ahead, but the identify step will fail because it lacks context. So the idea is to provide the substring to identify, along with the substrings that surround it. The result of identify does not depend on other results, so it doesn't matter whether we run identify on the surrounding substrings first or at all.
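For the @foo(bar) case from the original post, a context-aware identify might look something like this (hypothetical signature; the extra prev/next arguments are the proposal being discussed, not the current API):

    // Hypothetical: identify receives the neighbouring substrings, so the
    // lookbehind from the original tagValue rule becomes a plain check on prev.
    function identify(token, prev, next) {
        if (/^\s+$/.test(token)) return 'whitespace'
        if (/^@\w+$/.test(token)) return 'tagKey'
        if (/^\([^)]+\)$/.test(token) && prev && /^@\w+$/.test(prev)) return 'tagValue'
        return 'other'
    }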

the position in the original string

I don't think that helps much unless you expect the implementation of the function to then go to the original string and try breaking it into tokens on its own, then locate the tokens it wants to check based on the character's index? That seems really convoluted and inefficient to me

We can leverage RegExp to test for a match at a given position. This is just one idea, it will certainly need to be evaluated for performance.
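As an illustration (not a committed design), a sticky RegExp can test a rule at an exact offset in the original string:

    // Test whether `rule` matches starting exactly at `position` in `source`.
    // The 'y' (sticky) flag anchors the match attempt to lastIndex.
    function matchesAt(rule, source, position) {
        var re = new RegExp(rule.source, 'y')
        re.lastIndex = position
        return re.test(source)
    }

Engines without sticky support would need a 'g' RegExp plus a check that the resulting match index equals the requested position.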

I'm also exploring performant grammar parsing, which may also address your needs and open the door to all sorts of advanced introspection capabilities.

This seems interesting; can you elaborate on how you imagine it would be used (say I want to write my own grammar)? Maybe you can leverage an existing library for it, or even allow users to somehow hook up their library/implementation of choice...

This is still very much in the exploration phase. A lot of work remains to be done to make this possible to do in real time. I'm looking into ways to produce partial parse trees and partial updates. Also investigating doing fast (best guess) updates in between full updates to make this possible.

SuperCuber commented 3 years ago

the idea is to provide the substring to identify, along with the substrings that surround it. The result of identify does not depend on other results

But it does - if it's not looking at the other substrings, then what's the point? What I mean is, say I want to color the type of the variable in this Rust snippet:

let my_var: MyType = create_new_mytype();

I'd need to define .tokenize in a very basic way - for example, break on every word boundary. Then I'd need to, in .identify("MyType"), check the previous token and see that it's :, then check the one before and see that it's a proper variable identifier, and check the one before and see that it's let, then check the one after and see that it's =...

So a lot of my logic for identifying a type token is actually logic for identifying the other tokens... Which IMO is a bit strange.

I guess this is just a bit too complex for a basic regex-based parser and you'd want proper grammar-based parsers for this kind of thing... But then an .identify with context doesn't really serve much purpose IMO - it's unneeded for a simple regex parser and not strong enough for a more complicated one.

A lot of work remains to be done to make this possible to do in real time.

I see, I didn't really take into account that these parsers need to work at actual typing speed, so a generic "give me a string and I'll break it into parts" parser just won't be fast enough.

I'm looking into ways to produce partial parse trees and partial updates. Also investigating doing fast (best guess) updates in between full updates to make this possible.

Yeah the parser would need to support half-typed unrecognized (yet) tokens, maybe even in the middle of the text... Seems like a pretty hard problem to do it efficiently.


I guess what I'm looking for in the meantime is a way to make it "just work" with my more complex logic, because it simply doesn't fit into the two-step process... If I could define my own function that updates the colored elements on input, for example, I could use that to at least make it work, even if slowly.

pygy commented 8 months ago

@kueblc I have an idea for a solution that should be reasonably performant, and open the door to optimizations on update. It would involve either prohibiting captures in the lexer, or parsing the regexps to identify captures and take them into account. This can be done with little code:

// taken from https://github.com/compose-regexp/compose-regexp.js
// it is reasonably well tested
const captureMatcher = /\\[^]|\(\?(?::|<?[=!])|[\[\](]/g

function countCaptures(regex) {
    let count = 0, inCClass = false, result
    captureMatcher.lastIndex = 0
    while (result = captureMatcher.exec(regex.source)) {
        if (inCClass) {
            // inside a character class, only an unescaped ] ends it
            if (result[0] === ']') inCClass = false
        } else {
            // a bare ( opens a capturing group; (?: and lookarounds are
            // consumed whole by captureMatcher and never reach this branch
            if (result[0] === '(') count += 1
            else if (result[0] === '[') inCClass = true
        }
    }
    return count
}

From there, you can add an empty capture after the matcher of each lexeme, and look for captures.indexOf("", 1) and map it back to the correct category. That would let you merge the tokenize and identify phases (with regex.exec(src) in a loop as I do above).

If you keep an IR around (an array of {from, kind, match} objects), you can then, when the content changes, identify the range of tokens that are affected, and re-tokenize only that subset (unless the tokenization overflows, e.g. if a quote was inserted).

Array.prototype.indexOf can be easily emulated if IE8 compat still matters.

I have another suggestion for the parser. You could do away with requiring users to provide the final other: /\S/ category by automatically appending |() at the end of the lexer. That would make the lexer behave like a sticky regexp. If that last capture matches, it means the other rules failed. Bump the lastIndex and retry. Leave the corresponding span without a class.

Edit: I just realized that the above scheme only works if the lexers provided by the user don't contain empty captures (edit again, actually, it can be worked around).

Edit again: concretely, Parser({a:/a/, b:/b/}) would use scanner = /a()|b()|()/g. You can set its lastIndex and call result = scanner.exec(source). When matching "a", the result is a match array that looks like ['a', '', undefined, undefined]. A downside is that an array is created for each match, which can make quite a bit of garbage. One could abuse String.prototype.replace for doing the scans, but that would entail using new Function() in order to create a "replacer" dynamically with the right number of parameters, which is a no-no in many scenarios.
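To tie the pieces together, here is a sketch of that merged scan (illustrative only; it assumes the individual rules contain no capture groups of their own and never match the empty string, otherwise the countCaptures helper above is needed to map capture indices back to rules):

    // Merged tokenize+identify: one empty capture per rule, plus a trailing
    // |() catch-all for characters that no rule matches.
    function makeScanner(rules) {
        var kinds = Object.keys(rules)
        var scanner = new RegExp(
            kinds.map(function (k) { return '(?:' + rules[k].source + ')()' }).join('|') + '|()',
            'g')
        return function scan(source) {
            var tokens = [], result
            scanner.lastIndex = 0
            while (scanner.lastIndex < source.length && (result = scanner.exec(source))) {
                // the position of the empty capture tells us which rule matched
                var which = result.indexOf('', 1) - 1
                if (which < kinds.length) {
                    tokens.push({ from: result.index, kind: kinds[which], match: result[0] })
                } else {
                    // catch-all matched: emit an unclassified one-character token
                    // and bump lastIndex so the scan keeps advancing
                    tokens.push({ from: result.index, kind: null, match: source[result.index] })
                    scanner.lastIndex = result.index + 1
                }
            }
            return tokens
        }
    }

With the rules {a:/a/, b:/b/} from the example above, scan("ab!") would yield [{from:0, kind:'a', match:'a'}, {from:1, kind:'b', match:'b'}, {from:2, kind:null, match:'!'}].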