orbitalquark / scintillua

Scintillua enables Scintilla lexers to be written in Lua, particularly using LPeg. It can also be used as a standalone Lua library for syntax highlighting support.
https://orbitalquark.github.io/scintillua
MIT License

Bash lexer: test (`[`, `[[`) flags check doesn't work and is also very slow #96

Closed: rnpnr closed this issue 9 months ago

rnpnr commented 1 year ago

One very common shell idiom is to have a test on a single line like the following:

[ $# -lt 3 ] && ...

Currently this is missed because the pattern requires a space before the match (i.e. `if [ ...`). This can be trivially fixed by:

diff --git a/lexers/bash.lua b/lexers/bash.lua
index f1365ae..439347b 100644
--- a/lexers/bash.lua
+++ b/lexers/bash.lua
@@ -71,7 +71,7 @@ local shell_op = '-o'
 local var_op = '-' * S('vR')
 local string_op = '-' * S('zn') + S('!=')^-1 * '=' + S('<>')
 local num_op = '-' * lexer.word_match('eq ne lt le gt ge')
-local in_cond_expr = in_expr{[' [[ '] = ' ]]', [' [ '] = ' ]'}
+local in_cond_expr = in_expr{['[[ '] = ' ]]', ['[ '] = ' ]'}
 local conditional_op = (num_op + file_op + shell_op + var_op + string_op) * #lexer.space *
   in_cond_expr

The slowness problem arises when there are many places in the input data where the in_expr() function needs to be called, for example when the input is the configure script from vis. I think the main problem is that using match() to split a block of input data into lines is really inefficient. I'm not sure how to improve it while maintaining functionality; perhaps there is a way of using lexer.range() to make it more efficient.
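For reference, here is the match-time function at the heart of in_expr() as it currently appears in lexers/bash.lua (lightly restructured as a sketch). Every time LPeg reaches a candidate operator, the function copies the entire prefix of the input and scans it for the last line break, so each call costs O(index) work and a long script approaches quadratic behaviour overall:

local lpeg = require('lpeg')

-- Sketch of the hot path in in_expr() (lexers/bash.lua); the match-time
-- function is re-entered at every candidate operator position.
local function in_expr(constructs)
  return lpeg.P(function(input, index)
    -- Copies index bytes and scans them for the current line on every
    -- call: O(index) per candidate, roughly O(n^2) over the whole file.
    local line = input:sub(1, index):match('[^\r\n]*$')
    for k, v in pairs(constructs) do
      local s = line:find(k, 1, true)
      if s then
        local e = line:find(v, 1, true)
        if not e or e < s then return true end -- inside an open construct
      end
    end
    -- Fall through: not inside any construct, so the match fails.
  end)
end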

orbitalquark commented 1 year ago

Thanks for the thoughtful report. I'll investigate this when I have some time.

rnpnr commented 1 year ago

Here is a patch that gives a minor improvement but doesn't completely solve the problem:

diff --git a/lexers/bash.lua b/lexers/bash.lua
index f1365ae..6e6ae90 100644
--- a/lexers/bash.lua
+++ b/lexers/bash.lua
@@ -54,11 +54,11 @@ local op = S('!<>&|;$()[]{}') + lpeg.B(lexer.space) * S('.:') * #lexer.space

 local function in_expr(constructs)
   return P(function(input, index)
-    local line = input:sub(1, index):match('[^\r\n]*$')
+    local data = input:sub(1, index)
     for k, v in pairs(constructs) do
-      local s = line:find(k, 1, true)
+      local s, se = data:find(k, 1, true)
       if not s then goto continue end
-      local e = line:find(v, 1, true)
+      local e = data:match('[^\r\n]*$', se):find(v, 1, true)
       if not e or e < s then return true end
       ::continue::
     end
@@ -71,7 +71,7 @@ local shell_op = '-o'
 local var_op = '-' * S('vR')
 local string_op = '-' * S('zn') + S('!=')^-1 * '=' + S('<>')
 local num_op = '-' * lexer.word_match('eq ne lt le gt ge')
-local in_cond_expr = in_expr{[' [[ '] = ' ]]', [' [ '] = ' ]'}
+local in_cond_expr = in_expr{['[[ '] = ' ]]', ['[ '] = ' ]'}
 local conditional_op = (num_op + file_op + shell_op + var_op + string_op) * #lexer.space *
   in_cond_expr
orbitalquark commented 1 year ago

I've implemented the original suggestion in https://github.com/orbitalquark/scintillua/commit/96371094cf663ee7b14f60db1edb2964b3d4e156.

I'm not seeing any slowness on my end trying to parse the linked configure script. I suspect it's an implementation detail of how text is fed to Scintillua. In my case, I'm not feeding in individual lines, but chunks of text that include multiple lines. The second patch suggests using local data = input:sub(1, index), but I think that would only work with a line-oriented approach. My chunk approach would end up matching constructs from earlier lines in a multi-line chunk and highlighting incorrectly as a result.

I am open to further improvements.

rnpnr commented 1 year ago

I'm not feeding in individual lines, but chunks of text that include multiple lines

That's exactly what we do as well; actually, we pass very large chunks, up to 32K. I think the main problem is that there used to be a cache and now there isn't, so every time we call lexers.load(...) we have to re-lex everything that we already lexed. We could check smaller chunks, but that means we have to lex much more often than before.

but I think that would only work with a line-oriented approach

Not really. That patch checks whether the data contains k before going and searching for the end of the line. If the data doesn't have k, then there isn't a reason to check that both k and v are on the same line. It's just a small optimization.

Are there any plans to bring back the global lexer cache? I think Scintillua is less robust without it. For example, someone made a spellchecking plugin for vis that wraps the lexer so that it checks for spelling mistakes only in comments and strings. That is no longer possible if we get a new lexer every time we call lexer.load().
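For context, such a wrapper might look like this (a hypothetical sketch: check_spelling and the flat tag/position token layout are assumptions):

local lexers = require('lexer')

-- With a global cache, lexers.load('bash') returns the same lexer table
-- to every caller, so wrapping lex() once affects all users of it.
local bash = lexers.load('bash')
local original_lex = bash.lex

bash.lex = function(self, text, ...)
  local tokens = original_lex(self, text, ...)
  -- Assuming lex() returns a flat {tag, end_pos, tag, end_pos, ...}
  -- list, spellcheck only the comment and string ranges.
  local pos = 1
  for i = 1, #tokens, 2 do
    local tag, e = tokens[i], tokens[i + 1]
    if tag == 'comment' or tag == 'string' then
      check_spelling(text:sub(pos, e - 1)) -- hypothetical helper
    end
    pos = e
  end
  return tokens
end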

orbitalquark commented 1 year ago

I'm not feeding in individual lines, but chunks of text that include multiple lines

That's exactly what we do as well; actually, we pass very large chunks, up to 32K. I think the main problem is that there used to be a cache and now there isn't, so every time we call lexers.load(...) we have to re-lex everything that we already lexed. We could check smaller chunks, but that means we have to lex much more often than before.

Forgive me, because I'm not very familiar with vis, but couldn't you cache both the lexer you're using for a particular buffer/document as well as the last correctly styled position? Then after buffer/document edits, you'd invoke the cached lexer either at the last correctly styled position or the edit position (whichever comes first). Scintilla/SciTE effectively does this (but backtracks to the start of the edited line), and I think Scintillua kind of assumes users do something like this too.
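A minimal sketch of that idea, assuming a generic editor plugin (the buffer fields and function names here are hypothetical, not vis or Scintilla API):

local lexers = require('lexer')

-- Per-buffer cache: the loaded lexer plus a watermark for the last
-- correctly styled byte offset.
local cache = {}

local function restyle(buffer, text, edit_pos)
  local entry = cache[buffer.id]
  if not entry then
    entry = {lexer = lexers.load(buffer.syntax), styled_to = 1}
    cache[buffer.id] = entry
  end
  -- Resume at the last correctly styled position or the edit position,
  -- whichever comes first (Scintilla additionally backtracks to the
  -- start of the edited line).
  local start = math.min(entry.styled_to, edit_pos)
  local tokens = entry.lexer:lex(text:sub(start))
  entry.styled_to = #text + 1 -- everything from start is styled again
  return tokens, start
end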

but I think that would only work with a line-oriented approach

Not really. That patch checks whether the data contains k before going and searching for the end of the line. If the data doesn't have k, then there isn't a reason to check that both k and v are on the same line. It's just a small optimization.

Applying your patch fails a unit test here: https://github.com/orbitalquark/scintillua/blob/ead19010d5390b1ef20bbc176793fc2b435b183a/tests.lua#L1153-L1158

The '/' in '/foo' is incorrectly highlighted as an arithmetic operator because (I think) "$((" was found in the lexed chunk, and the current line does not contain an ending "))", so the check succeeds. Without the patch, the current line does not contain a "$((", so '/' would not be considered an arithmetic operator.
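To make the failure concrete, a two-line chunk along these lines reproduces it (an illustration; the actual unit test input may differ):

local lexers = require('lexer')
local bash = lexers.load('bash', nil, false)

-- Line 1 opens and closes an arithmetic expansion; line 2 is a plain path.
local chunk = 'echo $((1 + 1))\nls /foo\n'
local tokens = bash:lex(chunk)
-- With the patch, when the lexer reaches '/' in '/foo', data:find() sees
-- the "$((" from line 1, but the current-line search for "))" finds
-- nothing on line 2, so in_expr() wrongly reports an open expression and
-- '/' gets styled as an arithmetic operator.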

Are there any plans to bring back the global lexer cache? I think Scintillua is less robust without it. For example, someone made a spellchecking plugin for vis that wraps the lexer so that it checks for spelling mistakes only in comments and strings. That is no longer possible if we get a new lexer every time we call lexer.load().

I am hesitant to bring back the lexer cache because I am concerned about lexer integrity. If a developer has their application use the cache, a user could "poison" that cached lexer, affecting all other uses of it. I'm not necessarily thinking of malicious uses, but weird things could happen that would be hard to debug. Your example of a spellchecking wrapper could cause unintended side effects.

Perhaps I am being paranoid and this is a non-issue. I am open to the idea, but not sold on it yet.

rnpnr commented 1 year ago

Forgive me, because I'm not very familiar with vis, but couldn't you cache both the lexer you're using for a particular buffer/document as well as the last correctly styled position?

For now I will re-add the caching behaviour locally to vis. Personally, I think it makes Scintillua much more robust and would be better kept here, but that's up to you. To me, it's on the user if they modify parts of the lexer and break something.

As for storing the last correctly styled position, it's a little more complicated in vis. vis' most powerful editing feature is using regexps to modify many parts of the file at once. After such an edit it's much easier from our end to just re-lex everything that is in view, plus extra data after the view, rather than selectively lex parts of the document. This works perfectly fine when the lexer is fast, which is still the case for all the lexers I have tested besides the bash lexer.

I wrote the following test to demonstrate this:

local lexers = require('lexer')

-- Read the whole configure script into memory.
local fp = io.open('configure', 'r')
local data = fp:read('*a')
fp:close()

-- Load the bash lexer and lex the script ten times, averaging the CPU
-- time reported by os.clock().
local total_time = 0
for i = 1, 10 do
    local stime = os.clock()
    local l = lexers.load("bash", nil, false)
    local toks = l:lex(data)
    total_time = total_time + os.clock() - stime
end

print("Average Lexing Time [s]: ", total_time / 10)

Again, I'm using vis' configure script, simply because it is a fairly typical non-trivial shell script. As for the results:

# Old Lexers
Average Lexing Time [s]:        0.0013101
# New Lexers
Average Lexing Time [s]:        0.4444334

The slowdown is almost entirely from the 'operator' rule; if I comment it out:

Average Lexing Time [s]:        0.0021991

I think a roughly 340x slowdown is a little excessive.

rnpnr commented 1 year ago

By the way, another edge case, which causes half the line to be detected as a comment (presumably because the `#` in `${#}` matches as the start of a comment), is:

[ ${#} -lt 3 ] && ...
orbitalquark commented 1 year ago

Thanks for the performance numbers. It may be time to just axe function calls in patterns, as they're way too expensive. The same was true for HTML: https://github.com/orbitalquark/scintillua/blob/ead19010d5390b1ef20bbc176793fc2b435b183a/lexers/html.lua#L29-L40
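The cost difference is easy to demonstrate with a toy comparison (not from the repo): a pure LPeg pattern runs entirely inside the LPeg VM, while a match-time function drops back into Lua at every position it is tried.

local lpeg = require('lpeg')
local P, R = lpeg.P, lpeg.R

-- Pure pattern: words separated by spaces, matched entirely in the VM.
local pure = (R('az')^1 * P(' ')^0)^0

-- Match-time function: a Lua call at every position, plus the kind of
-- prefix copy in_expr() performs.
local slow = ((R('az') + ' ') * P(function(input, index)
  local _ = input:sub(1, index) -- simulate in_expr()'s prefix copy
  return index
end))^0

local subject = string.rep('word ', 10000)

local t = os.clock(); pure:match(subject)
print('pure pattern  [s]:', os.clock() - t)
t = os.clock(); slow:match(subject)
print('match-time fn [s]:', os.clock() - t)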

orbitalquark commented 1 year ago

I've disabled conditional and arithmetic operator highlighting here: https://github.com/orbitalquark/scintillua/commit/5802bd4b04eae25368b227c985544a074407091a

I've fixed ${#} highlighting here: https://github.com/orbitalquark/scintillua/commit/0730c5ebd9261dfc4273b4501acfbab730999322

rnpnr commented 1 year ago

Thanks! I will include those in the next release of vis.

mcepl commented 9 months ago

So, this should probably be closed, right?

orbitalquark commented 9 months ago

Yeah, I guess so, thanks.