Thanks for the thoughtful report. I'll investigate this when I have some time.
Here is a patch that gives a minor improvement but doesn't completely solve the problem:
```diff
diff --git a/lexers/bash.lua b/lexers/bash.lua
index f1365ae..6e6ae90 100644
--- a/lexers/bash.lua
+++ b/lexers/bash.lua
@@ -54,11 +54,11 @@ local op = S('!<>&|;$()[]{}') + lpeg.B(lexer.space) * S('.:') * #lexer.space
 local function in_expr(constructs)
 	return P(function(input, index)
-		local line = input:sub(1, index):match('[^\r\n]*$')
+		local data = input:sub(1, index)
 		for k, v in pairs(constructs) do
-			local s = line:find(k, 1, true)
+			local s, se = data:find(k, 1, true)
 			if not s then goto continue end
-			local e = line:find(v, 1, true)
+			local e = data:match('[^\r\n]*$', se):find(v, 1, true)
 			if not e or e < s then return true end
 			::continue::
 		end
@@ -71,7 +71,7 @@ local shell_op = '-o'
 local var_op = '-' * S('vR')
 local string_op = '-' * S('zn') + S('!=')^-1 * '=' + S('<>')
 local num_op = '-' * lexer.word_match('eq ne lt le gt ge')
-local in_cond_expr = in_expr{[' [[ '] = ' ]]', [' [ '] = ' ]'}
+local in_cond_expr = in_expr{['[[ '] = ' ]]', ['[ '] = ' ]'}
 local conditional_op = (num_op + file_op + shell_op + var_op + string_op) * #lexer.space *
 	in_cond_expr
```
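Applied, `in_expr` reads roughly like this (the closing lines fall outside the hunk, so they are my reconstruction from context):

```lua
local function in_expr(constructs)
	return P(function(input, index)
		local data = input:sub(1, index)
		for k, v in pairs(constructs) do
			-- Cheap early exit: skip the per-line work entirely when the
			-- opener k does not occur anywhere in the data seen so far.
			local s, se = data:find(k, 1, true)
			if not s then goto continue end
			-- Look for the closer v only on the current (last) line.
			local e = data:match('[^\r\n]*$', se):find(v, 1, true)
			if not e or e < s then return true end
			::continue::
		end
		return nil -- reconstructed: fail the capture when no construct is open
	end)
end
```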
I've implemented the original suggestion in https://github.com/orbitalquark/scintillua/commit/96371094cf663ee7b14f60db1edb2964b3d4e156.
I'm not seeing any slowness on my end trying to parse the linked configure script. I suspect it's an implementation detail as to how text is fed to Scintillua. In my case, I'm not feeding in individual lines, but chunks of text that include multiple lines. The second patch suggests using `local data = input:sub(1, index)`, but I think that would only work with a line-oriented approach. My chunk approach would end up looking at the first line in a multi-line chunk and highlight incorrectly as a result.
I am open to further improvements.
> I'm not feeding in individual lines, but chunks of text that include multiple lines
That's exactly what we do as well. Actually we pass very large chunks, up to 32K. I think the main problem is that there used to be a cache and now there isn't. So every time we call `lexers.load(...)` we have to re-lex everything that we already lexed. We could check smaller chunks, but that means we have to lex much more often than before.
> but I think that would only work with a line-oriented approach
Not really. That patch checks if the data has `k` before going and searching for the end of a line. If the data doesn't have `k` then there isn't a reason to check that both `k` and `v` are on the same line. It's just a small optimization.
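As a self-contained illustration of that early exit (a hypothetical `is_open` helper, not the lexer's actual code):

```lua
-- Sketch of the early-exit idea: only pay for the per-line scan when the
-- opener is present at all. Names here are illustrative, not lexer API.
local function is_open(data, opener, closer)
	-- Cheap plain-text search over everything seen so far.
	local s = data:find(opener, 1, true)
	if not s then return false end -- opener absent: skip the line-local work
	-- Only now extract the current line and look for the closer on it.
	local line = data:match('[^\r\n]*$')
	return not line:find(closer, 1, true)
end

print(is_open('if [ -n "$x"', '[ ', ' ]'))   --> true  (still inside the test)
print(is_open('if [ -n "$x" ]', '[ ', ' ]')) --> false (test already closed)
```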
Are there any plans to bring back the global lexer cache? I think scintillua is less robust without it. For example someone made a spellchecking plugin for vis that would wrap the lexer such that it would search for spelling mistakes only in comments and strings. That is no longer possible if we get a new lexer every time we call `lexer.load()`.
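For illustration, such a wrapper might have looked roughly like this (a hypothetical sketch; `spellcheck()` is a stand-in name, not the plugin's real API). It only works if `lexer.load()` hands every caller the same cached object:

```lua
-- Hypothetical sketch: wrap the shared lexer's lex() so extra processing
-- runs on comment/string tokens. Only works if lexer.load('bash') returns
-- the same cached object that the editor itself uses for highlighting.
local lexer = require('lexer')
local l = lexer.load('bash')
local orig_lex = l.lex

function l:lex(data, ...)
	local tokens = orig_lex(self, data, ...)
	-- As I understand the Scintillua convention, the result alternates
	-- token names and end positions.
	local pos = 1
	for i = 1, #tokens, 2 do
		local name, e = tokens[i], tokens[i + 1]
		if name == 'comment' or name == 'string' then
			-- spellcheck(data:sub(pos, e - 1)) -- stand-in for the real work
		end
		pos = e
	end
	return tokens
end
```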
> > I'm not feeding in individual lines, but chunks of text that include multiple lines
>
> That's exactly what we do as well. Actually we pass very large chunks, up to 32K. I think the main problem is that there used to be a cache and now there isn't. So every time we call `lexers.load(...)` we have to re-lex everything that we already lexed. We could check smaller chunks, but that means we have to lex much more often than before.
Forgive me, because I'm not very familiar with vis, but couldn't you cache both the lexer you're using for a particular buffer/document as well as the last correctly styled position? Then after buffer/document edits, you'd invoke the cached lexer either at the last correctly styled position or the edit position (whichever is first). Scintilla/SciTE effectively does this (but backtracks to the start of the edited line), and I think Scintillua kind of assumes users do something like this too.
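Sketched out, that bookkeeping might look something like this (all names here are illustrative; none of this is vis or Scintillua API):

```lua
-- Illustrative sketch only: cache one lexer per buffer plus a watermark
-- for the last correctly styled position. get_state/on_edit/restyle,
-- buf.syntax and buf:content() are made-up names.
local lexer = require('lexer')
local states = {}

local function get_state(buf)
	local st = states[buf]
	if not st then
		st = {lexer = lexer.load(buf.syntax), styled_to = 1}
		states[buf] = st
	end
	return st
end

-- On every edit, pull the watermark back to the edit position if needed.
local function on_edit(buf, edit_pos)
	local st = get_state(buf)
	st.styled_to = math.min(st.styled_to, edit_pos)
end

-- Re-lex from the watermark (i.e. the edit or the last correctly styled
-- position, whichever came first) instead of from the start of the file.
local function restyle(buf, view_end)
	local st = get_state(buf)
	local tokens = st.lexer:lex(buf:content(st.styled_to, view_end))
	st.styled_to = view_end
	return tokens
end
```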
> > but I think that would only work with a line-oriented approach
>
> Not really. That patch checks if the data has `k` before going and searching for the end of a line. If the data doesn't have `k` then there isn't a reason to check that both `k` and `v` are on the same line. It's just a small optimization.
Applying your patch fails a unit test here: https://github.com/orbitalquark/scintillua/blob/ead19010d5390b1ef20bbc176793fc2b435b183a/tests.lua#L1153-L1158
The '/' in '/foo' is incorrectly highlighted as an arithmetic operator because (I think) "$((" was found in the lexed chunk, and the current line does not contain an ending "))", so the check succeeds. Without the patch, the current line does not contain a "$((", so '/' would not be considered an arithmetic operator.
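For what it's worth, the failing shape can be reproduced with something like this (my paraphrase of the input, not the actual test fixture):

```lua
local lexers = require('lexer')
local bash = lexers.load('bash')
-- '$((' occurs earlier in the chunk, and the patched check looks for the
-- closing '))' only on the current line, so the '/' in '/foo' on the
-- second line is tagged as an arithmetic operator even though the
-- expansion on line one was already closed.
local tokens = bash:lex('echo $((1 + 2))\nls /foo\n')
```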
> Are there any plans to bring back the global lexer cache? I think scintillua is less robust without it. For example someone made a spellchecking plugin for vis that would wrap the lexer such that it would search for spelling mistakes only in comments and strings. That is no longer possible if we get a new lexer every time we call `lexer.load()`.
I am hesitant to bring back the lexer cache because I am concerned about lexer integrity. If a developer has their application use the cache, a user could "poison" that cached lexer, affecting all other uses of it. I'm not necessarily thinking of malicious uses, but weird things could happen that would be hard to debug. Your example of a spellchecking wrapper could cause unintended side effects.
Perhaps I am being paranoid and this is a non-issue. I am open to the idea, but not sold on it yet.
> Forgive me, because I'm not very familiar with vis, but couldn't you cache both the lexer you're using for a particular buffer/document as well as the last correctly styled position?
For now I will re-add the caching behaviour locally to vis. Personally I think it makes scintillua much more robust and would be better suited to keep around here, but that's up to you. To me it's on the user if they modify parts of the lexer and break something.
As for storing the last correctly styled position, it's a little more complicated in vis. vis' most powerful editing feature is using regexps to modify many parts of the file at once. After such an edit it's much easier from our end to just re-lex everything that is in view plus extra data after the view, rather than selectively lex parts of the document. This works perfectly fine when the lexer is fast. This is still the case for all the lexers I have tested besides the `bash` lexer.
I wrote the following test to demonstrate this:
```lua
local lexers = require('lexer')

-- Read the whole configure script into memory.
local fp = io.open('configure', 'r')
local data = fp:read('*a')
fp:close()

-- Load and lex ten times (as vis would on repeated loads) and average.
local total_time = 0
for i = 1, 10 do
	local stime = os.clock()
	local l = lexers.load("bash", nil, false)
	local toks = l:lex(data)
	total_time = total_time + os.clock() - stime
end
print("Average Lexing Time [s]: ", total_time / 10)
```
Again I'm using vis' `configure` script simply because it is a fairly typical non-trivial shell script. As for the results:
```
# Old Lexers
Average Lexing Time [s]:  0.0013101
# New Lexers
Average Lexing Time [s]:  0.4444334
```
The slowdown comes entirely from the `operator` rule. If I comment it out:

```
Average Lexing Time [s]:  0.0021991
```
I think a roughly 340x increase is a little excessive.
By the way, another edge case, which causes half the line to be detected as a comment, is:

```sh
[ ${#} -lt 3 ] && ...
```
Thanks for the performance numbers. It may be time to just axe function calls in patterns, as they're way too expensive. This was the case for HTML: https://github.com/orbitalquark/scintillua/blob/ead19010d5390b1ef20bbc176793fc2b435b183a/lexers/html.lua#L29-L40
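To make the cost concrete, here is a minimal LPeg example (not lexer code) showing that a match-time function runs once per candidate position in the subject:

```lua
local lpeg = require('lpeg')

-- A match-time capture: this Lua function is invoked every time the
-- pattern is attempted, once per candidate position in the subject.
local calls = 0
local probe = lpeg.P(function(input, index)
	calls = calls + 1
	return true -- succeed without consuming any input
end)

-- Run `probe` before every character of a 1000-byte subject.
local p = (probe * 1)^0
p:match(('x'):rep(1000))
print(calls) --> 1001
```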
I've disabled conditional and arithmetic operator highlighting here: https://github.com/orbitalquark/scintillua/commit/5802bd4b04eae25368b227c985544a074407091a
I've fixed `${#}` highlighting here: https://github.com/orbitalquark/scintillua/commit/0730c5ebd9261dfc4273b4501acfbab730999322
Thanks! I will include those in the next release of vis.
So, this should probably be closed, right?
Yeah, I guess so, thanks.
One very common shell idiom is to have a test on a single line, e.g. something of the form `[ ... ] && ...`. Currently this will be missed since the pattern requires a space before the match (i.e. `if [ ...`). This can be trivially fixed by dropping the leading-space requirement from the patterns, as in the second hunk of the patch above.

The slowness problem arises when there are many places in the input data where the `in_expr()` function needs to be called, for example when the input data is the configure script from vis. I think the main problem is that using `match` to split up a block of input data into lines is really inefficient. I'm not sure how to improve it while maintaining functionality. Perhaps there is a way of using `lexer.range()` to make it more efficient.
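One speculative direction (plainly an assumption on my part, not anything the lexer does today): cache the offset of the current line start between calls, so each call only scans the bytes added since the previous one instead of re-running `match` over `input:sub(1, index)`:

```lua
-- Speculative sketch: incrementally track the current line start. Resets
-- (and loses the benefit) whenever lpeg backtracks to an earlier position.
local last_input, last_pos, last_start = nil, 0, 1

local function line_start(input, index)
	if input ~= last_input or index < last_pos then
		last_input, last_pos, last_start = input, 0, 1
	end
	-- Scan only the new bytes for the last newline, if any.
	local nl = input:sub(last_pos + 1, index):match('.*()\n')
	if nl then last_start = last_pos + nl + 1 end
	last_pos = index
	return last_start
end

-- Usage inside a match-time function:
-- local line = input:sub(line_start(input, index), index)
```

Whether this is actually faster would need measuring, and it interacts badly with heavy backtracking; it is only meant to show the direction.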