sharkdp / bat

A cat(1) clone with wings.
Apache License 2.0
49.36k stars 1.25k forks source link

slow output, with pauses, with c# source code #677

Closed vvuk closed 4 years ago

vvuk commented 5 years ago

Something strange is going on. On a 1500-line .cs file, bat takes ~7s to list the file. --pager none, both with and without -p (not much difference with -p). I can see a pause after every1-2 dozen lines are output. --colors never also doesn't make a difference (bench via hyperfine):

Command Mean [ms] Min [ms] Max [ms] Relative
cmd /c type EntityComponentStore.cs 86.1 ± 7.3 75.9 101.1 1.0
bat --pager none EntityComponentStore.cs 7734.8 ± 586.6 6857.4 8700.1 89.9
bat --pager none -p EntityComponentStore.cs 6982.9 ± 464.3 6290.5 7889.4 81.1
bat --color never --pager none EntityComponentStore.cs 7155.5 ± 460.3 6586.1 7804.1 83.1
bat --color never --pager none -p EntityComponentStore.cs 7053.2 ± 492.3 6409.1 8076.1 82.0

bat 0.12.1, msvc build on Windows, in a regular cmd.exe console window.

I'd expect some slowdown but it seems like something buggy is happening. If I do the same on a cpp file, also ~1500 lines:

Command Mean [ms] Min [ms] Max [ms] Relative
cmd /c type ApiDispatchers.cpp 98.4 ± 10.0 83.6 115.9 1.0
bat --pager none ApiDispatchers.cpp 768.5 ± 88.2 677.9 995.4 7.8
bat --pager none -p ApiDispatchers.cpp 683.8 ± 128.9 549.4 959.0 6.9
bat --color never --pager none ApiDispatchers.cpp 480.1 ± 46.7 440.2 584.2 4.9
bat --color never --pager none -p ApiDispatchers.cpp 447.6 ± 42.5 414.6 542.0 4.5

Which is much more in line with what I'd expect. Is there something different about how the syntax highlighting for C# is handled? Is it all regexp or something else expensive based?

vvuk commented 5 years ago

This isn't the C# file I used for benchmarking, but it's a long one (~3k loc) and behaves the same. It doesn't seem to be file dependent, just more obvious on longer files. CSharpCompilation.cs

vvuk commented 5 years ago

I should say both of these are with --show-output to hyperfine. Without it, things are very different, unless I force --color always:

Command Mean [ms] Min [ms] Max [ms] Relative
cmd /c type EntityComponentStore.cs 14.6 ± 0.9 13.8 22.1 1.0
bat --pager none EntityComponentStore.cs 77.5 ± 1.4 75.8 83.1 5.3
bat --color always --pager none EntityComponentStore.cs 6209.7 ± 582.2 5519.8 7555.6 418.2
eth-p commented 5 years ago

I can confirm this bug on Windows 10 with bat 0.12.1. I can also confirm this on macOS.

The following behaviours have been identified with bat and the file provided:

Command Choppy
bat file.cs Yes
bat file.cs -pp Yes
bat file.cs --color=never Yes
bat file.cs --color=never > NUL No
bat file.cs --color=never --decorations=never Yes
bat file.cs --color=never --decorations=never > file.txt Yes
bat file.cs --color=always --decorations=always > file.txt Yes

Trying with a large file of a different language (e.g. the saved HTML from here), I don't notice this issue in any of the above cases.

If we consider the commands used for the C# file in the above table, the only times where the issue wasn't present is when the simple printer is used (i.e. when the highlighter isn't used at all).

Given the above, I agree that this appears to be an issue specifically with C# syntax highlighting. Since I'm not too familiar with the Syntect library, @sharkdp is likely going to need to be the one to figure out the reason why this is happening.

sharkdp commented 5 years ago

@vvuk Thank you for the detailed bug report!

If we consider the commands used for the C# file in the above table, the only times where the issue wasn't present is when the simple printer is used (i.e. when the highlighter isn't used at all).

Yes, exactly. Thank you for looking into this.

Given the above, I agree that this appears to be an issue specifically with C# syntax highlighting. Since I'm not too familiar with the Syntect library, @sharkdp is likely going to need to be the one to figure out the reason why this is happening.

There have been issues with the C# syntax in the past (e.g. https://github.com/trishume/syntect/issues/63), but this seems to be a new issue (we already include the updated C# syntax - at least on master, where the problem is still present).

This should probably be reported upstream. The same thing happens when using syncat, which is bundled with syntect.

@keith-hall Notifying you, just in case you are interested (please let me know if you are not, in which case I am sorry and will stop pinging you!)

keith-hall commented 5 years ago

Most likely it is another instance of a regex pattern used (perhaps often) in the C# syntax definition which has poor performance under the Oniguruma regex engine with many (perhaps simple) inputs. Probably the best way to start debugging this would be to play around in syntect to temporarily add some timers to see which patterns are problematic and then identify if there is a way to tweak those for better performance and make a PR to the sublimehq/Packages repository with those improvements. I will endeavor to do this when I am able to make time, but it may not be soon, so anyone else whom wishes to experiment/investigate: please do :)

otherwise, potentially relevant (currently open) upstream issues are:

vvuk commented 5 years ago

Ah, so I was going to look into this in syntect, however current master there (3.3.0) doesn't show the problem with syncat (this is on a good bit slower machine than the earlier benchmarks):

Command Mean [s] Min [s] Max [s] Relative
cargo run --example syncat C:\proj\tmp\CSharpCompilation.cs 2.122 ± 0.013 2.110 2.151 1.0
bat --color always --pager none c:\proj\tmp\CSharpCompilation.cs 12.697 ± 0.886 11.395 14.000 6.0

I'm not sure what version of syntect my bat was built with, but either this was fixed upstream already, or there's actually something that bat is doing that's causing the slowdown?

vvuk commented 5 years ago

Oh I forgot to mention -- the above was a debug build of syntec. A release build takes 500ms.

keith-hall commented 5 years ago

@vvuk just to check, are you testing syntect/syncat with the submodules as of the master branch on that repo, or have you updated them to match the version/state/commit which bat is using? (and rebuilt the syntaxset "dump"?)

vvuk commented 5 years ago

@keith-hall with the submodules as of the master branch.

 870081c447a61df93290932e307ba5d1890f7e8e syntect
 18ddb271179e118cfc2dd83abf88b915b7328a25 testdata/InspiredGitHub.tmtheme (v1.3.0)
 fa6b8629c95041bf262d4c1dab95c456a0530122 testdata/Packages (v3154-4-gfa6b8629)
 bcd6234b4f5f96d3fd27db079268b5757053072a testdata/Solarized (1.5.11)
 2703e93f559e212ef3895edd10d861a4383ce93d testdata/spacegray (1.3.2-1-g2703e93)

That syntect commit (870081c) is tagged as v3.3.0, which is what bat is using. I didn't confirm that they're using the exact same bits though, my bat build was pulling syntect from cargo. I'll rebuild locally for a sanity check.

I didn't rebuild the syntaxset (just cloned syntect, and did a cargo build --examples)

keith-hall commented 5 years ago

syntect v3.3.0 and the master branch are both referencing a 2 year old Packages submodule (where bat is now using a recent version), so likely some change to the C# syntax definition since then has caused this difference

vvuk commented 5 years ago

Definitely an issue with the C# syntax files. Using the C# syntaxes that bat is using (via -e -s), I'm getting the 12s times with syncat. There are a bunch of changes in the syntax file; I can attempt to narrow down which in particular is causing the problem.

vvuk commented 4 years ago

(I haven't had a chance to do the narrowing down unfortunately.)

For upstream reporting, should this be reported to the sublimehq/Packages repo, or to syntect? If sublime had the same performance issues I'd guess someone would have complained/noticed, so maybe this is a syntect issue?

sharkdp commented 4 years ago

I've looked into this today.

First, I tried to minimize the example file above. I found that lines like this:

return a(bbbbbbbbbbbbbbbbbbbbbbbbbbbb.c(dddddddddddddddddddddddddddddddddddd));

take a very long time to parse. This particular one takes ~500 ms and the parsing time depends on the length of the b…b and d…d strings (in a somewhat complex way).

Next, I took a look at the C#.sublime-syntax. The problematic pattern is in the line_of_code_in_no_semicolon context:

    - match: \((?=(?:[^,)(]*|\([^\)]*\))*,)
      scope: punctuation.section.group.begin.cs
      push:
        # [...]

My (completely non-expert) guess would be that the nested (something*)* pattern causes some kind of combinatorial explosion.

Maybe @keith-hall could take a look at this (only if you are interested, of course)? :blush:

If this pattern is disabled, C# files render reasonably fast:

bat CSharpCompilation.cs --color=always
Command Mean [s] Min [s] Max [s] Relative
master 9.555 ± 0.038 9.487 9.615 1.00
modified syntax 0.306 ± 0.006 0.297 0.319 1.00
sharkdp commented 4 years ago

If I enter this pattern into regex101 and use the appropriate part of the line above as a test string, regex101 actually says "Catastrophic backtracking has been detected".

https://regex101.com/r/O2AcN7/1

keith-hall commented 4 years ago

may I suggest changing that pattern to \((?=(?:[^,)(]|\([^\)]*\))*,), it has the same functionality but without the catastrophic backtracking. I'm not sure when I'll get time to open a PR on the sublimehq/Packages repo, so if this change works for you, feel free to beat me to it :)

(I proved that the C# syntax tests aren't adversely affected, nor is the performance in Sublime Text)

sharkdp commented 4 years ago

Fantastic, thank you very much for the fast response!

This leads to the same time (slightly faster here because I was using a user-specified cache-file above - here, the patched C# syntax is built into the bat binary):

Command Mean [s] Min [s] Max [s] Relative
bat CSharpCompilation.cs --color=always 0.256 ± 0.003 0.251 0.263 1.00

I'm going to open a PR in sublimehq/Packages.

Edit: done: https://github.com/sublimehq/Packages/pull/2331

sharkdp commented 4 years ago

This has been fixed in bat v0.14.