Improve bat startup speed

The startup speed of bat is currently (v0.15) around 50 ms, which is on (or even past) the edge of being noticeable by humans.

It would be great if this situation could be improved.

The startup speed of bat can be measured with hyperfine:

hyperfine \
  --warmup 10 \
  --export-markdown bat_startup.md \
  'bat --no-config --color=always'

On my laptop, this results in:

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`bat --no-config --color=always`	45.3 ± 0.7	44.0	47.4	1.00

If we use perf to get a profile

perf record --call-graph dwarf bat --no-config --color=always < /dev/null

we can see that most of the time is spent deserializing the stored syntaxes and themes via bat::assets::assets_from_cache_or_binary.

There are several ideas that come to mind which could improve the situation:

Somehow parallelize the deserialization process (maybe by splitting up the binaries into smaller chunks. After all, they are made out of ~100 different syntaxes and themes)
Only load syntaxes and themes "on demand". If syntaxes and themes were serialized separately, we might be able to load them on demand. It's probably not as easy as it sounds, because syntaxes are interlinked (code blocks in markdown files could require the loading of all kinds of syntaxes, for example).
Try to look into ways to improve the deserialization speed.
The assets are currently stored in compressed form. I think I have tried storing them in uncompressed form once (and didn't see any performance improvements), but it might be worth re-evaluating that.

We should also validate that we are indeed CPU bound here. The bat binary is quite large, due to the fact that all syntaxes and themes are included within the binary. It might take some time to simply load that from disk (or cache).

Adding some profiling data from MacOS Mojave (with a SSD):

Note: The numbers are going to be off (i.e. about 2x worse) due to the added overhead from running a profiler.

Build: RUSTFLAGS=-g cargo build --release Command: target/release/bat --no-config --paging=never empty.cs

Total:

Syntax Loading:

Disk Latency:

Disk Usage:

Allocations:

It looks like around 31% of the time spent starting up was loading the executable image and initializing it (on MacOS), and around 63% of the total time (95% of main()) was parsing the syntax set, with 42% (60% of main()) being spent deserializing data using serde. I don't quite know where the extra 26% went though, unfortunately.

Meanwhile (based on the disk usage graph), only 2 milliseconds were spent paging in the executable.

I'm also inclined to say that the size of the executable/time spent paging isn't too much of an issue for us currently. 2 milliseconds loading the serialized asset data is significantly less than the 108 milliseconds spent deserializing it.

Additional things to note: Although I passed --no-config as a command line flag, bat still opened the config file. Was that intended?

I have made some progress on improving bat startup time through some prototyping work, and I would like to share that prototype here. I would love to get some feedback on it. Especially critisism!

The code for the prototype is here: https://github.com/Enselic/bat/compare/6ef2bb3283e1ba5f41316...Enselic:startup-speed-prototype-v1?expand=1

First, allow me to present some performance numbers from my (low-end) machine:

File under ./tests/syntax-tests/source	bat master	my prototype
example.zig	108.5 ms	25.8 ms
example.xml	102.1 ms	45.1 ms
example.md	123.1 ms	57.5 ms

One major limitation of the prototype is that I only made it work for --language, and not for file extension or first-line. So for example, I use this command to benchmark .xml:

hyperfine 'bat --no-config --pager=never --color=always ./tests/syntax-tests/source/XML/example.xml --language xml'

It should be pretty straightforward to make it work for e.g. file extensions too, but I don't want to spend time on it until I have gotten an external sanity check on my overall approach. For now, I simply fallback to loading the full SyntaxSet in these cases.

We can see from the benchmark that small syntaxes such sa Zig, which only contains the Zig Syntax Definition, is pretty fast. Medium sized SyntaxSet such as XML (contains ["xml", "xsd", "xslt", "tld", "dtml", "rng", "rss", "opml", "svg"]) is slower, and even larger ones such as Markdown is even slower. But still a nice improvement I would say!

Some notable positive properties of the prototype:

All non-ignored cargo test tests passes.
Most (but not all) syntax regression tests passes.

Some notable negative properties:

The stripped release binary size grows from 4,7M to 7,7M. Can probably be optimized quite a bit though.
Only syntaxes embedded in the binary works, but should be straightforward to make it work for locally built cache too. I have prepared the code a bit for that.
I disregard any public API stability concerns or other backwards compatibility (e.g. asset metadata)

So, how does the prototype work? Roughly like this:

There are are two new assets. independent_syntax_sets.bin contains concatenated binary representations of independent syntax sets, and independent_syntax_sets_map.bin contains a small lookup datastructure so they can be found again.
The new assets are generated by analyzing dependencies between SyntaxDefinitions and creating independent SyntaxSets. By loading only the necessary linked syntax definitions, startup time is greatly improved.

(I realize I should write a lot more details of how it works, but I unfortunately don't have the time for that right now. The code is there for anyone to poke around with though :))

My current plan forward is to take the code from the prototype and turn it into several small and independent PRs that are easy to review and understand one by one, to the extent that is possible and makes sense.

Just wanted to elaborate a bit on the next steps I intend to take on the prototype. I will keep updating this comment.

I plan on doing the following sequential steps:

[x] Get #1747 merged
[x] Get #1760 merged
[x] Get #1762 merged
[x] Get #1776 merged
[ ] Get #1787 merged
[ ] Create a PR that loads minimal syntax sets for all code paths, not only with --language
[ ] Create a PR that loads minimal syntax sets for all languages, not only languages with a single SyntaxReference. We probably need to find a way to reduce the growth of the binary size first, though.
[ ] Make building minimal syntax sets work with include_integrated_assets
[ ] Get rid of syntaxes.bin when minimal_syntaxes.bin reliably works for all syntaxes.

sharkdp / bat

Improve bat startup speed #951