Very slow *editing* of large files when tree-sitter is used

kchibisov commented 2 years ago

Summary

Certain markdown files are very slow to edit. In particular when they have a lot of lists. For example a CHANGELOG.md file from the helix repo is a enough to trigger it.

The problem is likely in grammar here, since neovim with tree-sitter works fine on it.

While this particular problem could be solved with slightly optimized, but less accurate grammar. However the issue manifests itself on any large file of any other type, just requires much larger file than ~600LoC(I've used 50K LoC C++ that I have around).

To avoid delay and input lag in those particular scenarios it would make sense to either run tree-sitter only on a part of the buffer (I'm not familiar how it works) or update it after the typing ended. Since right now you have a latency of like 2 seconds on 50K lines file of C++, which is impossible to work with.

Reproduction Steps

Try to edit CHANGELOG.md from the helix repo or any large file.

Helix log

No response

Platform

Linux

Terminal Emulator

alacritty

Helix Version

helix 22.05 (3cced1e3)

MDeiml commented 2 years ago

Author of the markdown grammar here. It seems that at the moment language injection cannot be parsed incrementally. The markdown grammar uses injection a lot so it becomes quite apparent for it. This is what I get from line 846 here: https://github.com/helix-editor/helix/blob/8681fb6d9ee06610fac2baafb4ca34f93f9c7f4d/helix-core/src/syntax.rs#L841-L848 Notice how a layer can only be reused if ranges did not change. But this is often the case when editing a file. The test here could probably be laxer.

I didn't look to deep, but (from my knowledge of treesitter) any layer should be fine here to "reuse" as long as none is reused twice. So something akin to

Choosing the layer whose ranges "match best" in some way
Changing those ranges to the new ones

should work.

archseer commented 2 years ago

We actually follow tree-sitter-cli/atom logic to update all the markers according to user edits beforehand: https://github.com/helix-editor/helix/blob/8681fb6d9ee06610fac2baafb4ca34f93f9c7f4d/helix-core/src/syntax.rs#L644-L726=

So the injection layers are tracked reliably.

We also seem to only use HTML injections & code blocks: https://github.com/helix-editor/helix/blob/master/runtime/queries/markdown/injections.scm

MDeiml commented 2 years ago

Ah ok, I guess you are still using the old version of my parser then (the main branch). You could try updating to the split_parser branch. That one has some major performance improvements, but uses injection very extensively.

archseer commented 2 years ago

Ah, good point! I'll see if I can update us to the new branch 👍

asb commented 2 years ago

On an 8.4MiB markdown file (I keep daily notes in a single big document), helix 22.05 takes > 10s to open it and is used 1GiB of memory after doing so. I'm assuming this is tree-sitter related and the same issue described here, but let me know if you think not.

asb commented 2 years ago

The following script generates a ~5MiB file that takes >5 seconds to open with Helix. After opening, memory usage (reported via ps_mem) is ~660MiB.

#!/bin/sh

for i in $(seq 10000); do
  cat <<EOF >> big_markdown.md
# Title
## Subtitle
Some notes about things.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
in culpa qui officia deserunt mollit anim id est laborum

* A list
  * With a sub-item
* Item 2
* Item 3
* Item 4
* Item 5
  * Item 5.1

EOF
done

kchibisov commented 2 years ago

@asb if you disable grammar for markdown does that happen?

asb commented 2 years ago

@kchibisov I'm not sure what option to change to disable the grammar (I was trying out helix for the first time). Certainly, if mv big_markdown.md big markdown.txt and open the version of the file with a .txt extension, it opens instantly and helix consumes a more reasonable 18MiB RAM.

MDeiml commented 2 years ago

I think the simplest way to fix this is to only run the grammar for injected languages if they are visible (i.e. run it lazily).

Markdown is split into two grammars: block and inline. The block-level grammar is quite fast and produces the ranges on which the inline-level grammar is injected. So the computational cost mostly grows with the amount of inline content parsed.

I'd guess that the memory cost is also from the parsed inline-level trees, so that problem would also be solved in the same way, but I'd have to check that.

MDeiml commented 2 years ago

I'm not sure how to best implements something like this tough. Currently the parsing is handled by the Document impl, but for lazy parsing we need information from the Views that point to that document.

Yevgnen commented 2 years ago

Having trouble when open rust.org.gz 3000 line .org file and never finish. Renaming it to abc.xyz to avoid syntax highlight then the process is instant.

kirawi commented 2 years ago

I think the simplest way to fix this is to only run the grammar for injected languages if they are visible (i.e. run it lazily).

Markdown is split into two grammars: block and inline. The block-level grammar is quite fast and produces the ranges on which the inline-level grammar is injected. So the computational cost mostly grows with the amount of inline content parsed.

I'd guess that the memory cost is also from the parsed inline-level trees, so that problem would also be solved in the same way, but I'd have to check that.

I think that this could build upon the work in https://github.com/helix-editor/helix/pull/2857 to keep track of the nodes and lazily highlight injections.

MDeiml commented 2 years ago

Just wanted to mention #4146 here. Maybe that fixed some problems, but probably not all of them.

kchibisov commented 2 years ago

That indeed improved the situation and the editing of markdown is at least possibly with tree-sitter.

Large files still blow up helix leading to seconds in input lag. Not sure where it should be tracked.

midnightcodr commented 1 year ago

With helix 22.12 (34be71fb) I am seeing about 1 second delay per keystroke when doing either insert or search. The file I am editing is a 129MB mysql dump file with 150 very long lines.

The machine I am working on is a MacBook Pro M1 with 16GB of RAM.

dead10ck commented 1 year ago

Wondering if #6218 fixed this

archseer commented 1 year ago

It probably fixed or improved viewing the file, but insert mode will still be just as slow on large files.

archseer commented 1 year ago

The file I am editing is a 129MB mysql dump file with 150 very long lines.

Yeah both of these are an issue. You're parsing a very large file but performance is also not optimized on long lines. Or at least it wasn't before but maybe @pascalkuthe's rewrite improved things slightly.

pascalkuthe commented 1 year ago

The file I am editing is a 129MB mysql dump file with 150 very long lines.

Yeah both of these are an issue. You're parsing a very large file but performance is also not optimized on long lines. Or at least it wasn't before but maybe @pascalkuthe's rewrite improved things slightly.

Right now it shouldnt make much if a difference. Howver performance for huge lines will be improved in a future followup PR to #5420 where I will Implement line chunking. This was omitted from the MVP but the architecture lends itself well to this optimization (and was designed with it in mind, I even had a mostly finished implementation but the changes were accidently deleted by an overwrite from a second helix instance)

kchibisov commented 1 year ago

While my original issue is resolved with the markdown.

I'd think I'll leave a real world example of a big file that is well formatted and not some dump of the logging utility, since I've mentioned in original report a C++ file, but I haven't showed it.

So that's the file I've referred to originally wrt my C++ quote.

https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLowering.cpp

The :set-language text makes a normal editing experience(same perf as file of any other size).

mvdan commented 1 year ago

I use git commit --verbose, which causes the COMMIT_EDITMSG files to contain the full diff of what's being committed. At work, I just happened to commit changes weighing about ~400k lines of diff, all generated, and I experienced this slowness - each insert keystroke lagged by about a second. @kchibisov's suggestion of :set-language text works like a charm as a workaround, for now.

Still happening with 23.03, for what it's worth.

helix-editor / helix