thanks @JonathanLorimer! yes, gcc
First, a disclaimer: this is not exactly the area I work on, so please take whatever I say with a grain of salt! Having glanced over the source code, here are my 2 cents.
Generally, `std::function` is very heavyweight. In some cases, compilers can optimize it away, but often they don't bother, because it requires lots of engineering effort and analysis for (usually) little performance benefit. So, unfortunately, I think it will be difficult to get good performance using `std::function`. There are two options I can think of, if you want to keep a functional style:
I hope this is helpful!
thanks a lot, I will take a close look at those!
Here are some pretty simple changes that halve the time taken by the scanner: https://github.com/tree-sitter/tree-sitter-haskell/compare/master...414owen:faster-scanner-demo?expand=1
I'm going to keep going, replacing combinators with first-order logic, and see what the results look like.
that is crazy
while performance is now pretty good on medium files, trying to insert code into `base`'s `Data.Map.Internal` (at >4k lines) is still almost impossible.
however, the question that is still unanswered is whether the apparent reparsing of the entire file on each keystroke is the result of incorrect design of this grammar or an nvim thing (how does this behave in helix, @414owen?)
Performance in helix is also terrible when editing `Data.Map.Internal`.
maybe incremental parsing only works reliably when the scanner is simple :thinking:
@maxbrunsfeld I don't suppose we could get your expert opinion here?
The state of affairs:
Does this indicate something's wrong? Are there steps that make the scanner incremental, that we might have missed?
See https://github.com/tree-sitter/tree-sitter-haskell/issues/41#issuecomment-950424044 for a more in-depth hypothesis.
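For reference, the scanner-side hooks that make incremental parsing possible are the external scanner entry points; the signatures below are paraphrased from the tree-sitter documentation (not copied from this repo's scanner). Tree-sitter snapshots the scanner state through `serialize` at each external token and restores it through `deserialize` when re-parsing after an edit:

```cpp
#include <tree_sitter/parser.h>

extern "C" {
  void *tree_sitter_haskell_external_scanner_create();
  void tree_sitter_haskell_external_scanner_destroy(void *payload);
  // Must copy the scanner state into `buffer` and return the number of
  // BYTES written (at most TREE_SITTER_SERIALIZATION_BUFFER_SIZE).
  unsigned tree_sitter_haskell_external_scanner_serialize(void *payload,
                                                          char *buffer);
  // Restores state from a buffer previously produced by `serialize`.
  void tree_sitter_haskell_external_scanner_deserialize(void *payload,
                                                        const char *buffer,
                                                        unsigned length);
  bool tree_sitter_haskell_external_scanner_scan(void *payload, TSLexer *lexer,
                                                 const bool *valid_symbols);
}
```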
Also ran into this today, for this file.
my impression so far is that since `std::function` is an object that stores all of its closure's captured variables, and most of those variables are again functions, and all of those functions are stack-allocated in other parser objects, there's just a lot of copying and allocation going on, especially when, as you noted, the parsers have value parameters like `Symbolic::type` and the current indent. `std::function` is probably not all that suited for functional programming
No it's not, unfortunately :disappointed:. `std::function` has a lot of overhead compared to a normal function call. There's the overhead of malloc, but you also have "pointer-chasing code", which is bad for cache hits.
I would not write the C++ in a functional style; I'd go for an imperative/procedural solution with mutable state (even though it pains me to say so...). Try to avoid malloc completely and resort to variables on the stack (or pre-allocated memory).
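To make that concrete, here is a minimal sketch of the contrast (entirely hypothetical names, not the actual scanner code): the combinator version heap-allocates closures that capture other closures, while the imperative version keeps everything in stack variables and direct calls.

```cpp
#include <functional>

// Combinator style: each std::function may heap-allocate its closure, and
// composing parsers copies the captured std::functions again.
using Parser = std::function<bool(const char *&)>;

Parser char_(char c) {
  return [c](const char *&in) {
    if (*in == c) { ++in; return true; }
    return false;
  };
}

Parser seq(Parser a, Parser b) {
  // The returned closure owns copies of `a` and `b`: more allocation,
  // plus an indirect call per invocation (bad for the cache).
  return [a, b](const char *&in) { return a(in) && b(in); };
}

// Imperative style: plain control flow, state on the stack, no allocation.
bool parse_arrow(const char *&in) {
  if (in[0] == '-' && in[1] == '>') { in += 2; return true; }
  return false;
}
```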
But it looks like @414owen is already on the way to improving the current parser. :smiley: If you need a reviewer, let me know.
@luc-tielen The imperative scanner changes work, and are ready for review. There are a few things I still want to do (e.g. make state global rather than passing it around), but yeah, any tips you have would be appreciated.
Just wanted to say thanks to @414owen, @luc-tielen, and @tek for working to improve this tree-sitter lib. I am really excited to switch this back on.
@414owen I reviewed all your changes. I liked your approach of tiny commits. :) Unfortunately I was a little too slow and @tek already merged in your PR :P. Could you take a look at the comments (and maybe submit a part 3 PR?).
sorry :grimacing:
No worries! These changes are already a big improvement (and I didn't see anything wrong with the new code).
The way that I'd debug this is to set up a small example file and parse it using Tree-sitter's `-D`/`--debug-graph` argument, which generates a complete report of the parsing process. Also, Tree-sitter allows you to simulate an edit to the file from the command line, and re-parse the file after the edit.
Create a small example file with some declarations sampled from `Map/Base.hs`:

```haskell
-- test.hs
elemAt :: Int -> Map k a -> (k,a)
take :: Int -> Map k a -> Map k a
drop :: Int -> Map k a -> Map k a
```
Parse the file from the command line, generating debug output (requires that `dot` from the `graphviz` package is present on your `PATH`):

```sh
tree-sitter parse test.hs --debug-graph
```
From this graph, you can see when Tree-sitter is processing an ambiguity, because the parse stack will be "forked".
Re-parse the file incrementally after inserting the character 'x' at the beginning of line 2. The syntax for the `--edit` argument is `position bytes_removed string_inserted`, where `position` can be either a byte offset or a row and a column separated by a comma. Here I use the latter:

```sh
tree-sitter parse test.hs --edit '2,0 0 x' --debug-graph
```
My guess is that the reason for the poor incremental parsing performance is the ambiguity in the grammar, not the scanner. But I'm not certain.
ah, I never thought of looking at the graph when doing an edit, thanks for the tip
Hmm, on the recomputation side, if I'm interpreting it correctly, this line returns the number of elements copied, rather than the number of bytes (as the docs suggest).
I realise that we're the main consumers of the number returned, in `deserialize`, but I'm wondering if our state gets corrupted...
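A hypothetical reduction of that failure mode (not the actual scanner code; tree-sitter expects `serialize` to return the number of bytes written, and passes that same number back to `deserialize` as `length`):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct State {
  std::vector<uint32_t> indents;  // e.g. a layout/indentation stack
};

unsigned serialize(const State &state, char *buffer) {
  size_t count = state.indents.size();
  std::memcpy(buffer, state.indents.data(), count * sizeof(uint32_t));
  // Bug: `return count;` would report the element count, so tree-sitter
  // would later hand `deserialize` only a quarter of the state.
  return count * sizeof(uint32_t);  // correct: the byte count
}

void deserialize(State &state, const char *buffer, unsigned length) {
  const auto *begin = reinterpret_cast<const uint32_t *>(buffer);
  state.indents.assign(begin, begin + length / sizeof(uint32_t));
}
```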
:grimacing: that sounds plausible. curious that this hasn't caused any serious errors
Using the `--debug-graph` flag, I see a bunch of

```
cant_reuse_node_is_fragile tree:function
cant_reuse_node_is_fragile tree:_funlhs
cant_reuse_node_is_fragile tree:_funvar
cant_reuse_node_is_fragile tree:_fun_name
...etc
```
One for every function in my test file.
@maxbrunsfeld is "fragility" documented anywhere?
If a rule is fragile, are all of its parent rules marked as fragile too?
Is there a way to get a list of all fragile grammar rules?
There are a couple of reasons that a node can be marked as fragile. Certain parse states are statically known to be fragile, because of precedence usage. But mainly, nodes are also marked as fragile if they are created while the parse stack is forked due to a conflict. I think that conflicts are likely the culprit here.
oh yeah there are lots of conflicts in the grammar
It's generally fine for there to be a lot of conflicts: usually, the "fragile" nodes that are created during conflicts are fully contained within some more stable structure, so that a lot of the syntax tree remains reusable.
I think it becomes a problem when the conflicts commonly occur at the top-most level of the source file, so that there are no "non-fragile" nodes that can be reused. If someone wants to work on improving this, I would suggest looking into which conflicts arise at the top-most level, when parsing the outer structure of a declaration or a function.
If there is some specific Haskell language extension that is causing these conflicts, it may be worthwhile to scale back the support for that extension, in the case of top-level constructs, so that we can reliably get incremental parsing at the topmost level of a source file.
template haskell introduces top-level splices, which account for almost half of all conflicts. unfortunately, this is a very common feature; I don't think it would be feasible to deactivate it :disappointed:
Ah yeah, I can see how that would cause top-level conflicts. I'm not sure what to do about that. What percentage of Haskell source files use top-level splices without the `$()` syntax? Unfortunately, it seems like we may have to choose between either supporting that language feature, or getting usable performance in text editors like Neovim.
in my code, it's usually maybe 20%, not sure. almost all of those are of the shape

```haskell
someFunc ''SomeType
```

so I would speculate that we could support this highly specialized variant as an unambiguous variant of the type signature rule without losing too much, hoping that the other conflicts have much less of an impact. What I don't understand is what that would mean for files containing unsupported TH: can we ensure that it gets skipped without poisoning the rest of the file?
I like that idea.
> What I don't understand is what that would mean for files containing unsupported TH: can we ensure that it gets skipped without poisoning the rest of the file?
I don't think it would ruin the parsing of the whole remainder of the file. I think you'd often get fairly small `ERROR` nodes in the vicinity of the splice, but other parts of the file would parse fine. I'm not 100% positive how it would play out with this grammar, though.
looking at those TH conflicts, I see that there's also:

```js
[$.signature, $.pat_name],
```

which is actually disambiguating top decl signatures from equations. so this might just as well be responsible for the problem. however, just because that's the best I could do after my month of iterating on the grammar doesn't mean we can't find a conflict-free version of this :)
I also have the impression that this is a question of the trade-off between conflict freedom and precise semantic naming of top-level nodes: you could imagine a tree that starts with `(top_level_initial_varid` instead of `(top_splice` or `(signature`, and then branches based on what follows
performance is now stellar!
Good morning,
I like tree-sitter-haskell very much, but it seems to slow down considerably once a file passes a certain number of characters. I don't actually know if the cause is the file's pattern complexity or something else, but this is very penalizing...
Here's an example of a slow file if you want to reproduce it:
Configuration:
Thank you for your help!