tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.
MIT License
152 stars 36 forks source link

Native and WASM parsers behave differently #69

Open wenkokke opened 2 years ago

wenkokke commented 2 years ago

For instance, if we parse examples/postgrest/test/Main.hs:

In this case, the problem seems to be absent module headers.

414owen commented 2 years ago

That's interesting, you can try running a -DDEBUG build, and see what the first few tokens the scanner produces are?

wenkokke commented 2 years ago

Not quite sure how to run a -DDEBUG build? You should be able to run these tests yourself, though. Lemme dig up the commands.

wenkokke commented 2 years ago

Check out #68 and run npm run examples-wasm. You'll need Node and Emscripten.

414owen commented 2 years ago

I'll give it a go.

I've documented -DDEBUG in enable scanner debug output here

414owen commented 2 years ago

Parsing examples/postgrest/test/Main.hs passes if you delete either both let statements, or the spec $ do ... block

414owen commented 2 years ago

I've created a list of the (smallest to largest) files that fail to parse under wasm:

{ for i in examples/**/*.hs; do out="$(2>&1 ./script/tree-sitter-parse.js $i)"; if echo "$out" | grep 'Parse error' > /dev/null; then wc -l "$i"; fi; done } | sort -n | tee list

output: here

The first file passes when you delete the -> in the GADT, or the type signature of test, or any two comment lines... I'll have a proper look this evening.

If you can spot anything that those files have in common, and others don't, let me know

maxbrunsfeld commented 2 years ago

One likely reason for the behavior difference is that tree-sitter’s get_column function currently returns a byte count, not a character count. The JavaScript bindings encode strings as UTF16, so most characters are 2 bytes instead of 1, as in UTF8.

The get_column method is still a bit experimental, and we’ve talked about changing it back to returning a character count. In the meantime, scanners should make sure to not make assumptions about the absolute values that method returns, but instead just compare the values to each other.

Not sure if that’s the problem here, but wanted to raise it, since it’s gotcha currently, and this is one of the few grammars affected.

414owen commented 2 years ago

Thanks @maxbrunsfeld, that sounds very plausible.

I wonder if I ~halve~ double the indentation on those files whether they'll break with the native parser...

edit they didn't

ahelwer commented 2 years ago

It's about time I get off my butt and write good tests for my codepoint-based get_column tree-sitter fork so it can be merged upstream.

wenkokke commented 2 years ago

I can confirm that this is mostly fixed using the latest version, the only file that doesn't parse using the web assembly bindings is <examples/haskell-language-server/test/testdata/format/BrittanyCRLF.formatted_range.hs>.

maxbrunsfeld commented 2 years ago

Ok great! If anyone gets a chance, it'd be super useful to extract from that file a minimal example of how the WASM and native parsers behave differently. At this point, if they still behave differently, that seems most likely to be a bug in Tree-sitter.