Closed jrapin closed 1 year ago
Same pattern for this file as well (deadlocks as is, but not if we convert tabs into spaces): https://github.com/dLauraM/PROYECTO-01-MARTINEZ-DIANALAURA/blob/3381270416d92240ba9e5f6ffa246e70d26bb3c1/main.py
Thanks for the report! Could you try to remove parts of that file to get a minimal reproduction for this bug?
Hi @maxbrunsfeld , thank you for your answer. Based on the second example, the simpler (and pointless) code I found to bug was this:
bytecode = b'def main():\n\t\t\t\t\t\t\t\t\t\t\t\tif True:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 2:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 3:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n'
parser.parse(bytecode)
Confirming I also see similar failures on 0.20.8 while fuzz testing on tree-sitter-python
commit 62827156d01c74dc1538266344e788da74536b8a (base64 encoding: uwoJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCXsrYkRBAEhiQEEAfEBj
).
I have located the issue, which was introduced by tree-sitter version 0.20.7 (0.20.6 is good). py-tree-sitter version 0.20.1, which uses tree-sitter version 0.20.7, triggers this issue. In contrast, py-tree-sitter version 0.20.0, which uses tree-sitter version 0.20.2, does not have this issue.
This interested me, so I decided to bisect it as a start:
Log:
git bisect start
# status: waiting for both good and bad commits
# bad: [d0029a15273e526925a764033e9b7f18f96a7ce5] Avoid unused value warning from array_pop
git bisect bad d0029a15273e526925a764033e9b7f18f96a7ce5
# status: waiting for good commit(s), bad commit known
# good: [062421dece3315bd6f228ad6d468cba083d0a2d5] 0.20.1
git bisect good 062421dece3315bd6f228ad6d468cba083d0a2d5
# good: [062421dece3315bd6f228ad6d468cba083d0a2d5] 0.20.1
git bisect good 062421dece3315bd6f228ad6d468cba083d0a2d5
# bad: [1848d0bc3632ef128d70736695a3fb969f6e0c69] Add tree included ranges getter
git bisect bad 1848d0bc3632ef128d70736695a3fb969f6e0c69
# good: [764c8c88ca2fb100f40fdbfbae61ffd32585fc67] last tweaks
git bisect good 764c8c88ca2fb100f40fdbfbae61ffd32585fc67
# good: [c0e1991f6bf439a52fb8a5a1176db341c56a7d63] :art: ts_parser__lex
git bisect good c0e1991f6bf439a52fb8a5a1176db341c56a7d63
# bad: [20d44ed13eb2785683a018bf9b58e40e381a3c3c] Merge pull request #1785 from petrisch/patch-1
git bisect bad 20d44ed13eb2785683a018bf9b58e40e381a3c3c
# bad: [3739b7794e381582c8f4a37b2ccec756c3504984] Merge pull request #1819 from slackner/uint16-production-id
git bisect bad 3739b7794e381582c8f4a37b2ccec756c3504984
# bad: [01df16ca9faa1635909ebea242deac200624d9ee] lib: 0.20.8
git bisect bad 01df16ca9faa1635909ebea242deac200624d9ee
# bad: [04381dcea31c9a6ccc4c4eeaf306e867f5fcc893] Add more python error recovery tests
git bisect bad 04381dcea31c9a6ccc4c4eeaf306e867f5fcc893
# bad: [5aa2f4dc8c00dacf72d6b46a75b2020c3d4d99cb] Log when ignoring an empty external token after an error
git bisect bad 5aa2f4dc8c00dacf72d6b46a75b2020c3d4d99cb
# bad: [d223a81b5064587af3a5f61f52d519670ba8995f] Allow empty external tokens during err recovery if they change the scanner's state
git bisect bad d223a81b5064587af3a5f61f52d519670ba8995f
# first bad commit: [d223a81b5064587af3a5f61f52d519670ba8995f] Allow empty external tokens during err recovery if they change the scanner's state
d223a81b5064587af3a5f61f52d519670ba8995f is the first bad commit
commit d223a81b5064587af3a5f61f52d519670ba8995f
Author: Max Brunsfeld <maxbrunsfeld@gmail.com>
Date: Fri Jun 24 15:58:13 2022 -0700
Allow empty external tokens during err recovery if they change the scanner's state
lib/src/parser.c | 64 ++++++++++++++++---------
lib/src/subtree.c | 48 ++++++++++++-------
lib/src/subtree.h | 12 ++++-
test/fixtures/error_corpus/python_errors.txt | 71 +++++++++++++++++++++++-----
4 files changed, 143 insertions(+), 52 deletions(-)
Some more investigating, the bug wasn't actually in tree-sitter's code, but the scanner.
diff --git a/src/scanner.cc b/src/scanner.cc
index e9ba26b..970a0e7 100644
--- a/src/scanner.cc
+++ b/src/scanner.cc
@@ -139,7 +139,7 @@ struct Scanner {
i += delimiter_count;
for (; i < length; i++) {
- indent_length_stack.push_back(buffer[i]);
+ indent_length_stack.push_back((unsigned char)buffer[i]);
}
}
}
The non explicit cast caused any indent that is UINT8_T max (255), to be truncated to -1, and then when that's expanded to a uint16_t in push_back, we get a wildly different result in the vector - so we gained an extra char. Then, the loop in ts_parser__lex compares the size of the serialized buffer length and the length returned when comparing if the external scanner state changes, and as a result they will never be equal because on serialize after this bug occurs - we write one less than is read, leading to a back-and-forth off-by-one fight which can be observed by adding a debug print in ts_external_scanner_state_eq:
bool ts_external_scanner_state_eq(const ExternalScannerState *a, const char *buffer, unsigned length) {
printf("a->length: %d, length: %d\n", a->length, length);
return
a->length == length &&
memcmp(ts_external_scanner_state_data(a), buffer, length) == 0;
}
demo:
As a result - #221 would actually fix this as I cast everything properly
@maxbrunsfeld I'd appreciate a review of both #220 and #221 :)
Wow, nice investigation @amaanq. Thanks so much for your work on this - I’ll review in the morning.
Hello, While I've parsed countless files so far without issue, when parsing this very file tree-sitter deadlocks and starts consuming more and more memory (100s of GB) until there is no more left, and it crashes with core dumped. Note that if I replace all the tabs
\t
by 4 spaces, then the parser however works just fine as usual.versions:
tree-sitter==0.20.1
tree-sitter-python
at tagc6cfa75 4f25ce7
(master as of Feb 28th)repro: