tree-sitter / tree-sitter-python

Python grammar for tree-sitter
MIT License
383 stars 140 forks source link

Deadlock/memory leak/crash when parsing a specific file #207

Closed jrapin closed 1 year ago

jrapin commented 1 year ago

Hello, While I've parsed countless files so far without issue, when parsing this very file tree-sitter deadlocks and starts consuming more and more memory (100s of GB) until there is no more left, and it crashes with core dumped. Note that if I replace all the tabs \t by 4 spaces, then the parser however works just fine as usual.

versions:

repro:

import requests
from pathlib import Path
import tree_sitter as ts

# download code to parse
url = "https://raw.githubusercontent.com/Tomi23R/sureBet-bot/294dda827eab8440fe06b97f698beabf6b580946/crawler/tenis_codere.py"
res = requests.get(url)
code = res.content.decode("utf8")

# instantiate tree-sitter
lib_path = Path().resolve().absolute() / "tree-sitter" / "python.so"
assert lib_path.exists(), "Update the lib_path to your install"
language = ts.Language(lib_path, "python")
parser = ts.Parser()
parser.set_language(language)

# this does not deadlock:
parser.parse(code.replace("\t", "    ").encode("utf8"))

# this deadlocks:
parser.parse(code.encode("utf8"))
jrapin commented 1 year ago

Same pattern for this file as well (deadlocks as is, but not if we convert tabs into spaces): https://github.com/dLauraM/PROYECTO-01-MARTINEZ-DIANALAURA/blob/3381270416d92240ba9e5f6ffa246e70d26bb3c1/main.py

maxbrunsfeld commented 1 year ago

Thanks for the report! Could you try to remove parts of that file to get a minimal reproduction for this bug?

jrapin commented 1 year ago

Hi @maxbrunsfeld , thank you for your answer. Based on the second example, the simpler (and pointless) code I found to bug was this:

bytecode = b'def main():\n\t\t\t\t\t\t\t\t\t\t\t\tif True:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 2:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n\t\t\t\t\t\t\t\t\t\t\t\telif option_menu == 3:\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tfunc()\n'
parser.parse(bytecode)
pqn commented 1 year ago

Confirming I also see similar failures on 0.20.8 while fuzz testing on tree-sitter-python commit 62827156d01c74dc1538266344e788da74536b8a (base64 encoding: uwoJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCQkJCXsrYkRBAEhiQEEAfEBj).

ChillMagic commented 1 year ago

I have located the issue, which was introduced by tree-sitter version 0.20.7 (0.20.6 is good). py-tree-sitter version 0.20.1, which uses tree-sitter version 0.20.7, triggers this issue. In contrast, py-tree-sitter version 0.20.0, which uses tree-sitter version 0.20.2, does not have this issue.

amaanq commented 1 year ago

This interested me, so I decided to bisect it as a start:

Log:

git bisect start
# status: waiting for both good and bad commits
# bad: [d0029a15273e526925a764033e9b7f18f96a7ce5] Avoid unused value warning from array_pop
git bisect bad d0029a15273e526925a764033e9b7f18f96a7ce5
# status: waiting for good commit(s), bad commit known
# good: [062421dece3315bd6f228ad6d468cba083d0a2d5] 0.20.1
git bisect good 062421dece3315bd6f228ad6d468cba083d0a2d5
# good: [062421dece3315bd6f228ad6d468cba083d0a2d5] 0.20.1
git bisect good 062421dece3315bd6f228ad6d468cba083d0a2d5
# bad: [1848d0bc3632ef128d70736695a3fb969f6e0c69] Add tree included ranges getter
git bisect bad 1848d0bc3632ef128d70736695a3fb969f6e0c69
# good: [764c8c88ca2fb100f40fdbfbae61ffd32585fc67] last tweaks
git bisect good 764c8c88ca2fb100f40fdbfbae61ffd32585fc67
# good: [c0e1991f6bf439a52fb8a5a1176db341c56a7d63] :art: ts_parser__lex
git bisect good c0e1991f6bf439a52fb8a5a1176db341c56a7d63
# bad: [20d44ed13eb2785683a018bf9b58e40e381a3c3c] Merge pull request #1785 from petrisch/patch-1
git bisect bad 20d44ed13eb2785683a018bf9b58e40e381a3c3c
# bad: [3739b7794e381582c8f4a37b2ccec756c3504984] Merge pull request #1819 from slackner/uint16-production-id
git bisect bad 3739b7794e381582c8f4a37b2ccec756c3504984
# bad: [01df16ca9faa1635909ebea242deac200624d9ee] lib: 0.20.8
git bisect bad 01df16ca9faa1635909ebea242deac200624d9ee
# bad: [04381dcea31c9a6ccc4c4eeaf306e867f5fcc893] Add more python error recovery tests
git bisect bad 04381dcea31c9a6ccc4c4eeaf306e867f5fcc893
# bad: [5aa2f4dc8c00dacf72d6b46a75b2020c3d4d99cb] Log when ignoring an empty external token after an error
git bisect bad 5aa2f4dc8c00dacf72d6b46a75b2020c3d4d99cb
# bad: [d223a81b5064587af3a5f61f52d519670ba8995f] Allow empty external tokens during err recovery if they change the scanner's state
git bisect bad d223a81b5064587af3a5f61f52d519670ba8995f
# first bad commit: [d223a81b5064587af3a5f61f52d519670ba8995f] Allow empty external tokens during err recovery if they change the scanner's state

d223a81b5064587af3a5f61f52d519670ba8995f is the first bad commit
commit d223a81b5064587af3a5f61f52d519670ba8995f
Author: Max Brunsfeld <maxbrunsfeld@gmail.com>
Date:   Fri Jun 24 15:58:13 2022 -0700

    Allow empty external tokens during err recovery if they change the scanner's state

 lib/src/parser.c                             | 64 ++++++++++++++++---------
 lib/src/subtree.c                            | 48 ++++++++++++-------
 lib/src/subtree.h                            | 12 ++++-
 test/fixtures/error_corpus/python_errors.txt | 71 +++++++++++++++++++++++-----
 4 files changed, 143 insertions(+), 52 deletions(-)
amaanq commented 1 year ago

Some more investigating, the bug wasn't actually in tree-sitter's code, but the scanner.

diff --git a/src/scanner.cc b/src/scanner.cc
index e9ba26b..970a0e7 100644
--- a/src/scanner.cc
+++ b/src/scanner.cc
@@ -139,7 +139,7 @@ struct Scanner {
       i += delimiter_count;

       for (; i < length; i++) {
-        indent_length_stack.push_back(buffer[i]);
+        indent_length_stack.push_back((unsigned char)buffer[i]);
       }
     }
   }

The non explicit cast caused any indent that is UINT8_T max (255), to be truncated to -1, and then when that's expanded to a uint16_t in push_back, we get a wildly different result in the vector - so we gained an extra char. Then, the loop in ts_parser__lex compares the size of the serialized buffer length and the length returned when comparing if the external scanner state changes, and as a result they will never be equal because on serialize after this bug occurs - we write one less than is read, leading to a back-and-forth off-by-one fight which can be observed by adding a debug print in ts_external_scanner_state_eq:

bool ts_external_scanner_state_eq(const ExternalScannerState *a, const char *buffer, unsigned length) {
  printf("a->length: %d, length: %d\n", a->length, length);
  return
    a->length == length &&
    memcmp(ts_external_scanner_state_data(a), buffer, length) == 0;
}

demo:

image

As a result - #221 would actually fix this as I cast everything properly

@maxbrunsfeld I'd appreciate a review of both #220 and #221 :)

maxbrunsfeld commented 1 year ago

Wow, nice investigation @amaanq. Thanks so much for your work on this - I’ll review in the morning.