tree-sitter / tree-sitter

An incremental parsing system for programming tools
https://tree-sitter.github.io
MIT License

`Error: Invalid argument` when parsing large json files #3473

Open xloc opened 1 month ago

xloc commented 1 month ago

Problem

Not sure if it is the correct place to post this issue, but ...

I'm using the Node.js binding of tree-sitter, and Parser.parse(string) throws the following error when I try to parse a large i18n JSON file (~1.2M).

/home/zxie/workspace/tree-sit-large-file-test/node_modules/tree-sitter/index.js:361
    ? parse.call(
            ^

Error: Invalid argument
    at Parser.parse (/home/zxie/workspace/tree-sit-large-file-test/node_modules/tree-sitter/index.js:361:13)
    at file:///home/zxie/workspace/tree-sit-large-file-test/index.js:27:15
    at ModuleJob.run (node:internal/modules/esm/module_job:262:25)
    at async onImport.tracePromise.__proto__ (node:internal/modules/esm/loader:485:26)
    at async asyncRunEntryPointWithESMLoader (node:internal/modules/run_main:109:5)

Code

import Parser from 'tree-sitter';
import JsonLanguage from 'tree-sitter-json';

const parser = new Parser();
parser.setLanguage(JsonLanguage);

import {readFileSync} from 'fs';
const file = readFileSync('./lang.json', 'utf-8');

const tree = parser.parse(file);
console.log(tree.rootNode.text.slice(0, 100));

I also tried the Parser.Input callback instead of passing in a string, but node.text doesn't load properly:

const lines = file.split('\n');
const lineGetter = (_, p) => {
    if (!p) return null;
    const line = lines[p.row];
    if (!line) return null;
    return line[p.column] + '\n';
}
tree = parser.parse(lineGetter);
console.log(tree.rootNode.text.slice(0, 100));

This outputs:

nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull
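
One possible reading of the `null` output, offered as an assumption rather than something confirmed in this thread: the binding may call the input callback with only a character offset when it rebuilds node.text, so a callback that depends on the position argument `p` returns null, and string concatenation turns that into the text "null". A minimal offset-based callback might look like the sketch below. `makeChunkReader`, `chunkSize`, and `readText` are hypothetical names, and `readText` only imitates how a consumer could stitch chunks together; it is not the binding's code.

```javascript
// Hypothetical helper: build an input callback that answers by character
// offset only, so it behaves the same whether or not a position is passed.
function makeChunkReader(source, chunkSize = 4096) {
  return (index) => {
    // Signal end of input once the offset runs past the source.
    if (index >= source.length) return null;
    return source.slice(index, index + chunkSize);
  };
}

// Stand-in for a consumer (NOT the binding's real code): keep asking for
// text from the current offset until the requested range is covered.
function readText(input, startIndex, endIndex) {
  let result = '';
  const goalLength = endIndex - startIndex;
  while (result.length < goalLength) {
    const chunk = input(startIndex + result.length);
    if (chunk === null) break; // ran out of input
    result += chunk;
  }
  return result.slice(0, goalLength);
}

const sample = '{\n"greeting": "hello",\n"farewell": "bye"\n}';
const reader = makeChunkReader(sample, 8);
console.log(readText(reader, 0, 12));
```

With the Node binding this would be passed as `parser.parse(makeChunkReader(file))`. Whether that also avoids the `Invalid argument` error is not established here.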

For details, please refer to my repo of minimal reproducible example.
https://github.com/xloc/tree-sit-large-file-test

Thanks in advance!

Steps to reproduce

git clone https://github.com/xloc/tree-sit-large-file-test.git
cd tree-sit-large-file-test
npm i
node index.js

Expected behavior

The file is parsed correctly without errors.

Tree-sitter version (tree-sitter --version)

tree-sitter=0.21.1; tree-sitter-json=0.21.0

Operating system/version

Windows 10.0.19044 N/A Build 19044; Node.js v22.4.0

WillLillis commented 1 month ago

Interesting, the issue doesn't show up when using the C library or the Rust bindings:

Rust

```rust
fn main() {
    let mut parser = tree_sitter::Parser::new();
    parser.set_language(&tree_sitter_json::language()).unwrap();
    let file = include_str!("../lang.json");
    let tree = parser.parse(file, None).unwrap();
    print!(
        "{:?}",
        &tree.root_node().utf8_text(file.as_bytes()).unwrap()[0..100]
    );
}
```

C

```c
#include "tree_sitter/api.h"
#include <stdio.h>
#include <stdlib.h>

const TSLanguage *tree_sitter_json(void);

int main(int argc, char *argv[]) {
    TSParser *parser = ts_parser_new();
    ts_parser_set_language(parser, tree_sitter_json());

    FILE *fp = fopen("lang.json", "r");
    if (!fp) {
        perror("Failed to read in file");
        return EXIT_FAILURE;
    }
    fseek(fp, 0L, SEEK_END);
    size_t sz = ftell(fp);
    rewind(fp);

    char *source = malloc(sz);
    if (!source) {
        perror("Failed to allocate source buffer");
        return EXIT_FAILURE;
    }
    // Use the byte count actually read; the buffer is not NUL-terminated.
    size_t len = fread(source, 1, sz, fp);
    fclose(fp);

    TSTree *tree = ts_parser_parse_string(parser, NULL, source, (uint32_t)len);
    if (tree == NULL) {
        fprintf(stderr, "tree is NULL\n");
        return EXIT_FAILURE;
    }
    TSNode root_node = ts_tree_root_node(tree);
    uint32_t start = ts_node_start_byte(root_node);
    for (size_t idx = 0; idx < 100; idx++) {
        putchar(source[start + idx]);
    }
    return EXIT_SUCCESS;
}
```

[lillis@LaptopWill] ~/projects/tree-sit-large-file-test (main) ⚡
❯ gcc -I /home/lillis/projects/tree-sitter/lib/include test.c /home/lillis/projects/tree-sitter-json/src/parser.c /home/lillis/projects/tree-sitter/libtree-sitter.a -o test

[lillis@LaptopWill] ~/projects/tree-sit-large-file-test (main) ⚡
❯ ./test
{
"h6aTZoqB1C": "IppZxiu3b4NPay7bTT9f2pB9oHrttFEEjneB6CGjvy4zh2u6FdOJUAYqAaS39KWkfrG7PlbOrHrH8laiJCg%

[lillis@LaptopWill] ~/projects/tree-sit-large-file-test (main) ⚡
❯ cargo run
   Compiling tree-sit-large-file-test v0.1.0 (/home/lillis/projects/tree-sit-large-file-test)
    Finished dev [unoptimized + debuginfo] target(s) in 0.62s
     Running `target/debug/tree-sit-large-file-test`
"{\n\"h6aTZoqB1C\": \"IppZxiu3b4NPay7bTT9f2pB9oHrttFEEjneB6CGjvy4zh2u6FdOJUAYqAaS39KWkfrG7PlbOrHrH8laiJCg"%
xloc commented 1 month ago

Thanks a lot for the quick reply!

So the problem is very likely on the Node.js side.

kzhongffish commented 3 weeks ago

I hit the same problem with tree-sitter-javascript 0.21.4. Setting an appropriate buffer size solved it.

    // Define the options with bufferSize
    const options: Parser.Options = {
      bufferSize: 1024 * 1024, // Set the bufferSize to 1 MB (1024 KB)
    };
    parser.parse(jsCode, undefined, options);
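
The snippet above hard-codes 1 MB, which would be smaller than the ~1.2M file from the original report. A rough sketch of sizing the buffer from the input instead follows. It rests on assumptions this thread does not confirm: that a buffer at least as long as the input is sufficient, and that the size is counted in string length units. `bufferSizeFor`, `parseOptionsFor`, the `floor` default, and the `+ 1` margin are all hypothetical choices made for illustration.

```javascript
// Hypothetical helper: pick a bufferSize that covers the whole input.
// Assumption: the size is measured in string length units; the floor and
// the +1 margin are arbitrary choices for this sketch.
function bufferSizeFor(source, floor = 32 * 1024) {
  return Math.max(floor, source.length + 1);
}

// Build the options object passed as the third argument to parser.parse.
function parseOptionsFor(source) {
  return { bufferSize: bufferSizeFor(source) };
}

const big = 'x'.repeat(1200000);
console.log(parseOptionsFor(big));
```

It could then be used as `parser.parse(jsCode, undefined, parseOptionsFor(jsCode));`, mirroring the call above.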
xloc commented 3 weeks ago

Awesome! I didn't realize there was an option for buffer size. Will investigate further.

Thanks a lot!