m-novikov / tree-sitter-sql

SQL syntax highlighting for tree-sitter
MIT License
110 stars 32 forks source link

Generated too large parser.c ~ 83M #59

Open pplam opened 2 years ago

pplam commented 2 years ago

With all PRs merged, tree-sitter will generate a very large parser.c file (about 83M). And it takes about 1 min to load the parser by tree-sitter python binding.

Unfortunately I don't have an idea at the moment on how to reduce its size 😭

@m-novikov Have you ever encountered a similar problem, or have any suggestions for optimization?

imotai commented 2 years ago

I think we can generate the parser.c when building this project

pplam commented 2 years ago

Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated.

With compared to the tree-sitter-python parser, the size of the grammar.js file 2x larger (1833 lines vs 967 lines).

But the generated parser.c is almost 64x larger (83M vs 1.3M) than the one in tree-sitter-python.

I don't have an idea on this yet 😭

imotai commented 2 years ago

Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated.

With compared to the tree-sitter-python parser, the size of the grammar.js file 2x larger (1833 lines vs 967 lines).

But the generated parser.c is almost 64x larger (83M vs 1.3M) than the one in tree-sitter-python.

I don't have an idea on this yet 😭

Have you try the release mode ? the follow library produced by cargo build --release seems not so large.

image
m-novikov commented 2 years ago

There is related issue in https://github.com/tree-sitter/tree-sitter/issues/1041 While removing parser from the repo is a good stop-gap solution, we would need to investigate the cause, generation times also getting progressively longer.

pplam commented 2 years ago

There is related issue in tree-sitter/tree-sitter#1041 While removing parser from the repo is a good stop-gap solution, we would need to investigate the cause, generation times also getting progressively longer.

@m-novikov Yes, I've seen this similar issue too. But there is not much useful information here.

By remove the generated code from the repo, we just ignore this issue.

On the other hand, when a user using tree-sitter-sql, for example, using it in Python code via the tree-sitter python package. The user use it by the following steps:

  1. Build language (for tree-sitter-sql) by the tree-sitter package, it will look for the parser.c and scanner.cc files in the src dir and compiles them;
  2. Create a new tree-sitter.Parser instance and set language to the language built in step 1;
  3. The src/node-types.json and src/grammar.json can also be loaded and used by users;

If we remove the src dir from the repo, the users still need to generate it by themselves. It will take 40s+ to generate this files (for PR#58). Also the size of parser.c is large (83M+), this leads to a long time (about 1min) to build language (the step 1).

So, to solve the problem, we need to do something to reduce the size of generated code. By now, I did some investigations on this.

  1. Collect generated files size for each commit to see which commit increased the size too much. The following are the top commits:
hexsha commit_message grammar.js grammar.json node-types.json parser.c
8d09bf6af0ba81c799e05d839263ec3b3c49d1e2 Support CTE, CASE, etc. New supportings:   - CTEs syntax;   - DELETE statement;   - ALTER TABLE RENAME COLUMN;   - Conditional expressions;   - LIMIT clause;   - AT TIME ZONE expression; Improvements:   - The ORDER BY clause;   - The JOIN clause;   - The select_subexpression;   - The UPDATE statement;   - Array type;   - $.type: multi-words types (VARYING, PRECISION, WITH/WITHOUT TIME ZONE);   - The $.field_access expression to $.json_access expression; 0.02M, +0.0M, +2.46% 0.14M, +0.0M, +2.21% 0.11M, +0.01M, +10.77% 3.95M, +1.59M, +67.12%
f1a4b8c896cfd4041ac2283c9f4b6d814a7c0c5d Improve create function statement & support create trigger statement 0.03M, +0.0M, +19.18% 0.18M, +0.04M, +30.48% 0.12M, +0.01M, +13.91% 11.89M, +7.94M, +200.86%
bf3ea7293f436b1d226e6d5179a064f4344aa65f Improve function call to support aggregate expressions 0.03M, -0.0M, -0.48% 0.28M, +0.0M, +1.69% 0.15M, +0.0M, +0.82% 21.78M, +2.82M, +14.88%
1cdb6abb48a4013dd7256b064e6e83574e3d0cee Support window definitions & window function calls 0.04M, +0.0M, +4.05% 0.29M, +0.01M, +4.06% 0.16M, +0.01M, +4.31% 25.25M, +3.47M, +15.91%
d68da8ace84b5c9b66dcd9243afbb5dade795909 Improved the FROM clause to make it more general 0.04M, +0.0M, +3.28% 0.3M, +0.01M, +3.5% 0.16M, +0.01M, +5.24% 34.1M, +8.85M, +35.07%
84cedb178c6a94952f1feb3083bd4e7ef54c69d8 Improve the ALTER TABLE statement 0.05M, +0.01M, +21.2% 0.38M, +0.06M, +19.64% 0.22M, +0.04M, +19.67% 38.05M, +2.45M, +6.88%
33a6c85ad3a26ac3d6ab0a7c54a5b2fb2e9e495c Improve the INSERT statement 0.05M, +0.0M, +1.97% 0.38M, +0.01M, +1.84% 0.22M, +0.01M, +2.39% 54.43M, +16.26M, +42.61%
42f750d88408fee9b990c917b907a88e96ac0111 Support more comparison predicates 1. BETWEEN...AND expression; 2. ISNULL and NOTNULL; 3. IS UNKNOWN and IS NOT UNKNOWN; 0.05M, +0.0M, +1.14% 0.39M, +0.0M, +0.78% 0.24M, +0.02M, +10.1% 61.25M, +6.33M, +11.53%
2660004c3495889ed947e1fb7979c17522f7a23f Support composite values & resolve syntax conflicts 0.05M, +0.0M, +1.15% 0.39M, +0.0M, +1.03% 0.25M, +0.01M, +3.89% 62.51M, +1.26M, +2.05%
87107e91e5b09b8d235f61e5135016d08fccd129 Support combining queries, JSON comparison operators and type cast as column default 0.05M, +0.0M, +2.02% 0.4M, +0.01M, +1.29% 0.25M, +0.0M, +0.47% 79.69M, +17.18M, +27.48%
5574bf3ca47507e87e3524f38783a21f0d33841d Support VACUUM statement & improve alias 0.05M, +0.0M, +2.28% 0.41M, +0.01M, +3.04% 0.25M, +0.0M, +0.69% 82.5M, +2.81M, +3.53%

  1. Then I checked changes of grammar.js for these top commits, and tried simply comment out some rules or clauses to see the effect. Following are the modifications and result sizes I've tried:
>>>>>>>>>> Sizes in the latest version

-rw-r--r--  1 tim  staff   421K Jul  8 21:49 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   264K Jul  8 21:50 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    83M Jul  8 21:50 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

>>>>>>>>>> With the select statement modified

1. Removed select_statement
-rw-r--r--  1 tim  staff   413K Jul  8 21:39 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   257K Jul  8 21:39 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    29M Jul  8 21:39 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

2. Removed window_clause in select_statement
-rw-r--r--  1 tim  staff   420K Jul  8 21:42 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   263K Jul  8 21:42 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    82M Jul  8 21:42 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

3. Removed from_clause in select_statement
-rw-r--r--  1 tim  staff   421K Jul  8 21:44 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   264K Jul  8 21:45 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    62M Jul  8 21:45 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

4. Removed group_by_clause in select_statement
-rw-r--r--  1 tim  staff   421K Jul  8 21:51 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   263K Jul  8 21:51 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    75M Jul  8 21:51 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

5. Removed window_clause, from_clause, group_by_clause from select_statement
-rw-r--r--  1 tim  staff   420K Jul  8 21:53 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   263K Jul  8 21:53 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    53M Jul  8 21:53 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

>>>>>>>> With expressions modified

1. Removed select_subexpression from _simple_expression
-rw-r--r--  1 tim  staff   421K Jul  8 22:00 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   259K Jul  8 22:00 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    83M Jul  8 22:00 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

2. Removed select_subexpression from all rules
-rw-r--r--  1 tim  staff   420K Jul  8 22:07 grammar.json
-rw-r--r--  1 tim  staff    13K Jun 13 10:36 grammar.json.rej
-rw-r--r--  1 tim  staff   259K Jul  8 22:07 node-types.json
-rw-r--r--  1 tim  staff    17K Jun 13 10:36 node-types.json.rej
-rw-r--r--  1 tim  staff    78M Jul  8 22:07 parser.c
-rw-r--r--  1 tim  staff   5.5M Jun 13 10:36 parser.c.rej
-rw-r--r--  1 tim  staff   4.2K Jul  6 01:59 scanner.cc
drwxr-xr-x  3 tim  staff    96B May 31 18:20 tree_sitter

From the above data, we can see that the SELECT statement introduced a large amount of size.

Unfortunately , on how to optimize it, I still have no good idea. Could you guys give me some suggestion?

pplam commented 2 years ago

Yeah, there doesn't seem to be redundant code here, nor particularly complex rules, but still a very large parser is generated. With compared to the tree-sitter-python parser, the size of the grammar.js file 2x larger (1833 lines vs 967 lines). But the generated parser.c is almost 64x larger (83M vs 1.3M) than the one in tree-sitter-python. I don't have an idea on this yet 😭

Have you try the release mode ? the follow library produced by cargo build --release seems not so large. image

It seems that the parser.c, scanner.cc etc. files are not generated by this manner, but those file are required when using the tree-sitter-sql, at least in my case (use it in Python). Please take a look at my previous comment for more details.

m-novikov commented 2 years ago

parser.c is generated file, but scanner.cc is not. Use in python implies that user run tree-sitter generate before. You can see it in #63 binding.gyp in this PR has a step to generate files required.