quarylabs / sqruff

Fast SQL formatter/linter
https://playground.quary.dev/?secondary=Format
Apache License 2.0

Performance degrades exponentially as the number of lines in a file increases #939

Open concur1 opened 2 weeks ago

concur1 commented 2 weeks ago

I have tested this by generating SQL files with an increasing number of lines, linting each file, and measuring how long each lint takes.

Here is the result in graph form: [plot: lint time vs. number of lines]

In case you would like to try to replicate this, here is my config file:

[sqlfluff]
dialect = bigquery
max_line_length = 120
exclude_rules = CP02, LT01

[sqlfluff:indentation]
tab_space_size = 2

[sqlfluff:rules:capitalisation.keywords]
extended_capitalisation_policy = upper

[sqlfluff:rules:capitalisation.functions]
extended_capitalisation_policy = lower

[sqlfluff:rules:capitalisation.types]
extended_capitalisation_policy = upper

Here is the final file with 608 lines: test_sql.txt (I couldn't upload a .sql file here for some reason).

Here is the Python script I used to generate and lint the different files:

initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
  id STRING NOT NULL,
  product STRING NOT NULL,
  cost INTEGER NOT NULL,
  timestamp_added TIMESTAMP NOT NULL  
);

INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""

import random
import subprocess
import time

# Generate a test file with `size_param` VALUES tuples, lint it with sqruff,
# and append "<line count>, <seconds>" to results.csv.
def run_sqruff_check(size_param):
    insert_value_strings_list = []
    for i in range(size_param):
        # single quotes inside the f-string keep this valid on Python < 3.12
        insert_value_strings_list.append(f"""(
        {i},
        {random.choice(['bike', 'car', 'tv', 'orange'])},
        {random.choice([100, 200, 300, 400])},
        current_timestamp()
    )""")

    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)

    with open("./test.sql", "w") as file:
        file.write(output)

    start_time = time.time()
    subprocess.run(["sqruff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # 14 lines for the CREATE TABLE / INSERT preamble plus 6 lines per VALUES tuple.
    with open("./results.csv", "a") as file:
        file.write(f"{(size_param * 6) + 14}, {time.time() - start_time}\n")

for size_param in range(100):
    run_sqruff_check(size_param)
gvozdvmozgu commented 2 weeks ago

[screenshot] It seems the issue is with the lexer, specifically with the regular expressions.
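For anyone curious what that class of problem looks like, here is a minimal, hypothetical Python sketch (not sqruff's actual lexer, which is Rust): re-slicing the remaining input for every regex match copies O(n) characters per token, so lexing n tokens costs O(n^2), while matching at an offset stays linear.

# Illustrative only: shows how regex-driven lexing can go quadratic
# depending on how the remaining input is handled.
import re
import time

TOKEN = re.compile(r"\s+|\w+|[^\w\s]")  # whitespace, word, or single symbol

def lex_quadratic(text):
    count = 0
    while text:
        m = TOKEN.match(text)
        text = text[m.end():]  # copies the rest of the input on every token
        count += 1
    return count

def lex_linear(text):
    count, pos = 0, 0
    while pos < len(text):
        m = TOKEN.match(text, pos)  # match at an offset, no copying
        pos = m.end()
        count += 1
    return count

sql = "SELECT col FROM tbl;\n" * 5000
for fn in (lex_quadratic, lex_linear):
    start = time.time()
    fn(sql)
    print(fn.__name__, round(time.time() - start, 3), "s")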

concur1 commented 1 week ago

This probably fixes it for my use case, but I think the performance still degrades exponentially, just more slowly. I have a plot demonstrating this below, but first I just want to say thank you for looking into this @gvozdvmozgu and @benfdking; I didn't expect a fix to be done so quickly! This really improves things for my use case.

Here is the plot with lines for both before and after the fix, if you are interested: [plot: lint time vs. lines, before and after the fix]

Here is a plot comparing sqruff after the fix with sqlfluff. I know this is a special case, as it is a simple example, and in my experience sqruff will generally beat sqlfluff in performance on more complex files.

Even so, I want to share this plot, which shows the sqruff lint time increasing exponentially while the sqlfluff lint time looks like it is increasing linearly, or at least with a much less steep curve. The sqlfluff byte limit kicks in a bit after 1500 lines, which is why there is a drop-off in the sqlfluff times. [plot: sqruff vs. sqlfluff lint time by line count]
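As a rough way to check "exponential vs. polynomial" from the results.csv the script writes (a minimal sketch, assuming numpy is installed; the CSV columns are line count and seconds): a power law t ~ a*n^k is a straight line in log-log space, while an exponential t ~ a*exp(b*n) is a straight line in log-linear space, so compare which fit is tighter.

import numpy as np

data = np.loadtxt("results.csv", delimiter=",")  # columns: line count, seconds
n, t = data[:, 0], data[:, 1]

def straight_line_fit(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    r_squared = 1 - residuals.var() / y.var()
    return r_squared, slope

r2_power, k = straight_line_fit(np.log(n), np.log(t))  # t ~ a * n**k
r2_exp, b = straight_line_fit(n, np.log(t))            # t ~ a * exp(b * n)
print(f"power-law fit:   R^2 = {r2_power:.3f}, exponent k ~ {k:.2f}")
print(f"exponential fit: R^2 = {r2_exp:.3f}, rate b ~ {b:.5f}")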

gvozdvmozgu commented 1 week ago

How can I replicate this?

concur1 commented 1 week ago

How can I replicate this?

I created the data for all of these plots with the above Python script. Running the script creates a file called results.csv; you just need to rename/move the results file after each run.
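For example, a tiny helper (hypothetical, not part of the original script) to stash the data from one run under a tool-specific name before the next run appends to results.csv:

import shutil

# e.g. after timing sqruff, move the data aside before re-running for sqlfluff
shutil.move("results.csv", "sqruff_results.csv")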

To run for the latest sqruff I changed this line:

subprocess.run(["sqruff", "lint"]

To instead be

subprocess.run(["cargo", "run", "lint"]

(This works if you are running the script from the cli directory with the .sqruff file present, as cargo run builds and runs the sqruff CLI here.)

To run for sqlfluff, sqlfluff must be installed and that line must be changed to:

subprocess.run(["sqlfluff", "lint"]

Hope that helps. If not, I will try to create a notebook, which will hopefully be a bit more reproducible, and share that with you.

concur1 commented 1 week ago

One other thing: you will want to change the value passed to the range function to control how many loops there are and how many files are created.

To create that last graph there were 300 loops.

concur1 commented 1 week ago

One more thing: when running for sqlfluff, you need to create the .sqlfluff file with the config that I gave above.

concur1 commented 1 week ago

Ok, I have made a single script that should work as long as you have sqlfluff installed and run it from the cli directory of the sqruff repo. It will create two files: sqruff_results.csv and sqlfluff_results.csv:

import random
import subprocess
import time

initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
  id STRING NOT NULL,
  product STRING NOT NULL,
  cost INTEGER NOT NULL,
  timestamp_added TIMESTAMP NOT NULL  
);

INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""

# Generate a test file with `size_param` VALUES tuples, lint it with the given
# command, and append "<line count>, <seconds>" to the given results file.
def run_sqruff_check(size_param: int, output_filename: str, command: list):
    insert_value_strings_list = []
    for i in range(size_param):
        # single quotes inside the f-string keep this valid on Python < 3.12
        insert_value_strings_list.append(f"""(
        {i},
        {random.choice(['bike', 'car', 'tv', 'orange'])},
        {random.choice([100, 200, 300, 400])},
        current_timestamp()
    )""")

    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)

    with open("./test.sql", "w") as file:
        file.write(output)

    start_time = time.time()
    subprocess.run(command, check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    elapsed = time.time() - start_time

    # 14 lines for the CREATE TABLE / INSERT preamble plus 6 lines per VALUES tuple.
    line_count = (size_param * 6) + 14
    with open(f"./{output_filename}", "a") as file:
        file.write(f"{line_count}, {elapsed}\n")
    print(f"{line_count}, {elapsed}")

def run_performance_checks(loops: int, output_filename: str, command: list):
    # Write a minimal config for sqlfluff; sqruff picks up the .sqruff file
    # already present in the repo's cli directory.
    with open("./.sqlfluff", "w") as file:
        file.write("[sqlfluff]\ndialect = bigquery\n")

    for size_param in range(loops):
        run_sqruff_check(size_param, output_filename, command)

print("running sqruff performance checks")
run_performance_checks(200, "sqruff_results.csv", ["cargo", "run", "lint"])

print("running sqlfluff performance checks")
run_performance_checks(200, "sqlfluff_results.csv", ["sqlfluff", "lint"])

It only runs up to a file size of 1208 lines, but that is enough to see sqruff's performance start lagging behind sqlfluff's (at least on my machine).
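If it helps, here is a plotting sketch (assuming matplotlib and numpy are installed and the two CSVs above have been produced) that overlays the two sets of timings on one chart:

import matplotlib.pyplot as plt
import numpy as np

for name in ("sqruff_results.csv", "sqlfluff_results.csv"):
    data = np.loadtxt(name, delimiter=",")  # columns: line count, seconds
    plt.plot(data[:, 0], data[:, 1], label=name.replace("_results.csv", ""))

plt.xlabel("lines in test.sql")
plt.ylabel("lint time (seconds)")
plt.legend()
plt.savefig("lint_times.png")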

Let me know if it doesn't work for some reason.