Open concur1 opened 2 weeks ago
It seems the issue is with the lexer, specifically with the regular expressions.
This probably fixes it for my use case, but I think the performance still degrades exponentially, just more slowly. I have a plot demonstrating this below, but I just want to say: thank you for looking into this @gvozdvmozgu and @benfdking, I didn't expect a fix to be done so quickly! This really improves things for my use case.
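For anyone curious what "a regex problem in the lexer" can look like in practice, here is a generic, self-contained illustration using plain Python re (it says nothing about which of sqruff's actual patterns is at fault) of how a backtracking-prone regex turns a linear scan into exponential work:

import re
import time

# Nested quantifiers over the same characters force the engine to try
# exponentially many ways of splitting the input before it can give up.
pattern = re.compile(r"(a+)+b")

for n in range(16, 27, 2):
    text = "a" * n + "c"  # no "b", so the match must fail after backtracking
    start = time.time()
    pattern.match(text)
    print(f"n={n:2d}: {time.time() - start:.3f}s")

The absolute numbers do not matter; the point is that each extra character multiplies the time rather than adding to it.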
Here is the plot with lines for both before and after the fix, if you are interested:
Here is a plot comparing sqruff after the fix to sqlfluff. I know this is a special case, as it is a simple example, and in my experience sqruff will generally beat sqlfluff in performance on more complex files. Even so, I want to share this plot: it shows the sqruff lint time increasing exponentially, while the sqlfluff lint time looks like it is increasing linearly, or at least with a much less steep curve. The sqlfluff byte limit kicks in a bit after 1500 lines, which is why there is a drop-off in the sqlfluff times.
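As an aside, the exponential-versus-linear impression is easy to double-check numerically from the timing data: if growth is roughly linear, doubling the file size should roughly double the lint time, whereas for exponential growth the ratio itself keeps climbing. A small sketch, assuming numpy is installed and a results CSV in the "line count, seconds" format the scripts below produce:

import numpy as np

# Hypothetical helper: compare lint time at roughly n lines and 2n lines.
def doubling_ratio(csv_path: str, n: int) -> float:
    lines, seconds = np.loadtxt(csv_path, delimiter=",", unpack=True)
    t_n = seconds[np.argmin(np.abs(lines - n))]
    t_2n = seconds[np.argmin(np.abs(lines - 2 * n))]
    return t_2n / t_n

for n in (100, 300, 600):
    print(f"{n} -> {2 * n} lines: x{doubling_ratio('results.csv', n):.1f}")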
How can I replicate this?
I created the data for all of these plots with the above python script. Running the script creates a file called results.csv; you just need to rename/move the results file after each run.
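If it helps, the plots themselves can be reproduced with something like this (a sketch, assuming matplotlib and numpy are installed; the file names are just examples of renamed copies of results.csv):

import matplotlib.pyplot as plt
import numpy as np

# One renamed copy of results.csv per run; labels here are examples only.
for path, label in [("results_before_fix.csv", "sqruff before fix"),
                    ("results_after_fix.csv", "sqruff after fix")]:
    lines, seconds = np.loadtxt(path, delimiter=",", unpack=True)
    plt.plot(lines, seconds, label=label)

plt.xlabel("lines in test.sql")
plt.ylabel("lint time (s)")
plt.legend()
plt.show()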
To run against the latest sqruff, I changed this line:
subprocess.run(["sqruff", "lint"]
to instead be
subprocess.run(["cargo", "run", "lint"]
(This works if you are running the script from the cli directory with the .sqruff file present, as cargo run builds and runs the sqruff CLI there.)
To run for sqlfluff, sqlfluff must be installed and that line must be changed to:
subprocess.run(["sqlfluff", "lint"]
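Putting the three variants side by side (a sketch; only the command list changes, and the keyword arguments shown are the ones the combined script further down uses):

import subprocess

subprocess.run(["sqruff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)        # installed sqruff binary
subprocess.run(["cargo", "run", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)  # build and run sqruff from the repo's cli directory
subprocess.run(["sqlfluff", "lint"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)      # sqlfluff for comparison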
Hope that helps. If not, I will try to create a notebook, which should be a bit more reproducible, and share that with you.
One other thing: you will want to change the value inside the range function to control how many loops there are and how many files are created.
To create that last graph there were 300 loops.
One more thing: when running for sqlfluff, you need to create the .sqlfluff file with the config that I gave above.
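For completeness, that config is the same minimal one the combined script below writes to ./.sqlfluff:

[sqlfluff]
dialect = bigquery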
OK, I have made a single script that should work as long as you have sqlfluff installed and you run it from the cli directory of the sqruff repo. It will create two files, sqruff_results.csv and sqlfluff_results.csv:
import random
import subprocess
import time

# Fixed preamble: create the table and open an INSERT whose VALUES list is
# grown on every iteration to produce progressively larger files.
initial_statement_sql = """CREATE TABLE IF NOT EXISTS `project_id.dataset_id.table_id` (
    id STRING NOT NULL,
    product STRING NOT NULL,
    cost INTEGER NOT NULL,
    timestamp_added TIMESTAMP NOT NULL
);
INSERT INTO `project_id.dataset_id.table_id` (
    id,
    product,
    cost,
    timestamp_added
)
VALUES"""


def run_sqruff_check(size_param: int, output_filename: str, command: list):
    # Build size_param random VALUES tuples; each tuple spans 6 lines of SQL.
    insert_value_strings_list = []
    for i in range(size_param):
        insert_value_strings_list.append(f"""(
    {i},
    {random.choice(["bike", "car", "tv", "orange"])},
    {random.choice([100, 200, 300, 400])},
    current_timestamp()
)""")
    output = initial_statement_sql + "\n" + ",\n".join(insert_value_strings_list)
    with open("./test.sql", "w") as file:
        file.write(output)
    # Time a single lint run, discarding the linter's own output.
    start_time = time.time()
    subprocess.run(command, check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # Append "<line count>, <seconds>"; each tuple adds 6 lines to the 14-line preamble.
    with open(f"./{output_filename}", "a") as file:
        file.write(f"{(size_param*6)+14}, {time.time()-start_time}\n")
    print(f"{(size_param*6)+14}, {time.time()-start_time}")


def run_performance_checks(loops: int, output_filename: str, command: list):
    # Both linters pick up the same minimal BigQuery config from ./.sqlfluff.
    with open("./.sqlfluff", "w") as file:
        file.write("""
[sqlfluff]
dialect = bigquery
""")
    for size_param in range(loops):
        run_sqruff_check(size_param, output_filename, command)


print("running sqruff performance checks")
run_performance_checks(200, "sqruff_results.csv", ["cargo", "run", "lint"])
print("running sqlfluff performance checks")
run_performance_checks(200, "sqlfluff_results.csv", ["sqlfluff", "lint"])
It only runs up to a file size of 1208 lines, but that is enough to see that the performance of sqruff starts lagging behind the performance of sqlfluff (at least on my machine). Let me know if it doesn't work for some reason.
I have tested this by generating sql files with an increasing number of lines, linting each file and measuring the time it takes.
Here is the result in a graph form:
In case you would like to try and replicate this, here is my config file:
Here is the final file with 608 lines: test_sql.txt (I couldn't upload a .sql file here for some reason)
Here is the python script I used to generate the different files: