Major themes: the CSV -> JSON conversion and the JSON -> database ingestion.

CSV -> JSON conversion

For the sample 192MB file now included in a benchmark, these changes brought the CSV -> JSON conversion down from 19s to 8.7s.

JSON -> database ingestion

The other big chunk of code I need to look at is the SQL ingestion in sql.go.

Major changes:

Streaming JSON rows into the database rather than loading all rows into memory first.
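To make that concrete, here is a minimal sketch of streaming ingestion, not dsq's actual code: it assumes rows arrive as a JSON array of flat objects, a known column list, and SQLite via the github.com/mattn/go-sqlite3 driver; the function and table names are hypothetical.

package main

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"strings"

	_ "github.com/mattn/go-sqlite3"
)

// ingest decodes one row object at a time and writes it through a single
// prepared INSERT inside one transaction, so memory stays flat no matter
// how many rows the input holds.
func ingest(r io.Reader, db *sql.DB, table string, columns []string) error {
	dec := json.NewDecoder(r)

	// Consume the array's opening '[' without buffering the elements.
	if _, err := dec.Token(); err != nil {
		return err
	}

	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless after a successful Commit

	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(columns)), ",")
	stmt, err := tx.Prepare(fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s)",
		table, strings.Join(columns, ", "), placeholders))
	if err != nil {
		return err
	}
	defer stmt.Close()

	// More reports whether another array element follows, so only one row
	// is ever held in memory at a time.
	for dec.More() {
		var row map[string]any
		if err := dec.Decode(&row); err != nil {
			return err
		}
		args := make([]any, len(columns))
		for i, c := range columns {
			args[i] = row[c]
		}
		if _, err := stmt.Exec(args...); err != nil {
			return err
		}
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec("CREATE TABLE t (passenger_count TEXT, total_amount REAL)"); err != nil {
		log.Fatal(err)
	}
	rows := strings.NewReader(`[{"passenger_count":"1","total_amount":17.6},
{"passenger_count":"2","total_amount":18.1}]`)
	if err := ingest(rows, db, "t", []string{"passenger_count", "total_amount"}); err != nil {
		log.Fatal(err)
	}
}

The single transaction and prepared statement amortize per-row cost; the json.Decoder token loop removes the up-front decode of the whole array.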
This plus the above CSV -> JSON change brings the octosql benchmark down from 69s to 41s:
$ time ./dsq yellow_tripdata_2021-04.csv 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM {} GROUP BY passenger_count'
[{"COUNT(*)":128020,"AVG(total_amount)":32.23715114825533,"passenger_count":""},
{"COUNT(*)":42228,"AVG(total_amount)":17.021401676615067,"passenger_count":"0"},
{"COUNT(*)":1533197,"AVG(total_amount)":17.641883306799908,"passenger_count":"1"},
{"COUNT(*)":286461,"AVG(total_amount)":18.097587071145647,"passenger_count":"2"},
{"COUNT(*)":72852,"AVG(total_amount)":17.915395871092315,"passenger_count":"3"},
{"COUNT(*)":25510,"AVG(total_amount)":18.452774990196012,"passenger_count":"4"},
{"COUNT(*)":50291,"AVG(total_amount)":17.270924817567234,"passenger_count":"5"},
{"COUNT(*)":32623,"AVG(total_amount)":17.600296416636713,"passenger_count":"6"},
{"COUNT(*)":2,"AVG(total_amount)":87.17,"passenger_count":"7"},
{"COUNT(*)":2,"AVG(total_amount)":95.705,"passenger_count":"8"},
{"COUNT(*)":1,"AVG(total_amount)":113.6,"passenger_count":"9"}]
./dsq yellow_tripdata_2021-04.csv 83.04s user 4.22s system 210% cpu 41.356 total
To run a standalone benchmark and generate a CPU profile:

BENCHMARK=true go test -cpuprofile cpu.prof -bench . -run Test_sqlIngest

This writes cpu.prof, which you can then load into https://www.speedscope.app/.
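In case the environment-variable gate is unfamiliar, here is a hypothetical sketch of its shape; dsq's real Test_sqlIngest differs, and the names and messages here are illustrative:

package main

import (
	"os"
	"testing"
	"time"
)

// Test_sqlIngest is gated behind BENCHMARK so a plain `go test` run
// stays fast and only opts into the slow ingestion benchmark on demand.
func Test_sqlIngest(t *testing.T) {
	if os.Getenv("BENCHMARK") != "true" {
		t.Skip("set BENCHMARK=true to run the ingestion benchmark")
	}

	start := time.Now()
	// ... run the ingestion under test here, e.g. against the sample file ...
	t.Logf("ingestion took %s", time.Since(start))
}

speedscope understands Go's pprof format, so cpu.prof should load there directly.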