multiprocessio / datastation

App to easily query, script, and visualize data from every database, file, and API.
https://datastation.multiprocess.io
Other
2.9k stars 112 forks source link

Performance improvements to -> JSON conversion, and improvements to JSON -> database ingestion #164

Closed eatonphil closed 2 years ago

eatonphil commented 2 years ago

-> JSON conversion

Major themes:

For the sample 192MB file included in a benchmark now, this brought the csv->json conversion down from 19s to 8.7s.

The other big chunk of code I need to look at is the SQL ingestion in sql.go.

To run the benchmark and generate a cpu profile:

BENCHMARK=true go test -cpuprofile cpu.prof -bench . -run Test_transformCSV
go tool pprof cpu.prof

which you can then load into https://www.speedscope.app/

JSON -> database ingestion

Major changes:

This plus the above -> JSON change brings the octosql benchmark down from 69s to 41s:

$ time ./dsq yellow_tripdata_2021-04.csv 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM {} GROUP BY passenger_count'
[{"COUNT(*)":128020,"AVG(total_amount)":32.23715114825533,"passenger_count":""},
{"COUNT(*)":42228,"AVG(total_amount)":17.021401676615067,"passenger_count":"0"},
{"COUNT(*)":1533197,"AVG(total_amount)":17.641883306799908,"passenger_count":"1"},
{"COUNT(*)":286461,"AVG(total_amount)":18.097587071145647,"passenger_count":"2"},
{"COUNT(*)":72852,"AVG(total_amount)":17.915395871092315,"passenger_count":"3"},
{"COUNT(*)":25510,"AVG(total_amount)":18.452774990196012,"passenger_count":"4"},
{"COUNT(*)":50291,"AVG(total_amount)":17.270924817567234,"passenger_count":"5"},
{"COUNT(*)":32623,"AVG(total_amount)":17.600296416636713,"passenger_count":"6"},
{"COUNT(*)":2,"AVG(total_amount)":87.17,"passenger_count":"7"},
{"COUNT(*)":2,"AVG(total_amount)":95.705,"passenger_count":"8"},
{"COUNT(*)":1,"AVG(total_amount)":113.6,"passenger_count":"9"}]
./dsq yellow_tripdata_2021-04.csv   83.04s user 4.22s system 210% cpu 41.356 total

for a standalone benchmark:

BENCHMARK=true go test -cpuprofile cpu.prof -bench . -run Test_sqlIngest