multiprocessio/dsq
Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.
Performance ideas #72
Open
eatonphil opened 2 years ago
Catchall for now for potential improvements to datastation/dsq.
SQL pre-processing
Import only used fields (see #71)
Do pre-filtering of data in SQLiteWriter, only insert things that match the WHERE clause
Support more input types using SQLiteWriter; this basically requires supporting expanded nested objects (see notes in #67)
Maybe handle JSONL in parallel, since a newline can never appear inside an individual JSON record
Get rid of map[string]any inside datastation
At the very least, put WriteRecord into the ResultWriter interface so that SQLiteWriter can avoid map[string]any, which it converts from anyway
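One way the WriteRecord idea could look, as a sketch only. The method signatures below are assumptions and are not taken from datastation's actual ResultWriter or SQLiteWriter; `positionalWriter` is a hypothetical stand-in for a writer that stores rows by column position. The point is that a reader that already has ordered values (a CSV reader, say) can call WriteRecord directly and never build a map.

```go
package main

import (
	"fmt"
	"sort"
)

// ResultWriter here is a hypothetical shape, not datastation's real interface.
type ResultWriter interface {
	// WriteRow is the map-based path.
	WriteRow(row map[string]any) error
	// WriteRecord is the positional path: values[i] belongs to fields[i].
	WriteRecord(fields []string, values []any) error
}

// positionalWriter is a stand-in for a writer such as SQLiteWriter that
// stores rows by column position internally.
type positionalWriter struct {
	fields []string
	rows   [][]any
}

// WriteRecord stores the values as-is; no map is allocated.
// For brevity it does not check that later calls use the same fields.
func (w *positionalWriter) WriteRecord(fields []string, values []any) error {
	if len(fields) != len(values) {
		return fmt.Errorf("got %d fields but %d values", len(fields), len(values))
	}
	if w.fields == nil {
		w.fields = append([]string(nil), fields...)
	}
	w.rows = append(w.rows, append([]any(nil), values...))
	return nil
}

// WriteRow is the slower path: the map has to be flattened into column order.
func (w *positionalWriter) WriteRow(row map[string]any) error {
	if w.fields == nil {
		// Maps are unordered, so pick a deterministic column order.
		for k := range row {
			w.fields = append(w.fields, k)
		}
		sort.Strings(w.fields)
	}
	values := make([]any, len(w.fields))
	for i, f := range w.fields {
		values[i] = row[f]
	}
	return w.WriteRecord(w.fields, values)
}

func main() {
	var w ResultWriter = &positionalWriter{}
	// Fast path: a CSV-style reader already has ordered values.
	if err := w.WriteRecord([]string{"id", "name"}, []any{1, "a"}); err != nil {
		panic(err)
	}
	// Slow path: a map-producing reader.
	if err := w.WriteRow(map[string]any{"id": 2, "name": "b"}); err != nil {
		panic(err)
	}
	fmt.Println("rows written:", len(w.(*positionalWriter).rows))
}
```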
CSV parser improvements
Find a simdcsv Go implementation (https://github.com/minio/simdcsv is abandoned) or write a wrapper to https://github.com/geofflangdale/simdcsv
Maybe an easier first step: write a parser that handles CSVs with no quotes and falls back to encoding/csv otherwise
Or actually investigate why encoding/csv is slow
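The no-quotes fast path could be sketched roughly as below. This is an illustration under stated assumptions, not dsq's parser: `parseCSV` is a hypothetical name, the whole input is assumed to fit in memory, and the delimiter is assumed to be a comma. If the input contains no double quote, every newline ends a record and every comma ends a field, so plain splitting is enough; otherwise the sketch defers to encoding/csv.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// parseCSV is a hypothetical helper (not part of dsq/datastation).
// Unlike encoding/csv, the fast path does not check that every record
// has the same number of fields.
func parseCSV(data []byte) ([][]string, error) {
	// Any quote means fields may contain commas or newlines: use the full parser.
	if bytes.IndexByte(data, '"') >= 0 {
		return csv.NewReader(bytes.NewReader(data)).ReadAll()
	}

	var records [][]string
	for len(data) > 0 {
		line := data
		if i := bytes.IndexByte(data, '\n'); i >= 0 {
			line, data = data[:i], data[i+1:]
		} else {
			data = nil // last line without a trailing newline
		}
		line = bytes.TrimSuffix(line, []byte("\r")) // tolerate CRLF
		if len(line) == 0 {
			continue // encoding/csv also skips empty lines
		}
		records = append(records, strings.Split(string(line), ","))
	}
	return records, nil
}

func main() {
	plain, err := parseCSV([]byte("name,age\nann,30\n"))
	if err != nil {
		panic(err)
	}
	quoted, err := parseCSV([]byte("name,note\nann,\"likes a, b\"\n"))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(plain), len(quoted))
}
```

Whether this beats encoding/csv in practice would need measuring; the quote scan itself is a full pass over the input.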
Add benchmarks for every file format, not just CSV. Basically, every file format needs to be worked on individually.