noborus / trdsql

CLI tool that can execute SQL queries on CSV, LTSV, JSON, YAML and TBLN. Can output to various formats.
https://noborus.github.io/trdsql/
MIT License

Querying a set of CSVs with a lot of entries does not output any JSON #101

Closed: sbeaupre closed this issue 4 years ago

sbeaupre commented 4 years ago

I am working on a medium-large open data set of Belgian companies, which is freely downloadable at https://kbopub.economie.fgov.be/kbo-open-data/login

I am trying to execute the following query to denormalize the data into one file. It works fine when the output target is CSV, but nothing happens when outputting JSON:

docker run --rm -it -v $(pwd):/tmp noborus/trdsql -ojson -ih " \
SELECT * FROM /tmp/enterprise.csv as e \
LEFT JOIN /tmp/address.csv as a ON (e.EnterpriseNumber=a.EntityNumber) \
LEFT JOIN /tmp/contact.csv as c ON (e.EnterpriseNumber=c.EntityNumber) \
LEFT JOIN /tmp/denomination.csv as d ON (e.EnterpriseNumber=d.EntityNumber) \
";

Any ideas?

noborus commented 4 years ago

I need some more information.

Does trdsql exit without outputting anything at all (no errors, etc.) when the output format is JSON? Can you output both CSV and JSON if you add a SQL LIMIT to the query?
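
For example, the same query as above with a LIMIT appended (just a quick sketch reusing the same files, to check whether the row count is what triggers it):

docker run --rm -it -v $(pwd):/tmp noborus/trdsql -ojson -ih " \
SELECT * FROM /tmp/enterprise.csv as e \
LEFT JOIN /tmp/address.csv as a ON (e.EnterpriseNumber=a.EntityNumber) \
LEFT JOIN /tmp/contact.csv as c ON (e.EnterpriseNumber=c.EntityNumber) \
LEFT JOIN /tmp/denomination.csv as d ON (e.EnterpriseNumber=d.EntityNumber) \
LIMIT 10 \
"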

sbeaupre commented 4 years ago

CSV output works fine, with or without LIMIT.

JSON output works with a LIMIT (e.g. 10), but not without one.

JSON without LIMIT gives no output at all, not even error messages. I also tried the debug flag, but no extra information is returned.

noborus commented 4 years ago

Hmm... I actually downloaded the data and tried it. The output seems to have been produced successfully:

docker run --rm -it -v $(pwd):/tmp noborus/trdsql -ojson -ih " \ 
SELECT * FROM /tmp/enterprise.csv as e \
LEFT JOIN /tmp/address.csv as a ON (e.EnterpriseNumber=a.EntityNumber) \
LEFT JOIN /tmp/contact.csv as c ON (e.EnterpriseNumber=c.EntityNumber) \
LEFT JOIN /tmp/denomination.csv as d ON (e.EnterpriseNumber=d.EntityNumber) \
" >tmp.json
ls -alFh tmp.json
-rw-r--r-- 1 noborus noborus 1.4G Nov 25 13:48 tmp.json

sbeaupre commented 4 years ago

I saw that you are running on Linux and realized that it is a memory issue.

From a plain install on macOS it works, and more than 12 GB of memory is consumed to produce the JSON. Running inside Docker with more resources also works fine. Strange that Docker gave no error message.

Looking at the code (I'm not a Go dev myself), I suppose you are building everything in memory first and then writing it to disk, rather than streaming the data?

noborus commented 4 years ago

> I saw that you are running on Linux and realized that it is a memory issue.
>
> From a plain install on macOS it works, and more than 12 GB of memory is consumed to produce the JSON. Running inside Docker with more resources also works fine. Strange that Docker gave no error message.

I see. Understood.

> Looking at the code (I'm not a Go dev myself), I suppose you are building everything in memory first and then writing it to disk, rather than streaming the data?

Yes. Since the output is currently a single JSON array, everything is held in memory before it is written.

For example, the output looks like this:

[{"id":"1","name":"Orange"},
 {"id":"2","name":"Melon"},
 {"id":"3","name":"Apple"}]

For large outputs, you would need an output format that is not an array (I don't know what to call this), with one object per line:

{"id":"1","name":"Orange"}
{"id":"2","name":"Melon"}
{"id":"3","name":"Apple"}

This is not supported yet. I will consider it from now on.

noborus commented 4 years ago

I added a new JSONL output option (#103). This format can reduce memory consumption.
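
A minimal usage sketch, assuming the new option is exposed as -ojsonl analogous to -ojson (see #103 for the exact flag name):

docker run --rm -it -v $(pwd):/tmp noborus/trdsql -ojsonl -ih " \
SELECT * FROM /tmp/enterprise.csv as e \
LEFT JOIN /tmp/address.csv as a ON (e.EnterpriseNumber=a.EntityNumber) \
LEFT JOIN /tmp/contact.csv as c ON (e.EnterpriseNumber=c.EntityNumber) \
LEFT JOIN /tmp/denomination.csv as d ON (e.EnterpriseNumber=d.EntityNumber) \
" > tmp.jsonl

Since each row is written as a complete JSON object on its own line, the whole result no longer has to be buffered as a single array before anything is written.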