kindly / flatterer

Opinionated JSON to CSV/XLSX/SQLITE/PARQUET converter. Flattens JSON fast.
https://flatterer.opendata.coop
MIT License

Flatterer fails if the number of threads used is greater than the number of rows in the data. #46

Closed jpmckinney closed 6 months ago

jpmckinney commented 1 year ago
echo '{"foo":"bar"}
{"foo":"bar"}' > t
flatterer --ndjson --force --threads 0 t out # fails
flatterer --ndjson --force --threads 1 t out # OK
flatterer --ndjson --force --threads 2 t out # OK
flatterer --ndjson --force --threads 3 t out # fails
RuntimeError: Error: The JSON provided as input is not an array of objects
jpmckinney commented 1 year ago

Actually, without --ndjson, but with a non-array input, it still succeeds with threads 1 and 2 (even though it should presumably fail for the provided reason), but fails otherwise.

Edit: And if I use the array example without --ndjson:

[{
  "id": 1,
  "title": "A Film",
  "type": "film"
},
{
  "id": 2,
  "title": "A Game",
  "type": "game"
}]

I get the same behavior re: thread count.

jpmckinney commented 1 year ago

Using --json-stream instead of --ndjson yields the same results.

kindly commented 1 year ago

@jpmckinney

Yes, it fails on datasets where the number of rows is less than the number of threads, i.e. when a thread in effect has nothing to do. I have been aware of this, but have yet to find an elegant solution or a better failure mode. The use of threads is only really beneficial for larger datasets, and will be slower for such small datasets, so I hoped no one would spot it! I should at least document this, though. Thanks for raising the issue.

kindly commented 1 year ago

I have added a comment in the docs about this and may investigate a way to avoid this error in future.

jpmckinney commented 1 year ago

Hmm, in our case we were always setting threads=0, but occasionally there's a tiny dataset that fails due to this issue on a 16-CPU server. I guess I'll count up to num-cpu JSON objects before setting threads.

kindly commented 1 year ago

@jpmckinney I understand. It is surprisingly tricky to fix without either a lot of special casing for this issue or compromising the check that determines if a single-threaded run returns no results.

Something like the following could be used to set the number of threads:

num_threads = len(flatterer.flatten('t', preview=16, dataframe=True, ndjson=True)['data']['main'])

This will only read the first 16 lines and will be pretty quick.

jpmckinney commented 1 year ago

Yes, since I'm working with JSON Lines, I just open the file and count lines, breaking once the number of CPUs is reached.
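
For anyone hitting the same problem, here is a minimal sketch of the line-counting workaround described above. It is not code from flatterer or from this thread: the function name pick_thread_count and its max_threads parameter are hypothetical, and it assumes the input is JSON Lines with one object per non-empty line.

```python
import os


def pick_thread_count(path, max_threads=None):
    """Return a thread count no larger than the number of rows in a JSON Lines file.

    Hypothetical helper: stops reading as soon as max_threads rows are seen,
    so large files are not scanned in full.
    """
    if max_threads is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        max_threads = os.cpu_count() or 1

    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # ignore blank lines
                count += 1
                if count >= max_threads:
                    break

    # Always use at least one thread, even for an empty file.
    return max(count, 1)
```

The returned value could then be passed as the --threads value instead of 0, so a two-row file on a 16-CPU server would run with 2 threads rather than 16.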

kindly commented 6 months ago

I think this was fixed with #52 as I just tested with lots of threads and few rows and it appears to work for ndjson, json-stream and when selecting a path.