Closed jpmckinney closed 6 months ago
Actually, without --ndjson, but with a non-array input, it still succeeds with threads 1 and 2 (even though it should presumably fail for the provided reason), but fails otherwise.
Edit: And if I use the array example without --ndjson:
[{
"id": 1,
"title": "A Film",
"type": "film"
},
{
"id": 2,
"title": "A Game",
"type": "game"
}]
I get the same behavior re: thread count.
Using --json-stream instead of --ndjson yields the same results.
@jpmckinney
Yes, it fails on datasets where the number of rows is less than the number of threads, i.e. when a thread will in effect have nothing to do. I have been aware of this, but have yet to find an elegant solution or a better failure mode. The use of threads is only really beneficial for larger datasets, and is slower for such small datasets, so I hoped no one would spot it! I should at least document this, though. Thanks for raising the issue.
I have added a comment in the docs about this and may investigate a way to avoid this error in future.
Hmm, in our case we were always setting threads=0, but occasionally there's a tiny dataset that fails due to this issue on a 16-CPU server. I guess I'll count up to num-cpu JSON objects before setting threads.
@jpmckinney I understand. It is surprisingly tricky to fix without either a lot of special casing for this issue or compromising the check that determines whether a single-threaded run returns no results.
Something like the following could be used to set the number of threads:
num_threads = len(flatterer.flatten('t', preview=16, dataframe=True, ndjson=True)['data']['main'])
This will only read the first 16 lines and will be pretty quick.
Yes, since I'm working with JSON Lines, I just open the file and count lines, breaking once the number of CPUs is reached.
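For anyone else hitting this, here is a minimal sketch of that line-counting approach. It assumes a JSON Lines file with one object per non-blank line; the function name `capped_thread_count` and its `max_threads` parameter are hypothetical names used for illustration and are not part of flatterer.

```python
import os


def capped_thread_count(path, max_threads=None):
    """Count non-blank lines in a JSON Lines file, stopping at max_threads.

    Hypothetical helper, not part of flatterer. Returns a value between
    1 and max_threads, so the thread count never exceeds the row count.
    """
    if max_threads is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        max_threads = os.cpu_count() or 1

    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                count += 1
                if count >= max_threads:
                    break  # no need to read the rest of the file

    # Always return at least 1, even for an empty file.
    return max(count, 1)
```

The result can then be passed as the threads value in place of 0. Because the loop stops as soon as the CPU count is reached, the cost on large files should be negligible.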
I think this was fixed with #52 as I just tested with lots of threads and few rows and it appears to work for ndjson, json-stream and when selecting a path.