Open jamesprinc3 opened 4 years ago
After more digging I think I understand what's going on a little better. The partition filter is resolved higher up the call stack (i.e. not on the worker), and at the moment only one level of partitioning is supported:
I think I've fallen into an interesting edge case where I'm partitioning by 2 columns but only filtering on one of them.
Seems like a trivial fix to change line 535 mentioned in the comment above to read:
if set(predicate_fields).issubset(dataset.partitions.partition_names):
It works for my example, but this feels far too easy.
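To make the difference concrete, here is a minimal standalone sketch of the two checks. It does not touch petastorm itself: the helper name can_prune_by_partition, the strict flag, and the year/month column names are all hypothetical, and the "strict" branch assumes the existing check requires the predicate fields to match the partition names exactly, which I have not confirmed against the source.

```python
def can_prune_by_partition(predicate_fields, partition_names, strict=True):
    """Decide whether a predicate only touches partition columns.

    Hypothetical helper for illustration; not part of petastorm.
    strict=True  -> assumed current behaviour: fields must equal the partition names.
    strict=False -> proposed behaviour: fields only need to be a subset.
    """
    fields = set(predicate_fields)
    names = set(partition_names)
    if strict:
        return fields == names
    return fields.issubset(names)


# Dataset partitioned by two columns, predicate filtering on only one of them.
partition_names = {"year", "month"}
predicate_fields = ["year"]

strict_result = can_prune_by_partition(predicate_fields, partition_names, strict=True)
subset_result = can_prune_by_partition(predicate_fields, partition_names, strict=False)

# A predicate on a non-partition column should be rejected by both checks.
mixed_result = can_prune_by_partition(["year", "value"], partition_names, strict=False)

print(strict_result, subset_result, mixed_result)
```

With the subset check, a predicate on just one of the two partition columns is still recognised as a partition-only predicate, while a predicate that also touches a regular column is not.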
I've opened this PR as a starting point: https://github.com/uber/petastorm/pull/488
I've not been able to get the relevant tests to run locally yet, so maybe Travis will give me some feedback in the meantime.
Hello,
I spotted an error when running some code which I've managed to reproduce by modifying one of the petastorm tests:
The error (note the line numbers are a little different because I've added some printlns whilst debugging):
I've logged out the values of num_partitions and num_rows; the latter seems to be the suspect which is causing the division by zero error.
I've had a look through the code in py_dict_reader_worker.py but I'm not particularly familiar with a lot of the petastorm APIs. I'm hoping someone might have seen something similar before, which will make it easier to get a fix out.
Versions:
pyarrow==0.15.1
petastorm==0.8.2