Closed hig-dev closed 4 years ago
This is likely the result of a race between multiple reader threads. Try passing make_reader(..., workers_count=1); that should make the reading order deterministic. Unfortunately, you will get lower throughput.
This problem could be properly mitigated by adding a reordering queue to the petastorm implementation, but we do not have one right now.
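To illustrate what a reordering queue could look like, here is a minimal sketch in plain Python. It is not petastorm code and makes no claim about how petastorm would implement it: `load_row_group`, `reorder`, and `read_in_order` are hypothetical names, and a thread pool with random delays stands in for the reader workers. The idea is to tag each row group with its sequence number, let workers finish in any order, and hold back results until the next expected number arrives.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_row_group(index):
    """Hypothetical stand-in for a worker that decodes one row group."""
    time.sleep(random.uniform(0, 0.01))  # simulate variable decode time
    rows = [index * 10 + i for i in range(3)]
    return index, rows


def reorder(tagged_results):
    """Yield payloads in sequence order from (seq, payload) pairs
    that may arrive in any order."""
    pending = {}  # results that arrived ahead of their turn
    next_seq = 0
    for seq, payload in tagged_results:
        pending[seq] = payload
        # Release every result that is now contiguous with the output.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


def read_in_order(num_row_groups, workers_count=4):
    """Read row groups concurrently but emit rows in row-group order."""
    with ThreadPoolExecutor(max_workers=workers_count) as pool:
        futures = [pool.submit(load_row_group, i) for i in range(num_row_groups)]
        # as_completed yields in completion order, which is nondeterministic.
        completed = (future.result() for future in as_completed(futures))
        for rows in reorder(completed):
            yield from rows


if __name__ == "__main__":
    print(list(read_in_order(4)))
```

The trade-off in this sketch is memory: if one early row group is slow, every later result waits in `pending` until it arrives, so a real implementation would probably need to bound how far ahead the workers may run.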
Thanks for the tip. The workaround of setting workers_count=1 in make_reader did work.
I will let you decide whether you want to close this issue.
My goal is to read the created dataset in the order in which I generated the rows. However, if the row group size is set to a value lower than the total size of the dataset, the rows are read back in the wrong order, despite setting shuffle_row_groups=False.
Please look at this demonstration of this problem. I would expect that the exception does not occur.