Closed by voganrc 2 years ago
Merging #621 (4fb6b1a) into master (f8c427c) will not change coverage. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #621 +/- ##
=======================================
Coverage 85.31% 85.31%
=======================================
Files 85 85
Lines 4929 4929
Branches 783 783
=======================================
Hits 4205 4205
Misses 584 584
Partials 140 140
Impacted Files | Coverage Δ | |
---|---|---|
petastorm/reader.py | 89.32% <100.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f8c427c...4fb6b1a.
I am not sure this PR fixes all sources of row-order nondeterminism. Once workers_count>1, there would be a race between workers, resulting in a nondeterministic row order. It should be possible to add a reordering buffer to make sure the order of rows is deterministic (I think pytorch does something similar with its DataLoader), but that would be a somewhat larger effort. I am ok with landing this PR to add some more stability to the order.
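To make the reordering-buffer idea concrete, here is a minimal sketch in plain Python. It is not petastorm code: the `reorder` function and the `(sequence_number, payload)` tagging scheme are assumptions made for illustration. It assumes each work item is given a unique, consecutive sequence number (starting at 0) before being handed to a worker, and that workers return results in arbitrary order.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def reorder(results: Iterable[Tuple[int, Any]]) -> Iterator[Any]:
    """Yield payloads in ascending sequence order.

    `results` is a hypothetical stream of (sequence_number, payload) pairs
    arriving in arbitrary order. Sequence numbers are assumed to be unique
    and consecutive, starting at 0.
    """
    pending: Dict[int, Any] = {}  # results that arrived ahead of their turn
    next_seq = 0                  # sequence number we must emit next
    for seq, payload in results:
        pending[seq] = payload
        # Flush every buffered result that is now in order.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


# Simulated out-of-order completion by several workers.
out_of_order = [(2, "c"), (0, "a"), (1, "b"), (3, "d")]
in_order = list(reorder(out_of_order))
```

The trade-off is memory and latency: a slow worker holding an early sequence number stalls the output while later results pile up in the buffer, so a real implementation would probably want to bound the number of in-flight items.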
Hi. Do you plan to keep working on this PR, or should we close it?
Ok, I'll close it
I have the following script that converts a Spark DataFrame into a TensorFlow Dataset:
Sometimes it outputs:
But other times it outputs:
I've found that the first happens when `spark_converter.file_urls` is:
And the second happens when `spark_converter.file_urls` is:
Sorting the file URLs fixes this, though, and results in a deterministic read order.
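As a rough illustration of why sorting helps, the sketch below uses two made-up listings of the same files in different orders. The URLs and the `deterministic_file_order` helper are hypothetical and are not taken from the PR diff. Sorting only removes the variation that comes from listing order; it does not address races between multiple workers.

```python
from typing import List


def deterministic_file_order(file_urls: List[str]) -> List[str]:
    """Return the file URLs in a stable, lexicographically sorted order."""
    return sorted(file_urls)


# Hypothetical example: the same two files, listed in different orders
# on two separate runs.
listing_run_1 = [
    "file:///tmp/example/part-00001.parquet",
    "file:///tmp/example/part-00000.parquet",
]
listing_run_2 = list(reversed(listing_run_1))

# After sorting, both runs would read the files in the same order.
order_run_1 = deterministic_file_order(listing_run_1)
order_run_2 = deterministic_file_order(listing_run_2)
```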