alexander-held opened this issue 6 months ago
A self-contained reproducer can be found at https://gist.github.com/alexander-held/8af116d93e936c5930648f1dea4fb02b.
Follow-up in https://github.com/CoffeaTeam/coffea/issues/1073.
Reading through the bug reports over on the coffea side, it feels like this is going to take a while to resolve, so fixing it on our end will be blocked for some time.
My current assumption is that the reported values are correct and we just end up reading some information multiple times. That is inefficient and should be resolved, but for the purpose of evaluating our metric of data being read and arriving at a CPU for processing, I believe it still tells us the right thing.
This presumably has some impact on #26: from some very rough comparisons, it seems like we end up reading roughly 50% more than we strictly need, which is a lot of duplication. This artificially inflates our "fraction of file read" when defined as "number of bytes read out of this file / file size" (what we currently use to calculate the throughput metric with coffea), but it does not affect the metric "number of unique bytes read / file size" (which should be closer to how we originally built the list of branches for the 25% target).
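For illustration, here is a minimal sketch of the two ways of counting (not the actual coffea/uproot bookkeeping), assuming we have a list of `(start, stop)` byte ranges requested from a file:

```python
def read_fractions(ranges, file_size):
    """Return (total_fraction, unique_fraction) for a list of (start, stop) byte ranges."""
    # current metric: every requested byte counts, duplicates included
    total_bytes = sum(stop - start for start, stop in ranges)

    # alternative metric: merge overlapping/duplicate ranges so each byte counts once
    unique_bytes = 0
    current_start, current_stop = None, None
    for start, stop in sorted(ranges):
        if current_stop is None or start > current_stop:
            if current_stop is not None:
                unique_bytes += current_stop - current_start
            current_start, current_stop = start, stop
        else:
            current_stop = max(current_stop, stop)
    if current_stop is not None:
        unique_bytes += current_stop - current_start

    return total_bytes / file_size, unique_bytes / file_size


# example: the second request re-reads part of the first one
print(read_fractions([(0, 100), (50, 150), (200, 250)], 1000))  # (0.25, 0.2)
```

With duplicate reads, the first number is inflated while the second stays put, which is the gap described above.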
Ideally we avoid duplicate reading and resolve that difference; otherwise we need to think a bit about how to present the results.
This was observed in the new `materialize_branches` notebook following #17. Prior to that update (which changed the branches being read), the data sizes being read looked comparable.