alexander-held opened this issue 6 months ago
A self-contained reproducer can be found at https://gist.github.com/alexander-held/8af116d93e936c5930648f1dea4fb02b.
Follow-up in https://github.com/CoffeaTeam/coffea/issues/1073.
Reading through the bug reports over on the coffea side, it feels like this is going to take a while to resolve, so fixing it on our end will be blocked for some time.
My current assumption is that the reported values are correct and we just end up reading some information multiple times. That is inefficient and should be resolved, but for the purpose of evaluating our metric of data being read and arriving at a CPU for processing, I believe it still tells us the right thing.
This presumably has some impact on #26: from some very rough comparisons, it seems like we end up reading roughly 50% more than we strictly need, which is a lot of duplication. This artificially inflates our "fraction of file read" when defined as "number of bytes read out of this file / file size" (what we currently use to calculate the throughput metric with coffea), but it does not affect the metric "number of unique bytes read / file size" (which should be closer to how we originally built the list of branches for the 25% target).
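For illustration, here is a minimal sketch of the two ways of counting (not the actual coffea/uproot bookkeeping), assuming we have a list of `(start, stop)` byte ranges requested from a file:

```python
def read_fractions(ranges, file_size):
    """Return (total_fraction, unique_fraction) for a list of (start, stop) byte ranges."""
    # current metric: every requested byte counts, duplicates included
    total_bytes = sum(stop - start for start, stop in ranges)

    # alternative metric: merge overlapping/duplicate ranges so each byte counts once
    unique_bytes = 0
    current_start, current_stop = None, None
    for start, stop in sorted(ranges):
        if current_stop is None or start > current_stop:
            if current_stop is not None:
                unique_bytes += current_stop - current_start
            current_start, current_stop = start, stop
        else:
            current_stop = max(current_stop, stop)
    if current_stop is not None:
        unique_bytes += current_stop - current_start

    return total_bytes / file_size, unique_bytes / file_size


# example: the second request re-reads part of the first one
print(read_fractions([(0, 100), (50, 150), (200, 250)], 1000))  # (0.25, 0.2)
```

With duplicate reads, the first number is inflated while the second stays put, which is the gap described above.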
Ideally we avoid duplicate reading and resolve that difference; otherwise we need to think a bit about how to present the results.
This was observed in the new `materialize_branches` notebook following #17. Prior to that update (which changed the branches being read), the data sizes being read looked comparable.