iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

Fix up how we do the non-zero count so DASK can better optimize #48

Closed: gordonwatts closed this issue 6 months ago

gordonwatts commented 6 months ago

See https://github.com/dask-contrib/dask-awkward/issues/499#issuecomment-2063241077 for more information. In short: compute the non-zero count with axis=1 for each file/sample. This should substantially reduce the number of tasks.

gordonwatts commented 6 months ago

Hoping @alexander-held does this for the uproot reading version so we can copy.

alexander-held commented 6 months ago

You can track the progress of that in #42 as well.

alexander-held commented 6 months ago

See #50 for the solution implemented for coffea.

alexander-held commented 6 months ago

Another example is in https://github.com/iris-hep/idap-200gbps/pull/7/files. A similar strategy should work with uproot.dask for the ServiceX use case.

gordonwatts commented 6 months ago

When we run our small test locally, DASK reports 2230 scheduled tasks (this is before making any of these modifications).

gordonwatts commented 6 months ago

Hmmm - with this new method there are 2228 tasks instead of 2230. That seems like a very small shrinkage!
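For comparing task counts like these without the dashboard, one option is to ask the collection for its graph and take its length. The sketch below uses `dask.delayed` with hypothetical `load`, `count_nonzero`, and `total` helpers as a stand-in for the real workflow. It is not the code from this repository. Note that the raw graph and the optimized graph can differ in size, so it helps to be explicit about which one a number refers to.

```python
import dask
from dask import delayed


@delayed
def load(n):
    # Hypothetical partition loader: returns the values 0 .. n-1.
    return list(range(n))


@delayed
def count_nonzero(values):
    # Count the non-zero entries of one partition.
    return sum(1 for v in values if v != 0)


@delayed
def total(counts):
    # Single final reduction over the per-partition counts.
    return sum(counts)


partials = [count_nonzero(load(n)) for n in range(4)]
result = total(partials)

# Number of tasks in the graph as built.
n_tasks = len(result.__dask_graph__())

# Number of tasks after Dask's graph optimization for this collection type.
(optimized,) = dask.optimize(result)
n_optimized = len(optimized.__dask_graph__())

print(n_tasks, n_optimized, result.compute())
```

Running the same measurement before and after a change gives a like-for-like comparison of graph sizes.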