iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

Smarter way to count branches #62

Closed gordonwatts closed 6 months ago

gordonwatts commented 6 months ago

On a 2-file run this goes from 2230 tasks to 2228. Hmmm...

Fixes #48

alexander-held commented 6 months ago

After the changes, I would expect the numbers reported in the dashboard to be roughly the number of files times the average number of steps taken per file (a factor of about 2 in the materialize_branches notebook, I believe). I saw a drastic reduction of almost two orders of magnitude. Example graphs before and after are in https://github.com/iris-hep/idap-200gbps/pull/7, which shows the CMS version, but that behaves the same. What does .visualize(optimize_graph=True) on the graph look like in this case?

gordonwatts commented 6 months ago

Ok - here is the graph before @alexander-held's optimization trick:

image

And after:

image

gordonwatts commented 6 months ago

So, this is doing what we expect. I was fooled because I was using len(total_count.dask) to count tasks, and that reports:

Before Optimization:

0003.5013 - INFO - Number of tasks in the dask graph: 172

After Optimization:

0003.5437 - INFO - Number of tasks in the dask graph: 118

So that looks like it should be 15, not 118. In short - I do not understand what len(total_count.dask) is counting.
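
For reference, a minimal sketch of one way to compare the two counts. It assumes a plain dask.array collection as a hypothetical stand-in for total_count (the real object in this PR is built from PHYSLITE files and is not reproduced here), and it uses dask.optimize to get a collection that carries the optimized graph. It only shows how to obtain both numbers; it does not explain the 118 vs 15 discrepancy above.

```python
import dask
import dask.array as da

# Hypothetical stand-in for total_count: a small chunked reduction.
x = da.ones(1000, chunks=100)
total = ((x + 1) * 2).sum()


def n_tasks(collection):
    """Number of keys in the graph attached to a dask collection."""
    return len(collection.dask)


# Graph as built, before any optimization pass has run.
n_before = n_tasks(total)

# dask.optimize returns a tuple of new collections whose graphs have
# been optimized; the original collection is left unchanged.
(optimized,) = dask.optimize(total)
n_after = n_tasks(optimized)

print(f"Number of tasks before optimization: {n_before}")
print(f"Number of tasks after optimization:  {n_after}")
```

The same optimized graph is what .visualize(optimize_graph=True) draws, so counting on the output of dask.optimize should be closer to what the picture shows than counting on the original collection.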

gordonwatts commented 6 months ago

See issue #65 for the follow-up on the task count (optimized graph vs non-optimized?).