iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

Understand and Fix DASK Crash #87

Closed: gordonwatts closed this issue 4 months ago

gordonwatts commented 4 months ago

The following crash occurs when we run on any large-ish data set with a cut:

0953.0512 - INFO - root - Computing the total count
Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 330, in <module>
    main(
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 163, in main
    r = total_count.compute()  # type: ignore
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1314, in __call__
    (result, counters), duration = with_duration(self._call_impl)(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1152, in wrapper
    result = f(*args, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1296, in _call_impl
    return self.read_tree(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 983, in read_tree
    mapping = self.form_mapping_info.load_buffers(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 906, in load_buffers
    arrays = tree.arrays(
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 823, in arrays
    _ranges_or_baskets_to_arrays(
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 2993, in _ranges_or_baskets_to_arrays
    branchid_to_branch[cache_key]._awkward_check(interpretation)
KeyError: 'b49ef7c0-08ff-11ef-b68d-1738a8c0beef:/atlas_xaod_tree;1:jet_EnergyPerSampling(6)'
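For context, the crash fires inside uproot's dask reading path once total_count.compute() starts evaluating partitions. A minimal sketch of the kind of pipeline involved (the file name and steps_per_file value are placeholders, not the script's actual settings; the tree and branch names come from the traceback above):

    import awkward as ak
    import uproot

    # Open the tree lazily as a dask-awkward array. steps_per_file
    # controls how many partitions each input file is split into.
    events = uproot.dask(
        {"physlite_output.root": "atlas_xaod_tree"},  # placeholder file
        steps_per_file=20,  # assumed cluster-friendly value
    )

    # Materialize one branch and count its values. .compute() triggers
    # the actual reads, which is where the KeyError above is raised.
    total_count = ak.count(events["jet_EnergyPerSampling"], axis=None)
    print(total_count.compute())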
gordonwatts commented 4 months ago

Repro:

    python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-f28f74d7-a.af-jupyter:8786' --dask-profile --num-files 0 --dataset data_special --ignore-cache --query xaod_medium
gordonwatts commented 4 months ago

The problem is steps_per_file. For running on a cluster it is set to a large number, but with tight cuts some output files are now too small to fill that many steps - I assume that produces a zero-length block, and uproot breaks when that happens.
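To illustrate the failure mode (a sketch under assumed names and values, not the project's code): with steps_per_file fixed at a cluster-friendly value, a post-cut file holding only a handful of events gets split into more partitions than it has entries, so some partitions come out empty.

    import uproot

    # Sketch: a small post-cut file split into many steps. If the file
    # holds fewer entries than steps_per_file, some partitions end up
    # zero length, which is what (I assume) trips uproot here.
    events = uproot.dask(
        {"small_post_cut_file.root": "atlas_xaod_tree"},  # placeholder
        steps_per_file=20,  # large value tuned for cluster throughput
    )
    print(events.npartitions)  # 20, even if the file has < 20 entries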

gordonwatts commented 4 months ago

So this code adjusts steps_per_file to 1 for the tight query and 2 for the medium one.
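A hedged sketch of what that adjustment could look like (the helper and the tight-query label are illustrative assumptions; only xaod_medium appears in the repro command above):

    # Illustrative only: choose steps_per_file to match how hard the
    # query cuts, so no file is split into empty partitions.
    def steps_per_file_for(query: str) -> int:
        if query == "xaod_tight":   # assumed label for the tight query
            return 1
        if query == "xaod_medium":  # matches the --query flag in the repro
            return 2
        return 20                   # hypothetical cluster default

With steps_per_file capped at 1 or 2 for the heavily cut queries, even a very small file keeps every partition non-empty.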