iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

Build query library #73

Closed gordonwatts closed 6 months ago

gordonwatts commented 6 months ago

We need to have a query that does some heavy skimming of the SX query to better understand:

Do this by putting the queries in a file that can then be easily imported. To start, this is just to create, say, 3 queries for xAOD:

  1. All the data (what we do now)
  2. Jet cuts as @alexander-held recommends
  3. Super tight to really exaggerate the differences

Once this is in, then we can add even more queries (from different back-ends!).
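A minimal sketch of what such an importable query file could look like. It is plain Python and does not use the real ServiceX or func_adl API. The names `QuerySpec`, `QUERY_LIBRARY`, `get_query` and `select_jets` are hypothetical, and the cut values are placeholders, not the cuts @alexander-held recommends (those are not stated in this thread).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical jet record for illustration: {"pt": <GeV>, "eta": <value>}
Jet = Dict[str, float]


@dataclass(frozen=True)
class QuerySpec:
    """One named entry in the query library."""
    name: str
    description: str
    keep_jet: Callable[[Jet], bool]


# Placeholder cut values: assumptions for this sketch, not the recommended cuts.
QUERY_LIBRARY: Dict[str, QuerySpec] = {
    spec.name: spec
    for spec in (
        QuerySpec("all", "All the data, no jet selection",
                  lambda jet: True),
        QuerySpec("small", "Jet cuts (placeholder values)",
                  lambda jet: jet["pt"] > 25.0 and abs(jet["eta"]) < 2.5),
        QuerySpec("tight", "Super tight cuts (placeholder values)",
                  lambda jet: jet["pt"] > 200.0 and abs(jet["eta"]) < 1.0),
    )
}


def get_query(name: str) -> QuerySpec:
    """Look up a query by name, with a readable error for unknown names."""
    try:
        return QUERY_LIBRARY[name]
    except KeyError:
        raise ValueError(
            f"Unknown query {name!r}; choose from {sorted(QUERY_LIBRARY)}"
        ) from None


def select_jets(name: str, jets: List[Jet]) -> List[Jet]:
    """Apply the named query's jet selection to a list of jets."""
    spec = get_query(name)
    return [jet for jet in jets if spec.keep_jet(jet)]
```

Keeping the queries in one dictionary means a driver script can pick one by name (for example from a command-line option), and queries for other back-ends can be added later as further entries.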

gordonwatts commented 6 months ago

Running on the full 1 TB sample with the small selection takes 4:54. Running on the full 1 TB sample with the all selection takes 11:03.
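For comparison against the 200 Gbps target, the two timings can be turned into an effective input rate. This is a rough sketch that assumes the times are minutes:seconds, that "1 TB" means 10^12 bytes, and that the whole sample is read in both cases. None of these assumptions is confirmed in this thread.

```python
def parse_mm_ss(text: str) -> int:
    """Convert a 'minutes:seconds' string to seconds (assumed format)."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def gbps(sample_bytes: float, wall_seconds: float) -> float:
    """Effective input rate in gigabits per second."""
    return sample_bytes * 8 / wall_seconds / 1e9


SAMPLE_BYTES = 1e12  # assumption: 1 TB = 10**12 bytes

# Timings as quoted in the comment above.
timings = {"small": "4:54", "all": "11:03"}

rates = {name: gbps(SAMPLE_BYTES, parse_mm_ss(t)) for name, t in timings.items()}

for name, rate in rates.items():
    print(f"{name:>5}: {parse_mm_ss(timings[name])} s, {rate:.1f} Gbit/s")
```

Under those assumptions the small selection works out to roughly 27 Gbit/s and the all selection to roughly 12 Gbit/s.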

gordonwatts commented 6 months ago

Dask compute logs. First, for the all dataset:

0667.2845 - INFO - Using `uproot.dask` to open files (splitting files 2 ways).
0667.6733 - INFO - Generating the dask compute graph for 34 fields
0667.6738 - INFO - Field event_number is not a scalar field. Skipping count.
0667.6741 - INFO - Field run_number is not a scalar field. Skipping count.
0668.0005 - INFO - Number of tasks in the dask graph: optimized: 20,788 unoptimized 258,512
0668.0006 - INFO - Computing the total count
0801.3024 - INFO - Done: result = 129,913,000
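A small sketch for pulling the compute time out of a log like the one above. It assumes the leading number on each line is elapsed seconds since the script started, which is a guess from the format. The helper names `parse_log` and `find_time` are hypothetical.

```python
import re

LOG = """\
0668.0005 - INFO - Number of tasks in the dask graph: optimized: 20,788 unoptimized 258,512
0668.0006 - INFO - Computing the total count
0801.3024 - INFO - Done: result = 129,913,000
"""

LINE_RE = re.compile(r"^(\d+\.\d+) - (\w+) - (.*)$")


def parse_log(text):
    """Return (timestamp_seconds, message) for every line that matches."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(3)))
    return entries


def find_time(entries, prefix):
    """Timestamp of the first message starting with the given prefix."""
    for stamp, message in entries:
        if message.startswith(prefix):
            return stamp
    raise LookupError(f"No log line starts with {prefix!r}")


entries = parse_log(LOG)
duration = find_time(entries, "Done:") - find_time(entries, "Computing the total count")

# The final count is printed with thousands separators.
done_message = next(msg for _, msg in entries if msg.startswith("Done:"))
events = int(done_message.split("=")[1].strip().replace(",", ""))

print(f"compute took {duration:.1f} s for {events:,} events "
      f"({events / duration:,.0f} events/s)")
```

If the timestamp assumption holds, the count itself took a little over 133 s, so most of the time before it was spent elsewhere (before `uproot.dask` opens the files).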

And it seems to fail for the small case:

0001.0025 - INFO - Computing the total count
Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 328, in <module>
    main(
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 169, in main
    r = total_count.compute()  # type: ignore
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1314, in __call__
    (result, counters), duration = with_duration(self._call_impl)(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1152, in wrapper
    result = f(*args, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1296, in _call_impl
    return self.read_tree(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 983, in read_tree
    mapping = self.form_mapping_info.load_buffers(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 906, in load_buffers
    arrays = tree.arrays(
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 823, in arrays
    _ranges_or_baskets_to_arrays(
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 2993, in _ranges_or_baskets_to_arrays
    branchid_to_branch[cache_key]._awkward_check(interpretation)
KeyError: '31fc00d0-07ef-11ef-b2c4-bc4110acbeef:/atlas_xaod_tree;1:jet_EnergyPerSampling(6)'

If that is repeatable, we'll need to follow up.
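One way to check repeatability is to run the same job a few times and tally the outcomes. This is a generic sketch, not part of the repository. `check_repeatable` is a hypothetical helper, and in the real script the `job` callable would presumably wrap the `total_count.compute()` call seen in the traceback.

```python
from collections import Counter
from typing import Callable


def check_repeatable(job: Callable[[], object], attempts: int = 3) -> Counter:
    """Run job several times and count each distinct outcome."""
    outcomes: Counter = Counter()
    for _ in range(attempts):
        try:
            job()
        except Exception as exc:  # record the failure and keep going
            outcomes[f"{type(exc).__name__}: {exc}"] += 1
        else:
            outcomes["ok"] += 1
    return outcomes


# Stand-in for the failing job; the key here is a shortened, made-up example.
def always_fails():
    raise KeyError("jet_EnergyPerSampling(6)")


for outcome, count in check_repeatable(always_fails).items():
    print(f"{count}x {outcome}")
```

The same error message on every attempt points at a deterministic problem (for instance a branch missing from the output); a mix of `ok` and errors points at something intermittent.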