iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

New rate tests using the 50 TB sample #119

Closed gordonwatts closed 3 months ago

gordonwatts commented 3 months ago

Re-run on the 50 TB sample to collect new statistics.

gordonwatts commented 3 months ago

Here is a first run. The ServiceX stage completed, but the DASK stage crashed:

(venv) [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-2e1782e2-0.af-jupyter:8786' --dask-profile --dataset data_50TB --query xaod_medium --num-files 0
0000.0464 - INFO - root - Using release 22.2.107 for type information.
0000.0805 - WARNING - func_adl.type_based_replacement - Unknown type for name len
0000.8260 - INFO - root - Running over 1 datasets, 49.632 TB and 6,367,686,831 events.
0000.8264 - INFO - root - Building ServiceX query
0000.8267 - INFO - root - Querying dataset data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026
0000.8268 - INFO - root - Running on the full dataset(s).
0000.8268 - INFO - root - Starting ServiceX query
0000.8446 - INFO - servicex.servicex_client - Returning code generators from cache
Transform     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
0034.2359 - INFO - servicex.query - ServiceX Transform speed_test_data18_13TeV:data18_13TeV.periodAllYear.phys
Transform     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64803/64803 21:34
1365.9857 - INFO - root - Event rate for ServiceX: 00:22:45 time, 4664.43 kHz, Data rate: 290.85 Gbits/s
1365.9857 - INFO - root - Dataset speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026 has 4539 files
Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 508, in <module>
    main(
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 196, in main
    report, n_events = dask.compute(*calculate_n_events(dataset_files, steps_per_file))
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/distributed/client.py", line 2232, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('from-uproot-111c3b12b31a4842a5be37366b45c1a8', 7606) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://172.16.69.33:41517. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
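
The ServiceX rates printed in the log above can be cross-checked from the other numbers in the same log. This is a minimal sketch, assuming that the elapsed time is the gap between the "Starting ServiceX query" timestamp (0000.8268) and the "Event rate for ServiceX" timestamp (1365.9857), and that TB means 10^12 bytes. The helper name `rates` is hypothetical and is not taken from the repository.

```python
def rates(n_events: int, size_tb: float, elapsed_s: float) -> tuple:
    """Return (event rate in kHz, data rate in Gbit/s)."""
    khz = n_events / elapsed_s / 1e3
    # Assumption: 1 TB = 1e12 bytes, 8 bits per byte.
    gbps = size_tb * 1e12 * 8 / elapsed_s / 1e9
    return khz, gbps


# Values copied from the log: 6,367,686,831 events and 49.632 TB.
elapsed = 1365.9857 - 0.8268  # seconds between the two log timestamps
khz, gbps = rates(6_367_686_831, 49.632, elapsed)
print(f"{khz:.2f} kHz, {gbps:.2f} Gbit/s")
```

Under those assumptions the result agrees with the logged 4664.43 kHz and 290.85 Gbits/s to within rounding.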
gordonwatts commented 3 months ago

[image attachment not preserved]

gordonwatts commented 3 months ago

Unable to get the DASK stage to run. This time the ServiceX result came from the cache:

(venv) [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-2e1782e2-0.af-jupyter:8786' --dask-profile --dataset data_50TB --query xaod_medium --num-files 0

0000.0445 - INFO - root - Using release 22.2.107 for type information.
0000.0780 - WARNING - func_adl.type_based_replacement - Unknown type for name len
0000.7738 - INFO - root - Running over 1 datasets, 49.632 TB and 6,367,686,831 events.
0000.7743 - INFO - root - Building ServiceX query
0000.7746 - INFO - root - Querying dataset data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026
0000.7747 - INFO - root - Running on the full dataset(s).
0000.7748 - INFO - root - Starting ServiceX query
0000.7983 - INFO - servicex.servicex_client - Returning code generators from cache
0000.8135 - INFO - servicex.query - Returning results from cache

0000.8149 - INFO - root - Event rate for ServiceX not calculated since cached result was used
0000.8150 - INFO - root - Dataset speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026 has 4539 files
0044.9130 - INFO - root - Number of skimmed events: 81086783 (skim percent: 1.2734%)
0045.4131 - INFO - root - Using `uproot.dask` to open files (splitting files 2 ways).
0045.4133 - INFO - root - Starting build of DASK graphs
0046.2517 - INFO - root - Computing the total count
Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 508, in <module>
    main(
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 225, in main
    results = dask.compute(*all_tasks_to_run)  # type: ignore
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1316, in __call__
    (result, counters), duration = with_duration(self._call_impl)(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1154, in wrapper
    result = f(*args, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1298, in _call_impl
    return self.read_tree(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 985, in read_tree
    mapping = self.form_mapping_info.load_buffers(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 908, in load_buffers
    arrays = tree.arrays(
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 823, in arrays
    _ranges_or_baskets_to_arrays(
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 3105, in _ranges_or_baskets_to_arrays
    uproot.source.futures.delayed_raise(*obj)
  File "/venv/lib/python3.9/site-packages/uproot/source/futures.py", line 38, in delayed_raise
    raise exception_value.with_traceback(traceback)
  File "/venv/lib/python3.9/site-packages/uproot/behaviors/TBranch.py", line 3026, in chunk_to_basket
    basket = uproot.models.TBasket.Model_TBasket.read(
  File "/venv/lib/python3.9/site-packages/uproot/model.py", line 854, in read
    self.read_members(chunk, cursor, context, file)
  File "/venv/lib/python3.9/site-packages/uproot/models/TBasket.py", line 227, in read_members
    ) = cursor.fields(chunk, _tbasket_format1, context)
  File "/venv/lib/python3.9/site-packages/uproot/source/cursor.py", line 201, in fields
    return format.unpack(chunk.get(start, stop, self, context))
  File "/venv/lib/python3.9/site-packages/uproot/source/chunk.py", line 446, in get
    self.wait(insist=stop)
  File "/venv/lib/python3.9/site-packages/uproot/source/chunk.py", line 388, in wait
    self._raw_data = numpy.frombuffer(self._future.result(), dtype=self._dtype)
  File "/venv/lib/python3.9/site-packages/uproot/source/coalesce.py", line 36, in result
    return self._parent.result(timeout=timeout)[self._s]
TypeError: 'ClientPayloadError' object is not subscriptable
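
The `TypeError` at the bottom of this traceback looks like a secondary symptom. The last frame slices whatever `self._parent.result(...)` returns, and the message says that value was a `ClientPayloadError` object, so a network-layer exception appears to have been handed back as a result instead of being raised. The following stdlib-only sketch reproduces that pattern. `FakePayloadError`, `fetch_chunk`, `read_slice`, and `read_slice_checked` are hypothetical stand-ins, not uproot or aiohttp code.

```python
class FakePayloadError(Exception):
    """Hypothetical stand-in for a network error such as ClientPayloadError."""


def fetch_chunk(fail: bool):
    """Return bytes on success; on failure RETURN (not raise) the exception."""
    if fail:
        return FakePayloadError("simulated incomplete payload")
    return b"0123456789"


def read_slice(fail: bool) -> bytes:
    result = fetch_chunk(fail)
    # Slicing an exception object raises TypeError and hides the real cause.
    return result[2:5]


def read_slice_checked(fail: bool) -> bytes:
    result = fetch_chunk(fail)
    if isinstance(result, BaseException):
        raise result  # surface the underlying error instead
    return result[2:5]
```

With the check in place the caller sees the original network error, which is the kind of failure a retry layer can act on.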
gordonwatts commented 3 months ago

Or this one on the next attempt:

0000.8390 - INFO - root - Dataset speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026 has 4539 files
Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 508, in <module>
    main(
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 196, in main
    report, n_events = dask.compute(*calculate_n_events(dataset_files, steps_per_file))
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/distributed/client.py", line 2232, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('from-uproot-b2342fd60c95f65a9d3451c1468ef4ce', 6672) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://172.16.67.205:39467. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
gordonwatts commented 3 months ago

With the fixed-up fsspec access (a retrying HTTPFileSystem and HTTPFile registered on the DASK cluster):

(venv) [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-2e1782e2-0.af-jupyter:8786' --dask-profile --dataset data_50TB --query xaod_medium --num-files 0
0000.0631 - INFO - root - Registering retry HTTPFileSystem and HTTPFile with fsspec on DASK cluster
0000.3077 - INFO - root - Using release 22.2.107 for type information.
0000.3426 - WARNING - func_adl.type_based_replacement - Unknown type for name len
0001.0711 - INFO - root - Running over 1 datasets, 49.632 TB and 6,367,686,831 events.
0001.0714 - INFO - root - Building ServiceX query
0001.0717 - INFO - root - Querying dataset data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026
0001.0718 - INFO - root - Running on the full dataset(s).
0001.0719 - INFO - root - Starting ServiceX query
0001.1069 - INFO - servicex.servicex_client - Returning code generators from cache
Transform     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
0032.4425 - INFO - servicex.query - ServiceX Transform speed_test_data18_13TeV:data18_13TeV.p
Transform     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/?
Download/URLs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64803/64803 21:45
1338.7573 - INFO - root - Event rate for ServiceX: 00:22:17 time, 4760.23 kHz, Data rate: 296.82 Gbits/s
1338.7574 - INFO - root - Dataset speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026 has 4696 files
1338.7575 - INFO - root - Using `uproot.dask` to open files (splitting files 2 ways).
1410.3202 - INFO - root - Number of skimmed events: 84,978,255 (skim percent: 1.3345%)
1411.3524 - INFO - root - Starting build of DASK graphs
1412.3961 - INFO - root - Computing the total count
1511.9657 - INFO - root - Event rate for DASK Calculation: 00:01:39 time, 63952.35 kHz, Data rate: 3987.74 Gbits/s
1511.9659 - INFO - root - DASK event rate over actual events: 853.46 kHz
1511.9660 - INFO - root - speed_test_data18_13TeV:data18_13TeV.periodAllYear.physics_Main.PhysCont.DAOD_PHYSLITE.grp18_v01_p6026: result = 84,978,255
Duration: 89.99 s
Tasks Information
number of tasks: 30864
compute time: 4hr 5m
disk-write time: 5.93 ms
transfer time: 291.25 s
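
The DASK numbers above can be related to one another with a little arithmetic. This is a sketch under several assumptions: the headline DASK rate is computed against the full dataset event count, the "over actual events" rate uses the skimmed count, the elapsed time is the gap between the "Computing the total count" and "Event rate for DASK Calculation" timestamps, and the report's "compute time" is summed over all tasks, so dividing it by "Duration" estimates the average number of concurrently busy worker threads.

```python
# Values copied from the log above.
total_events = 6_367_686_831
skimmed_events = 84_978_255
elapsed_s = 1511.9657 - 1412.3961  # log timestamps, so only approximate

skim_percent = 100 * skimmed_events / total_events
headline_khz = total_events / elapsed_s / 1e3   # rate over the full dataset
actual_khz = skimmed_events / elapsed_s / 1e3   # rate over skimmed events

# Assumption: "compute time" (4hr 5m, already rounded) is summed over tasks.
compute_s = 4 * 3600 + 5 * 60
duration_s = 89.99
avg_busy_threads = compute_s / duration_s

print(f"skim: {skim_percent:.4f}%")
print(f"headline: {headline_khz:.0f} kHz, actual: {actual_khz:.1f} kHz")
print(f"average busy threads: {avg_busy_threads:.0f}")
```

The two DASK rates differ by exactly the skim fraction, so the 63952.35 kHz headline figure reflects the size of the original dataset, while the 853.46 kHz figure reflects the events DASK actually read. Small differences from the logged values are expected because the log timestamps are not the timer the script used.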

[image attachment not preserved]
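
The log line "Registering retry HTTPFileSystem and HTTPFile with fsspec on DASK cluster" indicates that the fix wraps HTTP reads in a retry. The repository's actual implementation is not shown in this thread, so the following is only a generic, stdlib-only sketch of retry with exponential backoff around a flaky read. `with_retries`, `flaky_read`, and all parameter values are hypothetical.

```python
import time


def with_retries(fn, attempts=3, base_delay_s=0.1, retry_on=(OSError,)):
    """Call fn(); on a listed exception wait and retry, up to `attempts` calls."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on as err:
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff: base, 2*base, 4*base, ...
                time.sleep(base_delay_s * 2 ** attempt)
    raise last_error


# Simulated read that drops the connection twice before succeeding.
calls = {"n": 0}


def flaky_read() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated dropped connection")
    return b"payload"


data = with_retries(flaky_read, attempts=5, base_delay_s=0.0)
print(data, "after", calls["n"], "calls")
```

Retrying only on a named set of exception types keeps genuine bugs (for example a `TypeError`) from being silently retried.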