iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE
6 stars 1 forks source link

Slow DASK calculation #95

Closed gordonwatts closed 4 months ago

gordonwatts commented 4 months ago

With all the latest updates - in partiuclar, the ones that fixed up how we dealt with optimization, we are seeing a dramatic slow down:

0000.8398 - INFO - root - Building ServiceX query
0000.8399 - INFO - root - Using dataset mc20_13TeV.364157.Sherpa_221_NNPDF30NNLO_Wmunu_MAXHTPTV0_70_CFilterBVeto.deriv.DAOD_PHYSLITE.e5340_s3681_r13145_p6026.
0000.8400 - INFO - root - Running on the full dataset.
0000.8403 - INFO - root - Starting ServiceX query
0000.8515 - INFO - servicex.servicex_client - Returning code generators from cache
0000.8591 - INFO - servicex.query - Returning results from cache

0000.8622 - INFO - root - Using `uproot.dask` to open files (splitting files 2 ways).
0001.0958 - INFO - root - Generating the dask compute graph for 42 fields
0001.0962 - INFO - root - Field event_number is not a scalar field. Skipping count.
0001.0965 - INFO - root - Field run_number is not a scalar field. Skipping count.
0001.3618 - INFO - root - Number of tasks in the dask graph: optimized: 2,658 unoptimized 300,130
0001.3618 - INFO - root - Computing the total count
0104.5942 - INFO - root - Done: result = 478,367

Processing the data is odne in a few seconds with a large DASK cluster. However, there is one task that takes... waay too much time. Almost a minute on its own. I suspect this has to do with how we altered the optimiation.

gordonwatts commented 4 months ago

Saw this crash while trying to get to the bottom of this:

0001.6921 - INFO - root - Computing the total count
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 646, in async_fetch_range
    r = await self.session.get(
  File "/venv/lib/python3.9/site-packages/aiohttp/client.py", line 608, in _request
    await resp.start(conn)
  File "/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 991, in start
    self._continue = None
  File "/venv/lib/python3.9/site-packages/aiohttp/helpers.py", line 735, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 338, in <module>
    main(
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 170, in main
    r = dask.compute(total_count)  # type: ignore
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1345, in __call__
    )
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1268, in _call_impl
    object_path,
  File "/venv/lib/python3.9/site-packages/uproot/_util.py", line 967, in regularize_object_path
    **options,
  File "/venv/lib/python3.9/site-packages/uproot/reading.py", line 573, in __init__
    self._begin_chunk = self._source.chunk(
  File "/venv/lib/python3.9/site-packages/uproot/source/fsspec.py", line 115, in chunk
    data = self._fh.read(stop - start)
  File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 598, in read
    return super().read(length)
  File "/venv/lib/python3.9/site-packages/fsspec/spec.py", line 1846, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/venv/lib/python3.9/site-packages/fsspec/caching.py", line 421, in _fetch
    self.cache = self.fetcher(start, bend)
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 101, in sync
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError0001.6921 - INFO - root - Computing the total count
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 646, in async_fetch_range
    r = await self.session.get(
  File "/venv/lib/python3.9/site-packages/aiohttp/client.py", line 608, in _request
    await resp.start(conn)
  File "/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 991, in start
    self._continue = None
  File "/venv/lib/python3.9/site-packages/aiohttp/helpers.py", line 735, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 338, in <module>
    main(
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 170, in main
    r = dask.compute(total_count)  # type: ignore
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1345, in __call__
    )
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1268, in _call_impl
    object_path,
  File "/venv/lib/python3.9/site-packages/uproot/_util.py", line 967, in regularize_object_path
    **options,
  File "/venv/lib/python3.9/site-packages/uproot/reading.py", line 573, in __init__
    self._begin_chunk = self._source.chunk(
  File "/venv/lib/python3.9/site-packages/uproot/source/fsspec.py", line 115, in chunk
    data = self._fh.read(stop - start)
  File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 598, in read
    return super().read(length)
  File "/venv/lib/python3.9/site-packages/fsspec/spec.py", line 1846, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/venv/lib/python3.9/site-packages/fsspec/caching.py", line 421, in _fetch
    self.cache = self.fetcher(start, bend)
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 101, in sync
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
gordonwatts commented 4 months ago

The above was with 1000 cores. And that timeout seems to have been repeatable.

gordonwatts commented 4 months ago

When running you can look at the task list and you'll see:

image

And the length of that one task is 300 seconds - which is 5 minutes. So that is some sort of timeout.