Closed gordonwatts closed 4 months ago
Saw this crash while trying to get to the bottom of this:
0001.6921 - INFO - root - Computing the total count
Traceback (most recent call last):
File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 646, in async_fetch_range
r = await self.session.get(
File "/venv/lib/python3.9/site-packages/aiohttp/client.py", line 608, in _request
await resp.start(conn)
File "/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 991, in start
self._continue = None
File "/venv/lib/python3.9/site-packages/aiohttp/helpers.py", line 735, in __exit__
raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 338, in <module>
main(
File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 170, in main
r = dask.compute(total_count) # type: ignore
File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
results = schedule(dsk, keys, **kwargs)
File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1345, in __call__
)
File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1268, in _call_impl
object_path,
File "/venv/lib/python3.9/site-packages/uproot/_util.py", line 967, in regularize_object_path
**options,
File "/venv/lib/python3.9/site-packages/uproot/reading.py", line 573, in __init__
self._begin_chunk = self._source.chunk(
File "/venv/lib/python3.9/site-packages/uproot/source/fsspec.py", line 115, in chunk
data = self._fh.read(stop - start)
File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 598, in read
return super().read(length)
File "/venv/lib/python3.9/site-packages/fsspec/spec.py", line 1846, in read
out = self.cache._fetch(self.loc, self.loc + length)
File "/venv/lib/python3.9/site-packages/fsspec/caching.py", line 421, in _fetch
self.cache = self.fetcher(start, bend)
File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 101, in sync
raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
The above was with 1000 cores. And that timeout seems to have been repeatable.
When running you can look at the task list and you'll see:
And the length of that one task is 300 seconds - which is 5 minutes. So that is some sort of timeout.
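For what it's worth, 300 seconds matches aiohttp's documented default total client timeout (5 minutes), which would fit the `asyncio.TimeoutError` raised from `aiohttp/helpers.py` in the traceback. I have not confirmed that this is the limit being hit here. Below is a minimal stdlib-only sketch of the failure pattern the traceback shows: an async read times out and is re-raised as a different exception with `raise ... from ...`. `ReadTimeoutError`, `slow_fetch`, `fetch_sync`, and `caused_by_timeout` are hypothetical names for illustration, not fsspec or uproot APIs.

```python
import asyncio


class ReadTimeoutError(Exception):
    """Hypothetical stand-in for fsspec.exceptions.FSTimeoutError."""


async def slow_fetch(delay):
    """Pretend to fetch a byte range; `delay` simulates a slow server."""
    await asyncio.sleep(delay)
    return b"data"


def fetch_sync(delay, timeout):
    """Run the coroutine to completion, turning a timeout into a chained error."""
    try:
        return asyncio.run(asyncio.wait_for(slow_fetch(delay), timeout=timeout))
    except asyncio.TimeoutError as exc:
        # Same shape as "raise FSTimeoutError from return_result" in the traceback.
        raise ReadTimeoutError(f"no response within {timeout} s") from exc


def caused_by_timeout(exc):
    """Walk the `raise ... from ...` chain looking for an asyncio timeout."""
    current = exc
    while current is not None:
        if isinstance(current, asyncio.TimeoutError):
            return True
        current = current.__cause__
    return False
```

Calling `fetch_sync(delay=0.2, timeout=0.01)` raises `ReadTimeoutError`, and `caused_by_timeout` on that exception reports whether a timeout sits anywhere in its cause chain. If the aiohttp default really is the cause, one thing that might be worth trying (untested here) is passing a longer `aiohttp.ClientTimeout` through fsspec's `client_kwargs` option for the HTTP filesystem.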
With all the latest updates - in particular, the ones that fixed up how we dealt with optimization - we are seeing a dramatic slowdown:
Processing the data is done in a few seconds with a large Dask cluster. However, there is one task that takes way too much time: almost a minute on its own. I suspect this has to do with how we altered the optimization.