iris-hep / idap-200gbps-atlas

benchmarking throughput with PHYSLITE

ServiceX Client S3 tells you to slow down with too many dask workers #30

Closed gordonwatts closed 6 months ago

gordonwatts commented 6 months ago

After fixing #20, we were able to run with more Dask workers, but we can still trigger this error:

(venv) [bash][gwatts]:idap-200gbps-atlas > python servicex/servicex_materialize_branches.py -v --distributed-client scheduler --dask-scheduler 'tcp://dask-gwatts-ead73a76-c.af-jupyter:8786' --num-files 0
0000.0744 - INFO - Using release 21.2.231
0000.0749 - INFO - Building ServiceX query
0000.1044 - WARNING - Fetched the default calibration configuration for a query. It should have been intentionally configured - using configuration for data format PHYS
0000.1327 - INFO - Starting ServiceX query
0000.7497 - INFO - Running servicex query for f70228e6-6655-443a-a7f2-77de0937d134 took 0:00:00.278472 (no files downloaded)                                      
0000.7583 - INFO - Finished ServiceX query
0000.7593 - INFO - Using `uproot.dask` to open files
0001.2214 - INFO - Generating the dask compute graph for 27 fields
0001.3238 - INFO - Computing the total count
Traceback (most recent call last):
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 325, in <module>
    main(ignore_cache=args.ignore_cache, num_files=args.num_files,
  File "/home/gwatts/code/iris-hep/idap-200gbps-atlas/servicex/servicex_materialize_branches.py", line 228, in main
    r = total_count.compute()  # type: ignore
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/venv/lib/python3.9/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1343, in __call__
    result, _ = self._call_impl(
  File "/venv/lib/python3.9/site-packages/uproot/_dask.py", line 1266, in _call_impl
    ttree = uproot._util.regularize_object_path(
  File "/venv/lib/python3.9/site-packages/uproot/_util.py", line 962, in regularize_object_path
    file = ReadOnlyFile(
  File "/venv/lib/python3.9/site-packages/uproot/reading.py", line 761, in root_directory
    return ReadOnlyDirectory(
  File "/venv/lib/python3.9/site-packages/uproot/reading.py", line 1400, in __init__
    keys_chunk = file.chunk(keys_start, keys_stop)
  File "/venv/lib/python3.9/site-packages/uproot/reading.py", line 1185, in chunk
    return self._source.chunk(start, stop)
  File "/venv/lib/python3.9/site-packages/uproot/source/fsspec.py", line 115, in chunk
    data = self._fh.read(stop - start)
  File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 598, in read
    return super().read(length)
  File "/venv/lib/python3.9/site-packages/fsspec/spec.py", line 1846, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/venv/lib/python3.9/site-packages/fsspec/caching.py", line 439, in _fetch
    self.cache = self.fetcher(start, bend)
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/venv/lib/python3.9/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 653, in async_fetch_range
    r.raise_for_status()
  File "/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1060, in raise_for_status
    raise ClientResponseError(
Exception: ClientResponseError(RequestInfo(url=URL('https://s3.af.uchicago.edu/f70228e6-6655-443a-a7f2-77de0937d134/root:::192.170.240.145::root:::eosatlas.cern.ch:1094::eos:atlas:atlasdatadisk:rucio:mc23_13p6TeV:e5:17:DAOD_PHYSLITE.37223155._000341.pool.root.1?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ABAOJZ4XMLKWO5H0PZJ3/20240412/af-object-store/s3/aws4_request&X-Amz-Date=20240412T190811Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=385d92df18e0cad7e071e0dc84ef8c72fc32d8ec2f02a63bf1fd97d2304083f9'), method='GET', headers=<CIMultiDictProxy('Host': 's3.af.uchicago.edu', 'Range': 'bytes=30381811-35624926', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'Python/3.9 aiohttp/3.9.3')>, real_url=URL('https://s3.af.uchicago.edu/f70228e6-6655-443a-a7f2-77de0937d134/root:::192.170.240.145::root:::eosatlas.cern.ch:1094::eos:atlas:atlasdatadisk:rucio:mc23_13p6TeV:e5:17:DAOD_PHYSLITE.37223155._000341.pool.root.1?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ABAOJZ4XMLKWO5H0PZJ3/20240412/af-object-store/s3/aws4_request&X-Amz-Date=20240412T190811Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=385d92df18e0cad7e071e0dc84ef8c72fc32d8ec2f02a63bf1fd97d2304083f9')), (), status=503, message='Slow Down', headers=<CIMultiDictProxy('Date': 'Fri, 12 Apr 2024 19:08:48 GMT', 'Content-Type': 'application/xml', 'Content-Length': '211', 'Connection': 'keep-alive', 'x-amz-request-id': 'tx00000000000000002daba-00661986c0-7b36232-af-object-store', 'Accept-Ranges': 'bytes', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains')>)

This test was run with workers already set up (not dynamically scaling). It occurs with:

I think what is happening is that 200 workers hit S3 at exactly the same time, and that triggers its Slow Down response. With dynamic scaling, the nodes come up gradually, so the S3 load is spread out a little.
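If the problem really is all workers opening their first connection in the same instant, one cheap mitigation (a sketch, not anything ServiceX or uproot provides) is a random start-up jitter per worker before the first S3 read. The function name and jitter window below are made up for illustration:

```python
import random
import time


def staggered_start(max_jitter_s: float = 30.0) -> float:
    """Sleep a random amount before the first S3 read so a fixed pool of
    workers does not open all of its connections at the same instant.

    Returns the delay actually used, which is handy for logging.
    """
    delay = random.uniform(0.0, max_jitter_s)
    time.sleep(delay)
    return delay
```

This mimics what dynamic scaling does for free: it spreads the initial burst of requests over a window instead of a single instant.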

gordonwatts commented 6 months ago

@fengpinghu - do we know how many simultaneous requests S3 should handle? Perhaps we need to limit it somehow?
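One way to cap the request rate, if we can find out what the endpoint tolerates, is a semaphore around the byte-range fetches. The cap value and the `limited_read` wrapper below are hypothetical; note that a semaphore like this only limits concurrency within one worker process, so a cluster-wide cap of N workers × M slots still has to stay under whatever the S3 gateway allows:

```python
import threading

# Hypothetical cap - the real limit would have to come from the AF S3 admins.
MAX_CONCURRENT_S3_READS = 50
_s3_slots = threading.BoundedSemaphore(MAX_CONCURRENT_S3_READS)


def limited_read(fetch, start: int, stop: int):
    """Run a byte-range fetch while holding a semaphore slot, so at most
    MAX_CONCURRENT_S3_READS requests are in flight from this process."""
    with _s3_slots:
        return fetch(start, stop)
```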

gordonwatts commented 6 months ago

Tried, but ServiceX isn't transforming right now. Being followed up in #servicex.

gordonwatts commented 6 months ago

We'll need backoff to get this to work, so we'll leave this unfixed and try out #44.
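For reference, the shape of the fix would be a retry wrapper with exponential backoff and jitter around the read that raises the 503. This is a generic sketch, not a patch to uproot or fsspec; the function name, retry counts, and the string match on "Slow Down" are all assumptions for illustration:

```python
import random
import time


def with_backoff(fn, *args, retries=5, base_s=1.0, cap_s=60.0, **kwargs):
    """Call fn, retrying on errors whose message mentions 'Slow Down'
    (the S3 503 response seen above). Between attempts, sleep an
    exponentially growing interval with random jitter so that retrying
    workers do not re-synchronize and hammer S3 in lockstep again."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception as err:
            # Re-raise anything that isn't a throttle, and give up
            # after the final attempt.
            if "Slow Down" not in str(err) or attempt == retries - 1:
                raise
            delay = min(cap_s, base_s * 2**attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The jitter factor matters as much as the exponential growth: without it, 200 throttled workers would all retry at the same instant and trigger the same 503 again.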