ssl-hep / ServiceX_frontend

Client access library for ServiceX
https://servicex-frontend.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Protect client against bad replies when requesting transform status from server #450

Closed: ponyisi closed this issue 1 month ago

ponyisi commented 1 month ago

When testing against the AGC I see frequent errors of the form

  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/servicex/query_core.py", line 204, in transform_complete
    raise task.exception()
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/servicex/query_core.py", line 363, in transform_status_listener
    await self.retrieve_current_transform_status()
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/servicex/query_core.py", line 433, in retrieve_current_transform_status
    s = await self.servicex.get_transform_status(self.request_id)
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/servicex/servicex_adapter.py", line 142, in get_transform_status
    o = await r.json()
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1194, in json
    await self.read()
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1134, in read
    self._body = await self.content.read()
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/aiohttp/streams.py", line 383, in read
    block = await self.readany()
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/aiohttp/streams.py", line 405, in readany
    await self._wait("readany")
  File "/home/onyisi/servicex/analysis-grand-challenge/analyses/cms-open-data-ttbar/venv/lib64/python3.9/site-packages/aiohttp/streams.py", line 312, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed: <ContentLengthError: 400, message='Not enough data for satisfy content length header.'>

The client needs protection against these responses (which I believe are transient ... ?). A request should not die on the client side simply because one response failed. (That said, I think there are some real problems with the HTTP proxies for connections to the SSL k8s ...)

Marking as a 3.0 thing because this really shows up on almost all my attempted submissions.

kyungeonchoi commented 1 month ago

If fallback/retry works as expected, it retries 5 times and should wait about 2 minutes in total (10 + 2*10 + 30 + 30 + 30 seconds). We also check the transform status every 5 seconds. BTW, I haven't seen this error before. Could you let me know which version of the aiohttp package you are using?
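As a quick check of the arithmetic above, here is a minimal sketch of one schedule that reproduces those numbers: doubling from 10 seconds and capping at 30 seconds. The function name `backoff_delays` and its parameters are hypothetical, and the doubling rule itself is an assumption; the actual client may compute its waits differently.

```python
def backoff_delays(attempts=5, base=10.0, cap=30.0):
    """Return the wait (in seconds) before each retry: base * 2**i, capped at `cap`."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]


delays = backoff_delays()          # assumed schedule, not taken from the client code
total = sum(delays)                # total time spent waiting across all retries
polls_in_window = int(total // 5)  # status polls (one every 5 s) that fit in that window

print(delays, total, polls_in_window)
```

Under these assumptions the five waits add up to two minutes, which corresponds to roughly two dozen status polls at the 5-second interval.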

ponyisi commented 1 month ago

Hi @kyungeonchoi the version is 3.10.5.

I suspect the issue here is that the connection succeeds, but the payload that is returned is bad (due to some weird proxy issue or something), and so the retry logic doesn't kick in (it is not a timeout or a 5xx error).
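To illustrate the kind of protection being asked for, here is a minimal sketch of a retry wrapper that treats a truncated payload as retryable, separately from timeouts or 5xx handling. It is not the ServiceX implementation: `retry_on_bad_payload`, `flaky_status`, and the local `PayloadError` class are hypothetical names, and `PayloadError` only stands in for `aiohttp.ClientPayloadError` so the example runs without aiohttp installed.

```python
import asyncio


class PayloadError(Exception):
    """Stand-in for aiohttp.ClientPayloadError (assumption: real code would catch that)."""


async def retry_on_bad_payload(fetch, attempts=3, delay=0.0, retry_on=(PayloadError,)):
    """Await fetch() and retry only when reading the response body fails."""
    for attempt in range(1, attempts + 1):
        try:
            return await fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            await asyncio.sleep(delay)


# Hypothetical status call: the body is truncated twice, then a good reply arrives.
calls = {"n": 0}


async def flaky_status():
    calls["n"] += 1
    if calls["n"] < 3:
        raise PayloadError("Response payload is not completed")
    return {"status": "Running"}


result = asyncio.run(retry_on_bad_payload(flaky_status))
print(result, calls["n"])
```

The point of the sketch is only that the exception raised while reading the body has to be in the set of retried errors; a policy keyed on status codes alone never sees it.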

gordonwatts commented 1 month ago

I had to add retries like this to get things working for the 200 Gbps.

I think everything but the transform submission needs retries. :-) The reason not to retry submission is that when I did, and the server didn't reply, the transform was still being submitted, so I ended up making things worse by just re-submitting. :-)
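The difference between retrying a status check and retrying a submission can be shown with a toy model. Everything here is hypothetical (`FakeServer`, `with_retry`, and the lost-reply behaviour are invented for illustration and are not ServiceX code): the server processes each request but drops the first reply of each kind.

```python
class FakeServer:
    """Toy server: every request is processed, but the first reply of each kind is lost."""

    def __init__(self):
        self.transforms = []
        self._dropped = set()

    def _reply(self, kind, value):
        if kind not in self._dropped:
            self._dropped.add(kind)
            raise ConnectionError("reply lost")
        return value

    def submit(self, query):
        self.transforms.append(query)  # side effect happens before the reply is sent
        return self._reply("submit", len(self.transforms))

    def status(self, request_id):
        return self._reply("status", "Running")  # read-only, no side effect


def with_retry(call, attempts=3):
    """Retry `call` on ConnectionError, re-raising after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise


server = FakeServer()
status = with_retry(lambda: server.status(1))  # safe: repeating a read changes nothing
with_retry(lambda: server.submit("query"))     # unsafe: the retry submits a second time
duplicates = len(server.transforms)

print(status, duplicates)
```

In this model the retried status call simply succeeds on the second try, while the retried submission leaves two transforms on the server for one intended request.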

kyungeonchoi commented 1 month ago

@ponyisi - I've created a PR #469 to fix this issue. Please let me know if it looks good to you. I will add tests then.

ponyisi commented 1 month ago

Retries added