planetlabs / planet-client-python

Python client for Planet APIs
https://planet-sdk-for-python-v2.readthedocs.io/en/latest/
Apache License 2.0

Persistent `httpx.ConnectTimeout` when activating and downloading large volumes of UDM files #1050

Open tbarsballe opened 3 months ago

tbarsballe commented 3 months ago

Expected behavior

I expect the SDK to handle a large number of concurrent requests under typical usage patterns, e.g. activating and downloading about a thousand UDM files.

Actual behavior

We are trying to download a week's worth of PSScene UDMs for 150 different small AOIs, all in parallel. When running many UDM activations in parallel, we see persistent httpx.ConnectTimeout failures during DataClient.wait_asset and DataClient.download_asset. Rarely, I instead see httpx.PoolTimeout, though it seems like this error can come from any DataClient interaction.

Based on the discussion in #580, it appears httpx.ConnectTimeout is not currently retried, because it didn't occur frequently during that set of testing (which was mainly focused on the orders API). I expect we are seeing a higher incidence of connect failures here because UDMs are small and quick to download.
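For illustration, retrying a transient connect failure could look roughly like the sketch below. This is not the SDK's retry logic; `retry_async` and its parameters are hypothetical names, and the block uses only the standard library so it runs without network access. In real use, `retry_on` would presumably be `(httpx.ConnectTimeout,)` and `func` would wrap a call such as DataClient.wait_asset.

```python
import asyncio
import random


async def retry_async(func, retry_on, max_tries=5, base_delay=0.5):
    """Await the zero-argument coroutine function `func`, retrying on failure.

    `retry_on` is a tuple of exception classes treated as transient
    (hypothetical helper, not part of the Planet SDK).
    """
    for attempt in range(1, max_tries + 1):
        try:
            return await func()
        except retry_on:
            if attempt == max_tries:
                # Out of attempts: surface the last error to the caller.
                raise
            # Exponential backoff with jitter so retries do not arrive in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
```

Any exception not listed in `retry_on` propagates immediately, so unrelated errors are not hidden by the retry loop.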

However, simply adding httpx.ConnectTimeout to RETRY_EXCEPTIONS is not sufficient to fix this issue. When I tried this, I reliably got the httpx.PoolTimeout that I had previously seen only rarely. That failure appears more difficult to resolve, as it comes from the session as a whole rather than from an individual request. Revisiting #580 once more, the PoolTimeout error seems correlated with the total volume of concurrent requests (which, for this use case, should be no greater than 1050).
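If the pool errors do scale with the number of requests in flight, one caller-side mitigation would be to cap concurrency with a semaphore. The sketch below shows the idea using only the standard library; `run_bounded` and `limit` are hypothetical names, and whether this actually avoids httpx.PoolTimeout with the SDK has not been tested here.

```python
import asyncio


async def run_bounded(coro_funcs, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Hypothetical helper, not part of the Planet SDK. Results are returned
    in the same order as `coro_funcs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(func):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await func()

    return await asyncio.gather(*(guarded(f) for f in coro_funcs))
```

Each entry in `coro_funcs` would wrap one activation or download, for example a function that awaits DataClient.download_asset for a single asset. All tasks are still created up front, but only `limit` of them issue requests at any one time.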

Obviously, resolving the underlying issues would be optimal, but barring that, there is another problem here: these failures produce errors that are difficult or impossible for the end user to troubleshoot and solve. If there is in fact an upper limit on the number of concurrent requests the SDK can handle (as this failure and the discussion in #580 seem to imply), some documentation describing those limits, and the kinds of errors they cause, would be beneficial.

Related Issues

Workaround

The only workaround we've found has been to not use the SDK.

Minimum, Complete, Viable Code Sample

None at the moment, but I can provide a link to the project this error was encountered on.

Environment Information

Installation Method

tbarsballe commented 3 months ago

@adripollack @joshuachungplanet - I've documented the issue you've been seeing here. Please add any details I've missed.