google / Xee

An Xarray extension for Google Earth Engine
Apache License 2.0

Long-running code results in `requests` `ChunkedEncodingError` exception (broken connection) #125

Open noahgolmant opened 8 months ago

noahgolmant commented 8 months ago

I have a script ingesting ~200 GB of Landsat imagery with the current multi-threaded implementation (no Dataflow). Eventually, it always fails with an exception like:

requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(9186238 bytes read, 1299762 more expected)', IncompleteRead(9186238 bytes read, 1299762 more expected))

This occurs in the `common.robust_getitem` call.

Lowering the chunk size has reduced how often this exception occurs: I can now get through ~150 GB instead of failing after ~90 GB. It's hard to say whether that improvement is reliable, though, since the failure is non-deterministic.
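One plausible reason smaller chunks help is that each request then has a smaller response body, so there is less time for the connection to break mid-read. The sketch below estimates the uncompressed size of one chunk. The chunk shapes are hypothetical examples, and the estimate assumes one chunk maps to one HTTP response, which may not match how Xee or Earth Engine actually encode and transfer data. For reference, the response in the traceback above was expected to be 9,186,238 + 1,299,762 bytes in total.

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    """Return the uncompressed size in bytes of one chunk.

    chunk_shape: per-dimension chunk lengths, e.g. (time, x, y).
    itemsize: bytes per element (8 for float64, 4 for float32).
    """
    return math.prod(chunk_shape) * itemsize


# Total response size implied by the IncompleteRead message.
expected = 9_186_238 + 1_299_762

# Hypothetical chunk shapes, for illustration only.
large = chunk_nbytes((48, 512, 256), 8)
small = chunk_nbytes((12, 256, 256), 8)
print(expected, large, small)  # 10486000 50331648 6291456
```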

The root cause isn't clear to me: it could be a multithreading/lock issue, or the server could be closing the connection prematurely. Either way, the current code applies the retry/backoff logic only to `EEException`s. Retrying on any `Exception` instead of just `EEException` has worked for me, but that is not an ideal solution.
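A middle ground between catching only `EEException` and catching every `Exception` is to retry on an explicit tuple of transient errors. Below is a minimal, self-contained sketch of exponential backoff with jitter over such a tuple. `retry_getitem` and `TransientReadError` are hypothetical names: `TransientReadError` stands in for `requests.exceptions.ChunkedEncodingError` so the example runs without `requests` installed. This is not Xee's actual code. If xarray's `robust_getitem` takes a `catch` argument, as it appears to, passing a tuple such as `(ee.ee_exception.EEException, requests.exceptions.ChunkedEncodingError)` there might achieve the same thing with less new code.

```python
import random
import time


class TransientReadError(Exception):
    """Hypothetical stand-in for requests.exceptions.ChunkedEncodingError."""


def retry_getitem(getitem, key, catch=(TransientReadError,), max_retries=6,
                  initial_delay_ms=500, sleep=time.sleep):
    """Call getitem(key), retrying with exponential backoff on `catch`.

    Only exceptions listed in `catch` are retried; anything else propagates
    immediately, unlike a blanket `except Exception`.
    """
    for attempt in range(max_retries + 1):
        try:
            return getitem(key)
        except catch:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error.
            base_ms = initial_delay_ms * 2 ** attempt
            # Random jitter so concurrent threads don't retry in lockstep.
            delay_ms = base_ms + random.randint(0, base_ms)
            sleep(delay_ms / 1000.0)


# Usage: a fake fetch that drops the connection twice, then succeeds.
calls = {"n": 0}


def flaky(key):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientReadError("Connection broken: IncompleteRead")
    return f"data for {key}"


result = retry_getitem(flaky, "chunk-0", sleep=lambda s: None)
print(result, calls["n"])  # data for chunk-0 3
```

Keeping `catch` narrow means genuine bugs (a `ValueError` from bad indexing, say) still fail fast instead of being retried six times.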

I'd imagine we don't see this with Dataflow because it has its own worker retry logic?

naschmitz commented 7 months ago

@noahgolmant could you add some code to reproduce this issue?

Thanks!