Closed: MikeBeller closed this issue 2 years ago.
This is tricky... There are various issues in dask/distributed around this topic (https://github.com/dask/distributed/issues/5172 is the primary one).
Can you provide more information on the behavior of the cluster as this is computing? In particular:
Thanks Tom. I would say this typically happens when the total computation may exceed the memory of the cluster. My assumption is that dask.distributed would spill to disk -- but the issue you reference does say this happens in cases of high task latency, often due to 'spill to disk'. So you may be right on track.
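For reference, the spill behavior is governed by the worker memory thresholds in the dask config. A rough sketch of adjusting them (the default fractions below are approximate, and they need to be in effect where the workers start, not just on the client):

import dask

# Fractions of the worker memory limit (approximate defaults shown):
# target ~0.6: start spilling to disk; spill ~0.7: spill more aggressively;
# pause ~0.8: stop accepting new tasks; terminate ~0.95: restart the worker.
dask.config.set({
    "distributed.worker.memory.target": 0.6,
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,
    "distributed.worker.memory.terminate": 0.95,
})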
Let me do some more digging and observing during my runs and see if I can get better answers to your questions. Will post back when I have that, or better yet an isolated example.
I just posted #12 -- still getting missing-dependent errors intermittently. Per your question, the example in #12 was run on a fixed-size cluster that had fully stabilized before the workload started. The code there is a repeatable example. If you wanted to run it yourself, I'm sure you'd see many things I am missing.
@MikeBeller what version of distributed is this? I hear unconfirmed rumors that this "Could not find dependent" issue may be fixed in 2021.11.2 and later. @TomAugspurger maybe you should update dask as well?
import dask.distributed
dask.distributed.__version__
# '2021.08.1'
The RuntimeError: Set changed size during iteration is reminiscent of https://github.com/dask/distributed/issues/4371. I don't think that was actually properly fixed, but the handle_missing_dep function in question was removed entirely in dask/distributed#5046, so upgrading to distributed >= 2021.10.0 might indeed eliminate this problem.
I also happened to notice these transient errors from rasterio while reading the data:
RuntimeError('Error opening \'https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/PU/2020/01/30/S2A_MSIL2A_20200130T080151_N0212_R035_T35KPU_20201002T151436.SAFE/GRANULE/L2A_T35KPU_A024056_20200130T080800/IMG_DATA/R10m/T35KPU_20200130T080151_B03_10m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D\': RasterioIOError("\'/vsicurl/https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/PU/2020/01/30/S2A_MSIL2A_20200130T080151_N0212_R035_T35KPU_20201002T151436.SAFE/GRANULE/L2A_T35KPU_A024056_20200130T080800/IMG_DATA/R10m/T35KPU_20200130T080151_B03_10m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D\' not recognized as a supported file format.")')
RuntimeError("Error reading Window(col_off=24576, row_off=12288, width=2048, height=2048) from 'https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/QT/2020/02/01/S2B_MSIL2A_20200201T075039_N0212_R135_T35KQT_20200930T092316.SAFE/GRANULE/L2A_T35KQT_A015176_20200201T080648/IMG_DATA/R20m/T35KQT_20200201T075039_SCL_20m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D': RasterioIOError('Read or write failed. IReadBlock failed at X offset 48, Y offset 100: IReadBlock failed at X offset 5, Y offset 4: TIFFReadEncodedTile() failed.')")
So I'd probably add this just to ignore them for now.
from rasterio.errors import RasterioIOError

stackstac.stack(
    ...,
    errors_as_nodata=(
        RasterioIOError("HTTP response code: 404"),
        RuntimeError("Error opening"),
        RuntimeError("Error reading Window"),
    ),
)
I also happened to notice these transient errors from rasterio while reading the data:
Those are from rate limits on Azure Blob Storage (account-wide limits, not necessarily specific to your requests). If you enable CPL_CURL_VERBOSE=1, you might see logs like:
< HTTP/1.1 503 Egress is over the account limit.
< CPLE_None in VSICURL: Got response_code=503
< Content-Length: 212
< Content-Type: application/xml
< Server: Microsoft-HTTPAPI/2.0
< x-ms-request-id: 4ca9c883-301e-0045-72f7-f13f4d000000
< x-ms-error-code: ServerBusy
< Date: Wed, 15 Dec 2021 21:04:00 GMT
<
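For reference, a minimal sketch of one way to turn that verbose logging on from Python; it has to be in effect in the worker processes before any datasets are opened, so setting it as an environment variable on the workers is the more practical route:

import os

# GDAL reads CPL_* options from the environment; this must be set on each
# dask worker, not just the client, and before any COGs are opened.
os.environ["CPL_CURL_VERBOSE"] = "YES"

# Or scope it to a block with rasterio's config manager:
import rasterio
with rasterio.Env(CPL_CURL_VERBOSE=True):
    ...  # open/read datasets here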
So those would be good errors to retry after some backoff at some layer in the stack.
Interesting that it's a 503 instead of 429. Good to note for https://github.com/gjoseph92/stackstac/issues/18. Retries are definitely a top priority for robustness, I just don't know whether to do them at the dask level (feels more appropriate, but https://github.com/dask/dask/issues/7036 is a blocker) or just within stackstac.
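For context, dask already exposes per-task retries at compute time; a rough sketch (the retries keyword is real, but whether it interacts well with stackstac's graphs here is exactly the open question above):

from distributed import Client

client = Client()  # assumes an existing cluster; pass the scheduler address in practice

# `retries` asks the scheduler to re-run failed tasks instead of failing the
# whole computation. Note it retries *any* task failure, not just transient
# 503s, which is part of why a targeted retry in stackstac/GDAL may be nicer.
future = client.compute(stacked, retries=3)  # `stacked` is the stackstac DataArray (placeholder name)
result = future.result()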
GDAL has a couple config options GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY (https://gdal.org/user/configoptions.html) that we can set. IMO, that's probably best done at the application layer, but I'm not sure. I'll likely set them on the Planetary Computer hub.
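For illustration, a sketch of what setting those would look like as environment variables (roughly what configuring them on the hub image amounts to; the values are arbitrary examples):

import os

# GDAL retries failed HTTP requests (e.g. 429 and 503 responses) up to
# MAX_RETRY times, waiting RETRY_DELAY seconds between attempts.
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

As with any GDAL config, these need to be in effect in the worker processes before the files are opened.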
Nice, I didn't know about these! That would be very easy to set in stackstac's LayeredEnv right now. Let's do that.
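If it's useful in the meantime, a rough sketch of layering those options in from user code via the gdal_env parameter (I'm writing DEFAULT_GDAL_ENV and LayeredEnv.updated from memory, so treat the exact names/signature as an assumption and check the docs):

import stackstac

# Layer the GDAL retry options on top of stackstac's default GDAL environment.
retry_env = stackstac.DEFAULT_GDAL_ENV.updated(
    always=dict(GDAL_HTTP_MAX_RETRY=5, GDAL_HTTP_RETRY_DELAY=1)
)

stack = stackstac.stack(items, gdal_env=retry_env)  # `items` = your STAC items (placeholder)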
I think this was fixed by https://github.com/gjoseph92/stackstac/pull/95 and https://github.com/dask/distributed/issues/5552. If those versions aren't already in the production hub, they will be in the next couple weeks (they're in our staging hub, and the workflow that previously errored is now working. Will update in #12).
Frequently, large dask computations on large clusters seem to fail with "could not find dependent" errors:
Checking the logs (searching the master logs for the item ID, then tracking it down to a specific worker), the worker log looks like this:
The general situation is that I've stacked a large, multi-tile stack (25 time steps, 5 bands, x ~60000, y ~40000) -- so about 2.5 TB. The cluster sizes I've tried range from 45 to 300 cores.
Can you give me general guidance on this sort of error? I can't reproduce it reliably, so it's hard to provide a minimal example; any guidance about what is happening and how I could explore possible causes or solutions would be appreciated.