microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Computations fail with "could not find dependent" #11

Closed: MikeBeller closed this issue 2 years ago

MikeBeller commented 2 years ago

Frequently, large dask computations on large clusters seem to fail with "could not find dependent" errors:

/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1811                             exc = CancelledError(key)
   1812                         else:
-> 1813                             raise exception.with_traceback(traceback)
   1814                         raise exc
   1815                     if errors == "skip":

ValueError: Could not find dependent ('broadcast_to-concatenate-where-getitem-4916603adba736a1cf7a63ef1786c8d8', 8, 1, 0, 0).  Check worker logs

To check the logs, I searched the scheduler ("master") logs for the item ID and traced it to a specific worker. That worker's log looks like this:

distributed.worker - INFO - Start worker at: tls://10.244.47.9:36493
distributed.worker - INFO - Listening to: tls://10.244.47.9:36493
distributed.worker - INFO - dashboard at: 10.244.47.9:8787
distributed.worker - INFO - Waiting to connect to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 8.00 GiB
distributed.worker - INFO - Local Directory: /home/jovyan/dask-worker-space/worker-5bjfqpay
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tls://dask-258014f969ff4b288a395f5e84f2354b.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)" fetch>} for key ('getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0)
distributed.worker - INFO - Dependent not found: ('broadcast_to-concatenate-where-getitem-bceb982d462d09c679da971dd8c4d874', 8, 2, 0, 0) 1 . Asking scheduler
... many repeats
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration

The general situation: I've built a large, multi-tile stack (25 time steps, 5 bands, x ~ 60000, y ~ 40000) -- about 2.5 TB in total. The cluster sizes I've tried range from 45 to 300 cores.
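
(For context, the workload is roughly of this shape -- an illustrative sketch only, assuming sentinel-2-l2a via pystac-client, planetary-computer, and stackstac; the bbox, date range, and chunking are placeholders, not the actual code.)

```python
import planetary_computer
import pystac_client
import stackstac

# Illustrative sketch of the kind of stack described above (not the actual code).
catalog = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[27.0, -20.0, 30.0, -17.0],   # placeholder area spanning several MGRS tiles
    datetime="2020-01-01/2020-06-30",  # placeholder date range
)
items = [planetary_computer.sign(item) for item in search.get_items()]

stacked = stackstac.stack(
    items,
    assets=["B02", "B03", "B04", "B08", "SCL"],  # 5 bands
    resolution=10,
    chunksize=2048,
)  # -> (time, band, y, x), with y ~ 40000 and x ~ 60000 in the failing runs
```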

Can you give me some general guidance on this sort of error? I can't reproduce it reliably, so it's hard to provide a minimal example. Perhaps someone has guidance on what is happening and on how I could explore possible causes and solutions?

TomAugspurger commented 2 years ago

This is tricky... There are various issues in dask/distributed around this topic (https://github.com/dask/distributed/issues/5172 is the primary one).

Can you provide more information on the behavior of the cluster as this is computing? In particular:

  1. Is the size of your cluster fixed at some number of workers prior to your computation starting?
  2. Are any workers under memory pressure and possibly being paused / killed?
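
(One way to check both points before starting the computation -- a sketch assuming a dask-gateway cluster like the Hub's; the worker count and names are illustrative.)

```python
from dask_gateway import GatewayCluster

# Sketch: fix the cluster at a known size and wait until every worker has registered.
cluster = GatewayCluster()
cluster.scale(64)                 # fixed size, no adaptive scaling
client = cluster.get_client()
client.wait_for_workers(64)

# Rough per-worker memory use (in bytes), useful for spotting workers that are
# close to the pause/terminate thresholds while the computation runs.
memory = {
    addr: info["metrics"]["memory"]
    for addr, info in client.scheduler_info()["workers"].items()
}
```
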
MikeBeller commented 2 years ago

Thanks Tom. This typically happens when the total computation may exceed the memory of the cluster. My assumption was that dask.distributed would spill to disk -- but the issue you reference does say this happens in cases of high task latency, often due to spilling to disk. So you may be right on track.
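
(For reference, the spill/pause/kill behaviour is controlled by the worker-memory thresholds below -- a sketch that just writes out dask's documented defaults; to change them they must be in effect where the workers start, not only on the client.)

```python
import dask

# Sketch: dask's documented default worker-memory fractions, written out explicitly.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})
```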

Let me do some more digging and observing during my runs and see if I can get better answers to your questions. I'll post back when I have them, or better yet an isolated example.

MikeBeller commented 2 years ago

I just posted #12 -- I'm still getting missing-dependent errors intermittently. Per your questions, the example in #12 ran on a fixed-size cluster that had fully stabilized before the workload started. The code there is a repeatable example. If you wanted to run it yourself, I'm sure you'd see many things I am missing.

gjoseph92 commented 2 years ago

@MikeBeller what version of distributed is this? I hear unconfirmed rumors that this Could not find dependent issue may be fixed in 2021.11.2 and later. @TomAugspurger maybe you should update dask as well?

MikeBeller commented 2 years ago

> @MikeBeller what version of distributed is this? I hear unconfirmed rumors that this Could not find dependent issue may be fixed in 2021.11.2 and later. @TomAugspurger maybe you should update dask as well?

import dask.distributed
dask.distributed.__version__
'2021.08.1'
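
(A related check worth running -- a sketch, assuming `client` is already connected to the cluster: a version mismatch between client, scheduler, and workers is another common source of odd failures.)

```python
# Sketch: `client` is the dask.distributed Client connected to the cluster.
# check=True raises if client, scheduler, and workers disagree on key packages.
versions = client.get_versions(check=True)
print(versions["client"]["packages"]["distributed"])
print(versions["scheduler"]["packages"]["distributed"])
```
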
gjoseph92 commented 2 years ago
I ran this myself and found something like this in the worker logs:

```
distributed.worker - INFO - Can't find dependencies {} for key ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 8, 10)
distributed.worker - INFO - Dependent not found: ('asset-table-81cb4d5d94619ed532a9b3f16b4a132c', 72, 2) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 1 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12) 0 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7) 0 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 2 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10) 0 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12) 1 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7) 1 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 3 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10) 1 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12) 2 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7) 2 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 4 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10) 2 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12) 3 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7) 3 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11) 5 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10) 3 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12) 4 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 7, 11)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7) 4 . Asking scheduler
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10) 4 . Asking scheduler
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12) 5 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7) 5 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10)
distributed.worker - INFO - Dependent not found: ('where-getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10) 5 . Asking scheduler
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 9, 12)
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 5, 7)
distributed.worker - INFO - Can't find dependencies {} for key ('getitem-ec652e7693867e86f155aeeecdfc7dde', 2, 10, 10)
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 2550, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
```

The RuntimeError: Set changed size during iteration is reminiscent of https://github.com/dask/distributed/issues/4371. I don't think that was ever properly fixed, but the handle_missing_dep function in question was removed entirely in dask/distributed#5046, so upgrading to distributed >= 2021.10.0 might indeed eliminate this problem.

I also happened to notice these transient errors from rasterio while reading the data:

RuntimeError('Error opening \'https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/PU/2020/01/30/S2A_MSIL2A_20200130T080151_N0212_R035_T35KPU_20201002T151436.SAFE/GRANULE/L2A_T35KPU_A024056_20200130T080800/IMG_DATA/R10m/T35KPU_20200130T080151_B03_10m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D\': RasterioIOError("\'/vsicurl/https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/PU/2020/01/30/S2A_MSIL2A_20200130T080151_N0212_R035_T35KPU_20201002T151436.SAFE/GRANULE/L2A_T35KPU_A024056_20200130T080800/IMG_DATA/R10m/T35KPU_20200130T080151_B03_10m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D\' not recognized as a supported file format.")')

RuntimeError("Error reading Window(col_off=24576, row_off=12288, width=2048, height=2048) from 'https://sentinel2l2a01.blob.core.windows.net/sentinel2-l2/35/K/QT/2020/02/01/S2B_MSIL2A_20200201T075039_N0212_R135_T35KQT_20200930T092316.SAFE/GRANULE/L2A_T35KQT_A015176_20200201T080648/IMG_DATA/R20m/T35KQT_20200201T075039_SCL_20m.tif?st=2021-12-14T00%3A38%3A52Z&se=2021-12-22T00%3A38%3A52Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2021-12-14T23%3A38%3A52Z&ske=2021-12-16T00%3A38%3A52Z&sks=b&skv=2020-06-12&sig=TTTLkhN3fl5A4qx5gIK0MGBUbjJqi1gft6Xy6Yb0YKs%3D': RasterioIOError('Read or write failed. IReadBlock failed at X offset 48, Y offset 100: IReadBlock failed at X offset 5, Y offset 4: TIFFReadEncodedTile() failed.')")

So I'd probably add something like this to ignore them for now:

import stackstac
from rasterio.errors import RasterioIOError

stackstac.stack(
    ...,
    errors_as_nodata=(
        RasterioIOError("HTTP response code: 404"),
        RuntimeError("Error opening"),
        RuntimeError("Error reading Window"),
    ),
)

TomAugspurger commented 2 years ago

> I also happened to notice these transient errors from rasterio while reading the data:

Those are from rate limits on Azure Blob Storage (account-wide limits, not necessarily specific to your requests). If you enable CPL_CURL_VERBOSE=1 you might see logs like:

< HTTP/1.1 503 Egress is over the account limit.
< CPLE_None in VSICURL: Got response_code=503
< Content-Length: 212
< Content-Type: application/xml
< Server: Microsoft-HTTPAPI/2.0
< x-ms-request-id: 4ca9c883-301e-0045-72f7-f13f4d000000
< x-ms-error-code: ServerBusy
< Date: Wed, 15 Dec 2021 21:04:00 GMT
< 

So those would be good errors to retry after some backoff at some layer in the stack.
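
(For reference, one way to turn that logging on from Python for a single read -- a sketch; the option is very verbose, and `signed_href` is a placeholder for a signed asset URL like the ones above.)

```python
import rasterio

signed_href = "https://sentinel2l2a01.blob.core.windows.net/..."  # placeholder for a signed asset URL

# Enable GDAL's verbose curl logging only inside this block.
with rasterio.Env(CPL_CURL_VERBOSE=True):
    with rasterio.open(signed_href) as src:
        src.read(1, window=((0, 1024), (0, 1024)))
```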

gjoseph92 commented 2 years ago

Interesting that it's a 503 instead of a 429. Good to note for https://github.com/gjoseph92/stackstac/issues/18. Retries are definitely a top priority for robustness; I just don't know whether to do them at the dask level (which feels more appropriate, but https://github.com/dask/dask/issues/7036 is a blocker) or just within stackstac.
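
(In the meantime, per-task retries can be requested from the distributed scheduler directly -- a sketch; `stacked` stands in for the DataArray built with stackstac.stack, and note this retries any failing task, not only the flaky reads.)

```python
# Sketch: ask the distributed scheduler to retry failed tasks up to three times.
# `stacked` is assumed to be the xarray DataArray returned by stackstac.stack(...).
composite = stacked.median(dim="time").compute(retries=3)
```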

TomAugspurger commented 2 years ago

GDAL has a couple of config options, GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY (https://gdal.org/user/configoptions.html), that we can set. IMO that's probably best done at the application layer, but I'm not sure. I'll likely set them on the Planetary Computer Hub.
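
(A sketch of setting these from Python -- the values are arbitrary, and on a Dask cluster they need to be set in each worker's environment, e.g. via `client.run`, not only in the notebook.)

```python
import os

def enable_gdal_retries():
    # Have GDAL retry failed HTTP requests (e.g. the 503 ServerBusy above) a few times.
    os.environ["GDAL_HTTP_MAX_RETRY"] = "5"    # maximum number of retries
    os.environ["GDAL_HTTP_RETRY_DELAY"] = "2"  # delay in seconds between retries

enable_gdal_retries()            # in the notebook process
client.run(enable_gdal_retries)  # and on every worker (`client` assumed connected)
```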

gjoseph92 commented 2 years ago

Nice, I didn't know about these! That would be very easy to set in stackstac's LayeredEnv right now. Let's do that.

TomAugspurger commented 2 years ago

I think this was fixed by https://github.com/gjoseph92/stackstac/pull/95 and https://github.com/dask/distributed/issues/5552. If those versions aren't already in the production hub, they will be in the next couple of weeks (they're in our staging hub, and the workflow that previously errored is now working; I'll update in #12).