@cisaacstern I'm on vacation and travelling today, but I'll be back in the office on Tuesday and Wednesday if you would like to pair on log filtering. The at-scale recipe issues that we have encountered can be broadly categorized into 2 areas: graph serialization issues related to graph size, and `fsspec`/`hdf5` concurrency issues. We can begin by investigating these areas first.
@sharkinsspatial pasting and replying to your question in https://github.com/pangeo-forge/pangeo-forge.org/issues/63#issuecomment-1101629567 here:
> @cisaacstern Is the Prefect dashboard for this run publicly accessible? This is often helpful in diagnosing fsspec concurrency issues, as these deadlocks don't appear as failures but as tasks stuck in a pending state. The Prefect dashboard provides a mechanism for filtering on task state to easily identify these deadlocks.
The dashboard is not publicly accessible. Filtering the logs there by level `ERROR`, we get a few different traceback types:
- `fsspec.exceptions.FSTimeoutError`
- `distributed.comm.core.CommClosedError`
- `OSError: Timed out trying to connect`
My guess is that `fsspec.exceptions.FSTimeoutError` (which arose on the first `store_chunk` task) is the root problem, and that the others are symptoms of that? As for why that error occurred, I'm not clear on that yet. Certainly welcome any and all suggestions.
As shown in the details dropdown for `fsspec.exceptions.FSTimeoutError` above, this error was caused by

```
aiobotocore.response.AioReadTimeoutError: Read timeout on endpoint URL: "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr/analysis_error/0.0.0"
```
The same error also appears in the Prefect logs for another URL (which I had not previously pasted to this thread):

```
aiobotocore.response.AioReadTimeoutError: Read timeout on endpoint URL: "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr/analysed_sst/15.0.0"
```
As seen in the traceback pasted in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1101651101, this is triggered by `zarr_array[zarr_region] = data`:
```python
# pangeo_forge_recipes.recipes.xarray_zarr
target_mapper = config.storage_config.target.get_mapper()
# etc
zgroup = zarr.open_group(target_mapper)
for vname, var_coded in ds_chunk.variables.items():
    zarr_array = zgroup[vname]
    # etc
    zarr_array[zarr_region] = data
```
Based on the `aiobotocore` error, this appears to be related to https://stackoverflow.com/questions/53974790, in which @martindurant recommended decreasing `block_size` to mitigate the effect of slow network speeds. Notes on this:

- The `default_block_size` here is 5 MB, which seems modest to begin with.
- This I/O goes through our `FSSpecTarget`. And relatedly, because this I/O is mediated by `zarr`, I admit I'm a bit unclear as to what file system is actually handling it.
Martin, questions for you:

1. Is `FSSpecTarget`'s `S3FileSystem`, from which the `target_mapper` is assigned by the line `target_mapper = config.storage_config.target.get_mapper()`, handling I/O for the `zgroup` assigned as `zgroup = zarr.open_group(target_mapper)`?
2. Is decreasing `default_block_size` on this `S3FileSystem` a sensible next step? It would seem to me like a GCS cluster in `us-central1` should be able to read 15-28 MB files from the OSN S3 bucket (also in central, IIUC) in 5 MB blocks without hitting the default timeout of 5 minutes?
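For concreteness, here is a minimal sketch of what decreasing the block size could look like with `s3fs` directly; the endpoint URL is taken from the tracebacks above, while `anon=True` and the 1 MB value are illustrative assumptions (this is not the `FSSpecTarget` code path):

```python
# Hypothetical sketch: an s3fs filesystem with a smaller default_block_size,
# per the StackOverflow suggestion above. All values here are assumptions.
import s3fs

fs = s3fs.S3FileSystem(
    anon=True,  # assumption: public read access to the OSN bucket
    default_block_size=1 * 2**20,  # 1 MB, down from the 5 MB default
    client_kwargs={"endpoint_url": "https://ncsa.osn.xsede.org"},
)
```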
This feels like an OSN error. Perhaps there was a glitch on their end. Two thoughts:

> Can you try just running the recipe again? (If so, would it reuse the same cache location so it doesn't have to re-download all the files?)
Yes, the cache would be re-used. I'm not optimistic that a re-run will get us past this, given that this recipe was actually already run once before, and it encountered the same error the first time. The first run was the one linked in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/1#issue-1205106780. As described in https://github.com/pangeo-forge/pangeo-forge.org/issues/63, the pangeo-forge.org logs are not helpful there, but digging into the Prefect backend reveals the same `store_chunk` ➡️ `zarr_array[zarr_region] = data` ➡️ `fsspec.exceptions.FSTimeoutError` that we're discussing here.
> Are we using retries on these tasks?
I'm actually unclear on this. My assumption would be no, we are not using retries. I base this on the fact that I don't see evidence of them in the logs for the runs under consideration. And, even more convincingly, that the `xarray_zarr_recipe_compiler` does not pass any `annotations`, which is how retries are defined by the Prefect compiler.
That being said, in a separate run, I did observe what looked like retries of `prepare_target`, described in https://github.com/pangeo-forge/staged-recipes/pull/127#issuecomment-1076893854. I'm not sure where those retries were configured, but they don't seem to have been employed for the failing `store_chunk` tasks we're currently discussing.
Seems like a good next step would be supporting configuration of retries for the `xarray_zarr_recipe_compiler`? The easiest way to make this work for the registrar would be:

1. Adding retry support to `pangeo-forge-recipes`
2. Cutting a `pangeo-forge-recipes` `0.8.3` release (i.e. `pangeo-forge-recipes==0.8.3`)
3. After the `0.8.3` release, bumping `pangeo_forge_version` in this feedstock's `meta.yaml` to `0.8.3`; pushing this change to `main` on this feedstock will trigger a new recipe run

> Let me know if you think it would make sense to reach out to the OSN pod maintainers to do further debugging.
Thanks, let's keep that in mind as a next recourse if retries don't help. Is it possible there are somehow... concurrency limits on the bucket?
@cisaacstern Previously I had been approaching retry of failed tasks at the Prefect `task` level during registration after compilation, but it seems as if this configuration has been shifted directly to `pangeo-forge-recipes` via annotations.

Where do we feel this configuration logic should live and how should this stage annotation be created and managed for different compilation strategies?
Additionally, when using Prefect as the compilation target we may want to also consider including `retry_delay` for the task args to avoid overrunning S3 requests. https://docs.prefect.io/api/latest/core/task.html#task-2
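To make the suggestion concrete, a minimal sketch of the Prefect (1.x) task kwargs in question; the specific values and the task body are illustrative assumptions:

```python
# Sketch of Prefect 1.x retry configuration; values here are assumptions.
from datetime import timedelta

from prefect import task

@task(max_retries=3, retry_delay=timedelta(seconds=30))
def store_chunk(chunk_key):
    ...  # write one chunk to the target store
```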
@sharkinsspatial, thanks so much for chiming in. It's very useful to know that you'd been using retries (with delays!) successfully in your prior work.
> Where do we feel this configuration logic should live and how should this stage annotation be created and managed for different compilation strategies?
This is a very good question which I imagine will require a slightly longer iteration cycle for us to get right.
To begin, I'm going to try just patching these retry + delay kwargs into a `pangeo-forge-recipes` branch, and then re-running this recipe with that branch, just so we can see if this indeed resolves our problem. I'll report back on this thread with results. If this works, we can circle back and consider the best way to specify this config.
After a series of steps reflected in the linked PRs above, I was able to re-deploy this feedstock with `pangeo-forge-recipes` installed from a branch with this hardcoded Prefect retry logic, referenced from Sean's links in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1102931123.
We are still failing (and the linked recipe run page logs are still unhelpful without filtering), but perhaps this could be construed as progress, in that I'm not seeing the `fsspec.exceptions.FSTimeoutError` in the Prefect logs. The main error we're now hitting appears to be a new flavor of the `distributed.comm.core.CommClosedError` reported in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1101651101:

```
ConnectionResetError: [Errno 104] Connection reset by peer
```
Just to note, this recipe does generate 7134 `store_chunk` tasks:

```python
len(list(recipe.iter_chunks()))  # --> 7134
```
@sharkinsspatial you mentioned "overrunning S3 requests" above. Does this many concurrent writes seem like it might cause that problem? If so, is our best solution limiting concurrency?
@cisaacstern Given the variable partition prefixes where you are writing data and the volume of concurrent requests (related to the number of workers in your cluster), I don't believe you should be experiencing any S3 request throttling.

Seeing only `ConnectionResetError` in your logs would lead me to believe that the worker running `store_chunk[40]` was killed. I'm assuming that the filter you are using with Grafana includes both the scheduler and worker logs, if you are using this approach.

My initial hunch/guess is that your worker pod is being evicted/restarted by k8s due to OOM (see https://distributed.dask.org/en/stable/killed.html). We can use `kubectl` or Lens to view the cluster health information in more detail and see if this is indeed the case.
Thanks to Sean for a helpful offline sync. Recapping our discussion:

- `e2-standard-4` instances appear to allow a maximum of 16 GB memory (not the 4 GB we'd previously thought)

I'll bump up scheduler memory now, and then re-deploy the recipe, and we'll see what happens.
I noticed that the recipe is using locking, and that is where the timeout happens. That doesn't make sense to me, given the recipe itself - https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/blob/main/feedstock/recipe.py. No locks should be required.
IIUC, we use locks whenever we're in a distributed context, which... we are on Prefect?
But there should not be any actual conflicts for this type of recipe.
Ah I think I understand! It's because we specify `{'time': 2}` in `target_chunks`.

But only 1 item per file.

We should add `inputs_per_chunk=2` to the recipe constructor.
> We should add `inputs_per_chunk=2` to the recipe constructor.
I'll do this now, before increasing scheduler memory, to see if that resolves the problem.
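For reference, a minimal sketch of the change, assuming this feedstock's `XarrayZarrRecipe` setup (`pattern` stands in for the feedstock's `FilePattern`; treat the exact constructor call as an assumption):

```python
# Sketch only: adding inputs_per_chunk=2 alongside the existing
# {'time': 2} target chunks, per the suggestion above.
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(
    pattern,                    # this feedstock's FilePattern (assumed)
    inputs_per_chunk=2,         # two single-timestep inputs per chunk...
    target_chunks={"time": 2},  # ...matching the target chunking
)
```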
To be clear, the locking should also work. But this might unblock us for now.
Merging #4 triggered a new deployment linked to recipe-run/36, which is currently in progress.
Adding `inputs_per_chunk=2` has cut the number of `store_chunk` tasks in half:

```python
len(list(recipe.iter_chunks()))  # --> now returns 3567
```

whereas it was previously 7134. This may be in our favor.
I was optimistic about this last deployment, because we made it through around 2000 of the ~3500 `store_chunk` tasks without failure. Then the run failed 😐. Here is as much of the logs as I'm able to pull from Grafana (a 1000 line limit is being imposed somewhere, which I don't understand how to increase):

https://gist.github.com/cisaacstern/7c3245720efc4da8e2b3cafdaaeca57a
On `store_chunk[1852]` we see a `distributed.worker - WARNING - Heartbeat to scheduler failed`. I am still learning how to conclusively confirm that the scheduler died due to OOM (there appear to be ways to do this at the k8s level that I haven't yet fully grokked). In the absence of conclusive evidence, this warning does appear to suggest the type of scheduler memory issues mentioned in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1104134323.
While throwing memory at the scheduler is not a great long-term solution, in the absence of an upstream fix in Prefect, I'm curious what increasing scheduler memory may do here. As an experiment, I will try re-running this with increased scheduler memory, which we can always dial back.
Conceivably, the reduced task graph size resulting from `inputs_per_chunk` allowed us to get further than we had in the past, but it's possible we were still under-provisioned on memory. This upcoming experiment should be illuminating.
As recorded in the growing series of failed deployments, this problem was not solved by increasing scheduler memory alone.
Most recently:

```
OSError: Timed out trying to connect
```

Assuming `tcp://dask-jovyan-dd7e32e7-9.pangeo-forge-columbia-staging-bakery:8786` is the scheduler (there is definitely a way to confirm this with the k8s tooling that I haven't yet grasped), then this is looking very convincingly like the previously referenced graph size / memory leak problem being tracked in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/208.
So I have a workaround idea which is not too hacky. Right now we're generating ~3000 `store_chunk` tasks which create ~100 MB chunks. If we double the chunk size to 200 MB, we can halve the number of `store_chunk` tasks, and get a smaller graph which can possibly execute without failing. I'm going to try that now.
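Sketched under the same assumed names as the snippet above, the workaround might look like this; the jump to 4 time steps per chunk matches the commit referenced below, but the exact kwargs are assumptions:

```python
# Sketch only: doubling the time chunk from 2 to 4 to roughly halve the
# number of store_chunk tasks and produce ~200 MB chunks.
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(
    pattern,                    # this feedstock's FilePattern (assumed)
    inputs_per_chunk=4,
    target_chunks={"time": 4},
)
```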
Reducing the number of `store_chunk` tasks to ~1500 appears to have resulted in a successful deployment.
@rwegener2, does this dataset look as you'd expect it to?

```python
import fsspec
import xarray as xr

dataset_public_url = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr'
mapper = fsspec.get_mapper(dataset_public_url)
ds = xr.open_zarr(mapper, consolidated=True)
ds
```
On a technical note, I'm pretty confused as to why the chunks in this dataset are only of length `2` in the time dimension, when the latest commit (https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/commit/32ba8c8f6a639975a1061ece699ac2f053cb8d02) clearly specifies `4` time steps per chunk. And this commit is the Head SHA specified in the recipe run for this deployment.
Amazing! Great progress Charles!
> On a technical note, I'm pretty confused as to why the chunks in this dataset are only of length `2` in the time dimension:
This certainly seems inconsistent. Are you sure you weren't pointing it at an older Zarr store that already had chunks initialized?
For debugging purposes, it can be useful to open the store with Zarr directly (rather than Xarray).
> Are you sure you weren't pointing it at an older Zarr store that already had chunks initialized?
Oh good call, this is a distinct possibility.
> For debugging purposes, it can be useful to open the store with Zarr directly (rather than Xarray).
Interesting. What does zarr tell us that xarray does not?
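For instance, a sketch of a direct Zarr inspection, assuming the `mapper` from the snippet above is reused and using `analysed_sst` (seen in the tracebacks earlier in this thread) as a representative variable:

```python
# Sketch: open the store with Zarr directly, bypassing Xarray's decoding,
# to see the chunk shapes actually written to storage.
import zarr

group = zarr.open_consolidated(mapper)  # mapper from the snippet above
print(group["analysed_sst"].chunks)     # on-disk chunk shape
```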
I'm going to close this because I believe we've established that the diagnosis is the above-linked Prefect / Dask graph size issue.
Wow, @cisaacstern, this is great!
The output data looks normal to me. I agree that the chunking is not what I expected, but I'm fine with it the way it is.
Thanks so much for all your work on this!
Our first deployment of this feedstock failed. The fact that the recipe succeeded in pruned form, and that we know the deployment logic is sound, suggests that the problem is one of scale (something like hitting concurrency limits, etc.).
The recipe run page for this failed run is linked via the `Deployed` text on the deployments page linked above. The logs there are not especially informative (I will raise a separate issue about filtering on the website repo). I will do some digging on the Prefect/Grafana backend to surface some more informative logs and report back here. @sharkinsspatial, I may lean on your experience running recipes at scale, which I expect may help us get to the bottom of this.
cc @rwegener2 @rabernat