pangeo-forge / gpcp-feedstock

A Pangeo Forge Feedstock for gpcp.
Apache License 2.0

First prod run fail - likely scheduler memory issue #2

Closed: cisaacstern closed this issue 2 years ago

cisaacstern commented 2 years ago

@rabernat, our first production run for this feedstock failed 😞 .

The logs here are not useful without a resolution for https://github.com/pangeo-forge/pangeo-forge.org/issues/63 (and with big changes in motion for the backend, this issue is probably not worth addressing at the moment).

A closer look at the logs on the Prefect backend, as well as on our Loki service, reveals tracebacks like this:

```
Unexpected error: OSError('Timed out trying to connect to tcp://dask-jovyan-aa3c19e1-7.pangeo-forge-columbia-staging-bakery:8786 after 30 s')
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/tcp.py", line 398, in connect
    stream = await self.client.connect(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/core.py", line 284, in connect
    comm = await asyncio.wait_for(
  File "/srv/conda/envs/notebook/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/engine/flow_runner.py", line 661, in get_flow_run_state
    assert isinstance(final_states, dict)
  File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/prefect/executors/dask.py", line 237, in start
    self._post_start_yield()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/deploy/spec.py", line 445, in __exit__
    super().__exit__(typ, value, traceback)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/deploy/cluster.py", line 468, in __exit__
    return self.sync(self.__aexit__, typ, value, traceback)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/deploy/cluster.py", line 258, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 332, in sync
    raise exc.with_traceback(tb)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/utils.py", line 315, in f
    result[0] = yield future
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/deploy/cluster.py", line 477, in __aexit__
    await f
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/deploy/spec.py", line 417, in _close
    await self._correct_state()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/deploy/spec.py", line 332, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 817, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/core.py", line 774, in live_comm
    comm = await connect(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/distributed/comm/core.py", line 308, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://dask-jovyan-aa3c19e1-7.pangeo-forge-columbia-staging-bakery:8786 after 30 s
```

These connection timeout errors look a lot like a Dask scheduler that was killed (most likely out of memory) by too many store_chunk tasks. That hypothesis is supported by the fact that this recipe defines 9226 store_chunk tasks:

[Screenshot, 2022-07-13: task summary showing the 9226 store_chunk tasks]

In the case of https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1108812665, the store_chunk task count needed to be reduced to roughly 1500 before the production deployment would succeed.
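If we wanted to try the same mitigation here, one knob in the legacy pangeo-forge-recipes API is inputs_per_chunk, which groups several source files into each store_chunk task. The following is only a minimal sketch, not this feedstock's actual recipe.py: the URL template, date range, and chunk size are placeholder assumptions.

```python
# Hypothetical sketch, assuming the legacy pangeo-forge-recipes XarrayZarrRecipe
# API and one daily source file per input. Not this feedstock's actual recipe.py.
import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Placeholder date range; the real recipe covers the GPCP daily record.
dates = pd.date_range("1996-10-01", "2021-12-31", freq="D")


def make_url(time):
    # Placeholder URL template; the real GPCP paths differ.
    return f"https://example.com/gpcp/gpcp_v01r03_daily_d{time:%Y%m%d}.nc"


pattern = FilePattern(make_url, ConcatDim("time", dates, nitems_per_file=1))

# With the default inputs_per_chunk=1, every input file becomes its own
# store_chunk task (~9,200 of them here). Grouping inputs reduces the task
# count: 9226 / 7 ≈ 1,318 store_chunk tasks.
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=7)
```

The trade-off is that each store_chunk task opens and writes more data at once, which costs a bit more worker memory per task, but the total number of tasks the scheduler has to track drops proportionally.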

rabernat commented 2 years ago

😢

I was so excited about this.

rabernat commented 2 years ago

FWIW, these are tiny files.

cisaacstern commented 2 years ago

Here's an interesting idea: I just merged https://github.com/pangeo-forge/registrar/pull/48, so we could try running this on Dataflow. 🚀

Here's how Dataflow routing works right now (sketched in pseudocode below):

  1. Everything goes to Prefect by default
  2. For recipe tests triggered from PRs, those labeled dev are routed to Dataflow
  3. For production runs, tags that include the substring beta are routed to Dataflow

So if I make a tag with the substring beta in it, we should get a deployment to Dataflow 🤔
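For concreteness, here's how I think of that routing as pseudocode. This is a hypothetical sketch of the rules above, not the actual registrar code, and the event fields (type, pr_labels, tag) are made-up names.

```python
# Hypothetical pseudocode for the routing rules above; not the actual
# pangeo-forge/registrar implementation. Event fields are assumptions.
def select_bakery(event: dict) -> str:
    """Return which backend a run is dispatched to: 'prefect' or 'dataflow'."""
    if event["type"] == "recipe_test":
        # Rule 2: recipe-test runs from PRs labeled `dev` go to Dataflow.
        if "dev" in event.get("pr_labels", []):
            return "dataflow"
    elif event["type"] == "production_run":
        # Rule 3: production runs whose tag contains the substring `beta` go to Dataflow.
        if "beta" in event.get("tag", ""):
            return "dataflow"
    # Rule 1: everything else goes to Prefect by default.
    return "prefect"
```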

cisaacstern commented 2 years ago

Hmm, I just made a release, but it looks like that pathway no longer works following the migration from tag events to push events as the trigger for production runs. I'm going to think for a moment about what a lightweight way of testing this on Dataflow might look like.

rabernat commented 2 years ago

That sounds promising. Let me know how I can help.

cisaacstern commented 2 years ago

Superseded by #4.