oceanmodeling / ondemand-storm-workflow

Prefect 2 agent fails "randomly" when run for a "long time" (at least once a day!) #14

Closed SorooshMani-NOAA closed 8 months ago

SorooshMani-NOAA commented 1 year ago

The following error is seen:

line 1682, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/httpx/_client.py", line 1719, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/httpx/_transports/default.py", line 352, in handle_async_request
    with map_httpcore_exceptions():
  File "/opt/conda/envs/odssm/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.LocalProtocolError: Invalid input ConnectionInputs.SEND_HEADERS in state ConnectionState.CLOSED

Traceback (most recent call last):
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/prefect/cli/_utilities.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 260, in coroutine_wrapper
    return call()
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 245, in __call__
    return self.result()
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 173, in result
    return self.future.result(timeout=timeout)
  File "/opt/conda/envs/odssm/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/opt/conda/envs/odssm/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/prefect/_internal/concurrency/calls.py", line 218, in _run_async
    result = await coro
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/prefect/cli/agent.py", line 189, in start
    async with anyio.create_task_group() as tg:
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/opt/conda/envs/odssm/lib/python3.10/site-packages/prefect/utilities/services.py", line 104, in critical_service_loop
    raise RuntimeError("Service exceeded error threshold.")
RuntimeError: Service exceeded error threshold.

I suspect this is caused by Prefect 2 Cloud availability problems (?)
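
For readers unfamiliar with the final `RuntimeError`, here is a simplified illustration of how a service loop with a consecutive-error threshold behaves. This is not Prefect's actual `critical_service_loop`; the function name `service_loop` and the parameters `max_runs` and `consecutive` are assumptions made for the sketch.

```python
from collections import deque


def service_loop(workload, max_runs, consecutive=3):
    """Call `workload` up to `max_runs` times.

    Gives up with a RuntimeError once `consecutive` runs in a row have failed.
    Illustrative only: this is not Prefect's implementation.
    """
    # Sliding window of the most recent outcomes (True = success, False = failure).
    recent = deque(maxlen=consecutive)
    for _ in range(max_runs):
        try:
            workload()
            recent.append(True)
        except Exception:
            recent.append(False)
        # Abort only when the window is full and every entry is a failure.
        if len(recent) == consecutive and not any(recent):
            raise RuntimeError("Service exceeded error threshold.")
```

Under this model, a single failed request is tolerated, but a run of back-to-back failures (for example, repeated sends on a connection that is already closed) stops the whole service.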

SorooshMani-NOAA commented 1 year ago

This is potentially also the reason why the waiter for the SCHISM run fails to stop the flow!

SorooshMani-NOAA commented 1 year ago

This might be due to errors that happen when flows are "Deleted" or "Cancelled" but stay in "Cancelling" status. There's also now a response on https://discourse.prefect.io/t/prefect-agent-on-docker-error/2905/2 which states:

Hi,

I had a similar problem, I think. I asked on Slack, and Prefect folks advised me to do this:

Hi Gosia, try adding PREFECT_API_ENABLE_HTTP2=False to your env variables for the agent

This solved my problem.
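
A minimal sketch of applying that advice when the agent is launched from Python. It assumes the agent is started with `prefect agent start` and a work queue named `default`; the queue name and the helper names `agent_env` and `start_agent` are illustrative and not taken from this repository.

```python
import os
import subprocess

# Hypothetical launch command; the work queue name "default" is an assumption.
AGENT_CMD = ["prefect", "agent", "start", "-q", "default"]


def agent_env(base_env=None):
    """Return a copy of the environment with HTTP/2 disabled for the Prefect API client."""
    env = dict(os.environ if base_env is None else base_env)
    # Setting suggested in the Discourse reply quoted above.
    env["PREFECT_API_ENABLE_HTTP2"] = "False"
    return env


def start_agent():
    """Start the agent as a child process with the modified environment."""
    return subprocess.run(AGENT_CMD, env=agent_env(), check=True)
```

Calling `start_agent()` blocks until the agent exits. Exporting the same variable in the shell or in the container definition before running the agent should have the same effect.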

SorooshMani-NOAA commented 8 months ago

For now, we have abandoned the Prefect workflow.