modal-labs / modal-client

Python client library for Modal
https://modal.com/docs
Apache License 2.0
242 stars 31 forks source link

Local client apps don't terminate properly on Windows #1227

Open mrticker opened 5 months ago

mrticker commented 5 months ago

Setup: Windows 10, running python modal_app.py

After stopping an app through the web interface, the local command line application does not terminate. It keeps displaying messages like this:

2024-01-23T16:31:29+0100 Loop attempt failed for _run_stub.<locals>.<lambda> (time_elapsed=0.253079891204834)
Traceback (most recent call last):
  File "C:\usr\Miniconda3\lib\site-packages\modal_utils\async_utils.py", line 178, in loop_coro
    await asyncio.wait_for(async_f(), timeout=timeout)
  File "C:\usr\Miniconda3\lib\asyncio\tasks.py", line 479, in wait_for
    return fut.result()
  File "C:\usr\Miniconda3\lib\site-packages\modal\runner.py", line 33, in _heartbeat
    await retry_transient_errors(client.stub.AppHeartbeat, request, attempt_timeout=HEARTBEAT_TIMEOUT)
  File "C:\usr\Miniconda3\lib\site-packages\modal_utils\grpc_utils.py", line 258, in retry_transient_errors
    raise exc
  File "C:\usr\Miniconda3\lib\site-packages\modal_utils\grpc_utils.py", line 255, in retry_transient_errors
    return await fn(*args, metadata=metadata, timeout=timeout)
  File "C:\usr\Miniconda3\lib\site-packages\grpclib\client.py", line 882, in __call__
    reply = await stream.recv_message()
  File "C:\usr\Miniconda3\lib\site-packages\grpclib\client.py", line 425, in recv_message
    await self.recv_initial_metadata()
  File "C:\usr\Miniconda3\lib\site-packages\grpclib\client.py", line 393, in recv_initial_metadata
    self._raise_for_grpc_status(status, message, details)
  File "C:\usr\Miniconda3\lib\site-packages\grpclib\client.py", line 345, in _raise_for_grpc_status
    raise GRPCError(status, message, details)
grpclib.exceptions.GRPCError: (<Status.FAILED_PRECONDITION: 9>, 'App state is APP_STATE_STOPPED', No/ Running (1 containers finished)...
irfansharif commented 5 months ago

Hey! What does modal --version and python -V show for you?

mrticker commented 5 months ago
>modal --version
modal client version: 0.56.4649

>python -V
Python 3.9.15

updated modal through pip:

>modal --version
modal client version: 0.56.4763

The issue is still there.

thundergolfer commented 4 months ago

@freider are you able to check if this is a still a problem on Windows with the current modal client version?

freider commented 4 months ago

Checked - this is actually a problem on non-Windows versions too, specifically when terminating a running app from another context than the running CLI.

The root cause seems to be that the main function invocation will sit in a long-polling FunctionGetOutputs and wait for a result that will never come in since all tasks for the app have been killed, and in the meanwhile the app heartbeat will start to repeatedly fail since it gets an error once the app is shut down.

There are a few things I'd like to do to fix this "properly":

  1. When an app shuts down and terminates workers, it should probably publish failure results on all ongoing function calls, so that any stalling FunctionGetOutputs gets that back
  2. When run_stub gets an app_done marker in the logs (or any other "shutdown" trigger), it should immediately trigger an asyncio termination event that can be easily listened to by other parts of the system that are doing longpolls or heartbeat loops, and shut those down gracefully. This could replace the janky _set_ignore_cancellation we've had on the Function object.
  3. (Maybe) When an error about the app stopping is received, that could be swallowed and let the context shut down (and possibly set the termination event above). Hopefully this shouldn't actually need to happen, since everything should have been notified already of the app shutting down.
freider commented 4 months ago

Fix using method 2 above in: https://github.com/modal-labs/modal-client/pull/1452