Open andylizf opened 2 weeks ago
@cblmemo Could you check this issue when you get a chance? Thanks!
Thanks @andylizf! Those are valuable observations. Could you help submit a PR for this?
For solution 3, we might not want to use a broad exception if possible. Can we catch only the error raised by the `.terminate()` call?
Description
There are multiple concurrency issues in `sky/serve/replica_managers.py` that can lead to inconsistent replica states and unexpected exceptions. The specific problems are outlined below, with the following code snippets as evidence:
https://github.com/skypilot-org/skypilot/blob/fdd68b209ee74f9282fac5c6834907d5fe72d255/sky/serve/replica_managers.py#L688-L695
https://github.com/skypilot-org/skypilot/blob/fdd68b209ee74f9282fac5c6834907d5fe72d255/sky/serve/replica_managers.py#L832-L873
1. Interrupted Status Overwritten
The `_terminate_replica` method sets the `sky_launch_status` to `INTERRUPTED`. However, if `_refresh_process_pool` runs concurrently, it may overwrite this status before the process is properly terminated, leading to incorrect status reporting.
2. `KeyError` When Deleting Process Handle Twice
After `_terminate_replica` deletes the replica ID from `_launch_process_pool`, `_refresh_process_pool` may attempt to delete the same replica ID again, resulting in a `KeyError` since the key no longer exists.
3. Exception When Terminating an Already Completed Process
The `_terminate_replica` method checks if `launch_process.is_alive()` before attempting to terminate the process. However, between the `is_alive()` check and the `terminate()` call, the process might complete, causing `terminate()` to fail and potentially raise an exception.
Proposed Solution
- Implement Locking Mechanism: Use a lock (for example, a `@with_lock` decorator) to synchronize access to shared state in methods like `_terminate_replica`.
- Safe Deletion: Use `pop` with default values when deleting from dictionaries to avoid `KeyError`.
- Add Exception Handling: Wrap `terminate` and `join` calls in try-except blocks to handle cases where the process may have already completed.
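To make the three proposals concrete, here is a minimal sketch of how they could fit together. It is not the actual SkyPilot code: `ReplicaPoolSketch`, the body of `with_lock`, the `status` dictionary, and the status strings other than `INTERRUPTED` are all hypothetical stand-ins. Following the review comment above, the `except` clause is kept narrow. It catches only `ValueError`, which `multiprocessing.Process` raises once the process object has been closed. Whether that is the only failure mode that matters here is an assumption.

```python
import functools
import threading
from typing import Any, Callable, Dict


def with_lock(method: Callable[..., Any]) -> Callable[..., Any]:
    """Run the decorated method while holding self._lock (hypothetical helper)."""

    @functools.wraps(method)
    def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        with self._lock:
            return method(self, *args, **kwargs)

    return wrapper


class ReplicaPoolSketch:
    """Simplified stand-in for the replica manager, not the SkyPilot class."""

    def __init__(self) -> None:
        # RLock, so a locked method may safely call another locked method.
        self._lock = threading.RLock()
        self._launch_process_pool: Dict[int, Any] = {}
        self.status: Dict[int, str] = {}

    @with_lock
    def add(self, replica_id: int, process: Any) -> None:
        self._launch_process_pool[replica_id] = process
        self.status[replica_id] = 'LAUNCHING'

    @with_lock
    def terminate_replica(self, replica_id: int) -> None:
        # pop() with a default never raises KeyError, even on a second call.
        process = self._launch_process_pool.pop(replica_id, None)
        if process is None:
            return
        self.status[replica_id] = 'INTERRUPTED'
        try:
            # No is_alive() check first: that check can go stale before
            # terminate() runs, so we just call it and handle the failure.
            process.terminate()
            process.join()
        except ValueError:
            # Assumption: the only expected failure is a closed process object.
            pass

    @with_lock
    def refresh_process_pool(self) -> None:
        # Iterate over a copy because entries are removed inside the loop.
        for replica_id, process in list(self._launch_process_pool.items()):
            if process.is_alive():
                continue
            self._launch_process_pool.pop(replica_id, None)
            # Never clobber a status that terminate_replica already set.
            if self.status.get(replica_id) != 'INTERRUPTED':
                self.status[replica_id] = 'SUCCEEDED'
```

Because both methods hold the same lock, `refresh_process_pool` can no longer observe a replica halfway through termination. The `pop(..., None)` calls and the status guard are defensive extras that keep each method safe even if a caller bypasses the lock. The sketch accepts any object with `is_alive()`, `terminate()`, and `join()`, so it can be exercised with a `multiprocessing.Process` or with a small fake in tests.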