ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
31.97k stars, 5.45k forks

[Serve] [Core] Terminating a Serve deployment that sets `object_store_memory` logs native errors #45780

Open shrekris-anyscale opened 4 weeks ago

shrekris-anyscale commented 4 weeks ago

What happened + What you expected to happen

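Terminating a Serve deployment that sets `object_store_memory` in `ray_actor_options` makes the raylet log a native check failure (`Check failed: (_left_ >= _right_)`) followed by a C++ stack trace. I expected the deployment to shut down without native errors. See the reproduction script below for an example.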

Versions / Dependencies

Ray on the latest master.

Reproduction script

# Filename: repro.py
import ray
from ray import serve

ray.init()

@serve.deployment(ray_actor_options={"object_store_memory": 1024})
def f():
    return "hello"

h = serve.run(f.bind())
assert h.remote().result() == "hello"

serve.shutdown()
ray.shutdown()

Output:

% python repro.py 
2024-06-06 15:26:09,076 INFO worker.py:1761 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(ProxyActor pid=82609) INFO 2024-06-06 15:26:11,168 proxy 10.103.209.172 proxy.py:1165 - Proxy starting on node 04f7f93cf44cbde8c1aa18d24898ee32dde17da40b6124c08ede9b0d (HTTP port: 8000).
2024-06-06 15:26:11,257 INFO handle.py:126 -- Created DeploymentHandle 'r64girlh' for Deployment(name='f', app='default').
2024-06-06 15:26:11,257 INFO handle.py:126 -- Created DeploymentHandle '679oq3kv' for Deployment(name='f', app='default').
(ServeController pid=82599) INFO 2024-06-06 15:26:11,344 controller 82599 deployment_state.py:1598 - Deploying new version of Deployment(name='f', app='default') (initial target replicas: 1).
(ServeController pid=82599) INFO 2024-06-06 15:26:11,446 controller 82599 deployment_state.py:1844 - Adding 1 replica to Deployment(name='f', app='default').
2024-06-06 15:26:12,265 INFO handle.py:126 -- Created DeploymentHandle 'd0ywet2o' for Deployment(name='f', app='default').
2024-06-06 15:26:12,265 INFO api.py:584 -- Deployed app 'default' successfully.
2024-06-06 15:26:12,270 INFO pow_2_scheduler.py:260 -- Got updated replicas for Deployment(name='f', app='default'): {'so674uw0'}.
(ServeReplica:default:f pid=82621) INFO 2024-06-06 15:26:12,275 default_f so674uw0 95ba701e-d5e5-45be-84f8-74717de37804 replica.py:373 - __CALL__ OK 1.3ms
2024-06-06 15:26:12,382 INFO pow_2_scheduler.py:260 -- Got updated replicas for Deployment(name='f', app='default'): set().
(ServeController pid=82599) INFO 2024-06-06 15:26:12,380 controller 82599 deployment_state.py:1860 - Removing 1 replica from Deployment(name='f', app='default').
(ServeController pid=82599) INFO 2024-06-06 15:26:14,410 controller 82599 deployment_state.py:2182 - Replica(id='so674uw0', deployment='f', app='default') is stopped.
(raylet) [2024-06-06 15:26:14,424 C 82586 97027363] (raylet) local_resource_manager.cc:109:  Check failed: (_left_ >= _right_)  21474836480000 vs 21474846720000
(raylet) *** StackTrace Information ***
(raylet) 0   raylet                              0x0000000101279f7c _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84 ray::operator<<()
(raylet) 1   raylet                              0x000000010127caf4 _ZN3ray6RayLogD2Ev + 84 ray::RayLog::~RayLog()
(raylet) 2   raylet                              0x0000000100a4ab88 _ZN3ray20LocalResourceManager25FreeTaskResourceInstancesENSt3__110shared_ptrINS_21TaskResourceInstancesEEEb + 800 ray::LocalResourceManager::FreeTaskResourceInstances()
(raylet) 3   raylet                              0x0000000100a4b7a0 _ZN3ray20LocalResourceManager22ReleaseWorkerResourcesENSt3__110shared_ptrINS_21TaskResourceInstancesEEE + 76 ray::LocalResourceManager::ReleaseWorkerResources()
(raylet) 4   raylet                              0x000000010082ba6c _ZN3ray6raylet16LocalTaskManager22ReleaseWorkerResourcesENSt3__110shared_ptrINS0_15WorkerInterfaceEEE + 1148 ray::raylet::LocalTaskManager::ReleaseWorkerResources()
(raylet) 5   raylet                              0x000000010083dcec _ZN3ray6raylet11NodeManager16DisconnectClientERKNSt3__110shared_ptrINS_16ClientConnectionEEENS_3rpc14WorkerExitTypeERKNS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEEPKNS8_12RayExceptionE + 6280 ray::raylet::NodeManager::DisconnectClient()
(raylet) 6   raylet                              0x0000000100847b30 _ZN3ray6raylet11NodeManager30ProcessDisconnectClientMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEEPKh + 432 ray::raylet::NodeManager::ProcessDisconnectClientMessage()
(raylet) 7   raylet                              0x0000000100846120 _ZN3ray6raylet11NodeManager20ProcessClientMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEExPKh + 940 ray::raylet::NodeManager::ProcessClientMessage()
(raylet) 8   raylet                              0x00000001008f8b88 _ZNSt3__110__function6__funcIZN3ray6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEE3$_2NS_9allocatorISA_EEFvNS_10shared_ptrINS2_16ClientConnectionEEExRKNS_6vectorIhNSB_IhEEEEEEclEOSF_OxSK_ + 52 std::__1::__function::__func<>::operator()()
(raylet) 9   raylet                              0x0000000100c474f4 _ZN3ray16ClientConnection14ProcessMessageERKN5boost6system10error_codeE + 924 ray::ClientConnection::ProcessMessage()
(raylet) 10  raylet                              0x0000000100c5d580 _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 228 EventTracker::RecordExecution()
(raylet) 11  raylet                              0x0000000100c51104 _ZN5boost4asio6detail7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEENS0_17mutable_buffers_1EPKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EclESG_mi + 576 boost::asio::detail::read_op<>::operator()()
(raylet) 12  raylet                              0x0000000100c51410 _ZN5boost4asio6detail23reactive_socket_recv_opINS0_17mutable_buffers_1ENS1_7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEES3_PKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EES8_E11do_completeEPvPNS1_19scheduler_operationESJ_m + 288 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) 13  raylet                              0x000000010139a3cc _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 664 boost::asio::detail::scheduler::do_run_one()
(raylet) 14  raylet                              0x000000010138f7fc _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200 boost::asio::detail::scheduler::run()
(raylet) 15  raylet                              0x000000010138f6e4 _ZN5boost4asio10io_context3runEv + 32 boost::asio::io_context::run()
(raylet) 16  raylet                              0x00000001007ea4c4 main + 4244 main
(raylet) 17  dyld                                0x000000018420d0e0 start + 2360 start
(raylet)
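A quick check of the two numbers in the `Check failed` line may help when reading the trace. The sketch below assumes the raylet logs resource quantities as fixed-point integers scaled by 10000. That factor is an assumption and is not stated anywhere in this issue. Under that assumption, the right-hand value exceeds the left-hand value by exactly the 1024 requested through `object_store_memory`.

```python
# Values copied from the raylet "Check failed" line above.
left = 21474836480000
right = 21474846720000

# Assumption (not stated in the issue): quantities are fixed-point
# integers scaled by 10000.
SCALE = 10_000

# Amount by which the right-hand side exceeds the left-hand side.
excess = (right - left) // SCALE
print(excess)  # 1024, the object_store_memory value in repro.py

# The left-hand side, unscaled, is a round power of two.
left_unscaled = left // SCALE
print(left_unscaled == 2**31)  # True (2147483648)
```

If the assumption holds, this is consistent with the replica's 1024 units being released back into a resource pool they were not deducted from, but the trace alone does not confirm that.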

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jjyao commented 3 weeks ago

`object_store_memory` is not a valid resource for a task or actor. We should raise a `ValueError` instead of hitting a check failure.
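A minimal sketch of the kind of up-front validation suggested here, in plain Python. The helper name `validate_ray_actor_options` and the `INVALID_ACTOR_RESOURCES` set are hypothetical and are not part of Ray's API. Where Ray would actually perform this check is not specified in this thread.

```python
# Hypothetical helper for illustration only; not Ray's implementation.
# Options that, per the comment above, are not valid for a task or actor.
INVALID_ACTOR_RESOURCES = {"object_store_memory"}


def validate_ray_actor_options(options: dict) -> None:
    """Raise ValueError if `options` requests an unsupported resource."""
    invalid = sorted(key for key in options if key in INVALID_ACTOR_RESOURCES)
    if invalid:
        raise ValueError(
            f"Invalid option(s) for a task or actor: {invalid}. "
            "Remove them from ray_actor_options."
        )


# Valid options pass silently.
validate_ray_actor_options({"num_cpus": 1})

# The option from repro.py is rejected before anything is scheduled.
try:
    validate_ray_actor_options({"object_store_memory": 1024})
except ValueError as err:
    print(f"rejected: {err}")
```

Failing at option-validation time would surface the problem as a Python exception at deploy time rather than as a raylet check failure during shutdown.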