Open shrekris-anyscale opened 4 weeks ago
What happened + What you expected to happen
Terminating a Serve deployment that sets object_store_memory logs native errors. See the reproduction script for an example.
Versions / Dependencies
Ray on the latest master.
Reproduction script
    # Filename: repro.py
    import ray
    from ray import serve

    ray.init()

    @serve.deployment(ray_actor_options={"object_store_memory": 1024})
    def f():
        return "hello"

    h = serve.run(f.bind())
    assert h.remote().result() == "hello"
    serve.shutdown()
    ray.shutdown()
Output:
    % python repro.py
    2024-06-06 15:26:09,076 INFO worker.py:1761 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
    (ProxyActor pid=82609) INFO 2024-06-06 15:26:11,168 proxy 10.103.209.172 proxy.py:1165 - Proxy starting on node 04f7f93cf44cbde8c1aa18d24898ee32dde17da40b6124c08ede9b0d (HTTP port: 8000).
    2024-06-06 15:26:11,257 INFO handle.py:126 -- Created DeploymentHandle 'r64girlh' for Deployment(name='f', app='default').
    2024-06-06 15:26:11,257 INFO handle.py:126 -- Created DeploymentHandle '679oq3kv' for Deployment(name='f', app='default').
    (ServeController pid=82599) INFO 2024-06-06 15:26:11,344 controller 82599 deployment_state.py:1598 - Deploying new version of Deployment(name='f', app='default') (initial target replicas: 1).
    (ServeController pid=82599) INFO 2024-06-06 15:26:11,446 controller 82599 deployment_state.py:1844 - Adding 1 replica to Deployment(name='f', app='default').
    2024-06-06 15:26:12,265 INFO handle.py:126 -- Created DeploymentHandle 'd0ywet2o' for Deployment(name='f', app='default').
    2024-06-06 15:26:12,265 INFO api.py:584 -- Deployed app 'default' successfully.
    2024-06-06 15:26:12,270 INFO pow_2_scheduler.py:260 -- Got updated replicas for Deployment(name='f', app='default'): {'so674uw0'}.
    (ServeReplica:default:f pid=82621) INFO 2024-06-06 15:26:12,275 default_f so674uw0 95ba701e-d5e5-45be-84f8-74717de37804 replica.py:373 - __CALL__ OK 1.3ms
    2024-06-06 15:26:12,382 INFO pow_2_scheduler.py:260 -- Got updated replicas for Deployment(name='f', app='default'): set().
    (ServeController pid=82599) INFO 2024-06-06 15:26:12,380 controller 82599 deployment_state.py:1860 - Removing 1 replica from Deployment(name='f', app='default').
    (ServeController pid=82599) INFO 2024-06-06 15:26:14,410 controller 82599 deployment_state.py:2182 - Replica(id='so674uw0', deployment='f', app='default') is stopped.
    (raylet) [2024-06-06 15:26:14,424 C 82586 97027363] (raylet) local_resource_manager.cc:109: Check failed: (_left_ >= _right_) 21474836480000 vs 21474846720000
    (raylet) *** StackTrace Information ***
    (raylet) 0 raylet 0x0000000101279f7c _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84 ray::operator<<()
    (raylet) 1 raylet 0x000000010127caf4 _ZN3ray6RayLogD2Ev + 84 ray::RayLog::~RayLog()
    (raylet) 2 raylet 0x0000000100a4ab88 _ZN3ray20LocalResourceManager25FreeTaskResourceInstancesENSt3__110shared_ptrINS_21TaskResourceInstancesEEEb + 800 ray::LocalResourceManager::FreeTaskResourceInstances()
    (raylet) 3 raylet 0x0000000100a4b7a0 _ZN3ray20LocalResourceManager22ReleaseWorkerResourcesENSt3__110shared_ptrINS_21TaskResourceInstancesEEE + 76 ray::LocalResourceManager::ReleaseWorkerResources()
    (raylet) 4 raylet 0x000000010082ba6c _ZN3ray6raylet16LocalTaskManager22ReleaseWorkerResourcesENSt3__110shared_ptrINS0_15WorkerInterfaceEEE + 1148 ray::raylet::LocalTaskManager::ReleaseWorkerResources()
    (raylet) 5 raylet 0x000000010083dcec _ZN3ray6raylet11NodeManager16DisconnectClientERKNSt3__110shared_ptrINS_16ClientConnectionEEENS_3rpc14WorkerExitTypeERKNS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEEPKNS8_12RayExceptionE + 6280 ray::raylet::NodeManager::DisconnectClient()
    (raylet) 6 raylet 0x0000000100847b30 _ZN3ray6raylet11NodeManager30ProcessDisconnectClientMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEEPKh + 432 ray::raylet::NodeManager::ProcessDisconnectClientMessage()
    (raylet) 7 raylet 0x0000000100846120 _ZN3ray6raylet11NodeManager20ProcessClientMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEExPKh + 940 ray::raylet::NodeManager::ProcessClientMessage()
    (raylet) 8 raylet 0x00000001008f8b88 _ZNSt3__110__function6__funcIZN3ray6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEE3$_2NS_9allocatorISA_EEFvNS_10shared_ptrINS2_16ClientConnectionEEExRKNS_6vectorIhNSB_IhEEEEEEclEOSF_OxSK_ + 52 std::__1::__function::__func<>::operator()()
    (raylet) 9 raylet 0x0000000100c474f4 _ZN3ray16ClientConnection14ProcessMessageERKN5boost6system10error_codeE + 924 ray::ClientConnection::ProcessMessage()
    (raylet) 10 raylet 0x0000000100c5d580 _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 228 EventTracker::RecordExecution()
    (raylet) 11 raylet 0x0000000100c51104 _ZN5boost4asio6detail7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEENS0_17mutable_buffers_1EPKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EclESG_mi + 576 boost::asio::detail::read_op<>::operator()()
    (raylet) 12 raylet 0x0000000100c51410 _ZN5boost4asio6detail23reactive_socket_recv_opINS0_17mutable_buffers_1ENS1_7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEES3_PKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EES8_E11do_completeEPvPNS1_19scheduler_operationESJ_m + 288 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    (raylet) 13 raylet 0x000000010139a3cc _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 664 boost::asio::detail::scheduler::do_run_one()
    (raylet) 14 raylet 0x000000010138f7fc _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200 boost::asio::detail::scheduler::run()
    (raylet) 15 raylet 0x000000010138f6e4 _ZN5boost4asio10io_context3runEv + 32 boost::asio::io_context::run()
    (raylet) 16 raylet 0x00000001007ea4c4 main + 4244 main
    (raylet) 17 dyld 0x000000018420d0e0 start + 2360 start
    (raylet)
Issue Severity
Medium: It is a significant difficulty but I can work around it.
object_store_memory is not a valid resource for a task or actor. We should raise a ValueError instead of hitting a check failure.
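To make the suggestion above concrete, here is a minimal sketch of the kind of up-front check that could turn this into a Python-level error. It is an illustration only, not Ray's actual validation code: the helper name `validate_actor_options` and the constant `UNSUPPORTED_ACTOR_RESOURCES` are hypothetical, and where such a check would live in Ray or Serve is an open question.

```python
# Hypothetical helper for illustration; not part of Ray's API.
# Assumption (from the comment above): object_store_memory is not a
# valid resource for a task or actor.
UNSUPPORTED_ACTOR_RESOURCES = frozenset({"object_store_memory"})


def validate_actor_options(options):
    """Raise ValueError if options request a resource that tasks and actors cannot hold."""
    invalid = sorted(UNSUPPORTED_ACTOR_RESOURCES.intersection(options))
    if invalid:
        raise ValueError(
            f"Unsupported resource(s) for a task or actor: {', '.join(invalid)}"
        )
    return options


# Accepted: an ordinary resource request passes through unchanged.
validate_actor_options({"num_cpus": 1})

# Rejected: the options from repro.py would fail fast with a ValueError.
try:
    validate_actor_options({"object_store_memory": 1024})
except ValueError as err:
    print(err)
```

Failing at deployment definition time would surface the problem in the user's script rather than as a raylet check failure at shutdown.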