ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

latest ray microbenchmark fails #38758

Open martystack opened 1 year ago

martystack commented 1 year ago

What happened + What you expected to happen

The microbenchmark does not complete on a fresh install.

(ray) mstack@taiga-001:~$ ray start --head
Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]:
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 10.33.110.145


Ray runtime started.

Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.33.110.145:6379'

To connect to this Ray cluster:
    import ray
    ray.init()

To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html for more information on submitting Ray jobs to the Ray cluster.

To terminate the Ray runtime, run ray stop

To view the status of the cluster, use ray status

To monitor and debug Ray, view the dashboard at 127.0.0.1:8265

If connection to the dashboard fails, check your firewall settings and network configuration.

(ray) mstack@taiga-001:~$ ray status
======== Autoscaler status: 2023-08-22 13:54:41.028968 ========
Node status

Healthy:
 1 node_45efa0fb0062f82a1d89e5d9ccc3c27d3cd7283644d25a900d6aa478
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources

Usage:
 0.0/20.0 CPU
 0B/16.55GiB memory
 0B/8.28GiB object_store_memory

Demands:
 (no resource demands)

(ray) mstack@taiga-001:~$ ray microbenchmark
Tip: set TESTS_TO_RUN='pattern' to run a subset of benchmarks
2023-08-22 13:54:57,840 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.33.110.145:6379...
2023-08-22 13:54:57,846 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
single client get calls (Plasma Store) per second 6802.74 +- 454.36
single client put calls (Plasma Store) per second 2941.14 +- 56.22
multi client put calls (Plasma Store) per second 9197.58 +- 218.93
(raylet) Spilled 2400 MiB, 4 objects, write throughput 2785 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 6400 MiB, 9 objects, write throughput 4697 MiB/s.
single client put gigabytes per second 32.07 +- 2.6
(raylet) Spilled 11200 MiB, 15 objects, write throughput 4587 MiB/s.
single client tasks and get batch per second 9.71 +- 0.63
(raylet) Spilled 16880 MiB, 68 objects, write throughput 2380 MiB/s.
multi client put gigabytes per second 31.42 +- 2.2
single client get object containing 10k refs per second 13.15 +- 0.1
(sm pid=1622181) all_value
single client wait 1k refs per second 4.84 +- 0.03
single client tasks sync per second 451.3 +- 18.72
single client tasks async per second 9677.19 +- 325.79
multi client tasks async per second 16119.8 +- 548.7
1:1 actor calls sync per second 867.33 +- 23.21
1:1 actor calls async per second 6595.38 +- 145.43
1:1 actor calls concurrent per second 2581.24 +- 54.76
1:n actor calls async per second 7275.98 +- 129.12
n:n actor calls async per second 20428.31 +- 436.94
2023-08-22 14:00:40,427 WARNING worker.py:2037 -- WARNING: 86 PYTHON worker processes have been started on node: 45efa0fb0062f82a1d89e5d9ccc3c27d3cd7283644d25a900d6aa478 with address: 10.33.110.145. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
n:n actor calls with arg async per second 2633.87 +- 26.76
2023-08-22 14:01:09,649 WARNING worker.py:2037 -- WARNING: 126 PYTHON worker processes have been started on node: 45efa0fb0062f82a1d89e5d9ccc3c27d3cd7283644d25a900d6aa478 with address: 10.33.110.145. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
1:1 async-actor calls sync per second 509.92 +- 14.04
1:1 async-actor calls async per second 1756.59 +- 41.25
1:1 async-actor calls with args async per second 1097.31 +- 48.56
1:n async-actor calls async per second 4966.1 +- 172.15
n:n async-actor calls async per second 12503.4 +- 125.86
2023-08-22 14:03:36,355 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.33.110.145:6379...
Traceback (most recent call last):
  File "/home/mstack/venv/ray/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2474, in main
    return cli()
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/ray/scripts/scripts.py", line 1822, in microbenchmark
    main()
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/ray/_private/ray_perf.py", line 293, in main
    ray.init(resources={"custom": 100})
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/mstack/venv/ray/lib/python3.10/site-packages/ray/_private/worker.py", line 1528, in init
    raise ValueError(
ValueError: When connecting to an existing cluster, resources must not be provided.
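The last frames show the root cause: `ray microbenchmark` unconditionally calls `ray.init(resources={"custom": 100})` (ray_perf.py:293), but `ray.init` rejects the `resources` argument whenever it attaches to an already-running cluster, which is exactly what `ray start --head` created. A minimal sketch of the same failure mode, assuming a head node is already running on the machine:

```python
# Minimal sketch of the failure mode, assuming `ray start --head` has
# already been run locally so ray.init() attaches to that cluster.
import ray

try:
    # Custom resources may only be declared when a node starts (e.g.
    # `ray start --head --resources='{"custom": 100}'`), not when a
    # driver connects to an existing cluster.
    ray.init(resources={"custom": 100})
except ValueError as err:
    print(err)  # "When connecting to an existing cluster, resources must not be provided."
```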

Versions / Dependencies

ray, version 2.6.3
Python 3.10.12
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Reproduction script

$ ray start --head
$ ray microbenchmark
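Note the error only occurs when the benchmark finds a cluster already running; a likely workaround (not verified in this thread) is to run `ray stop` first, or to skip `ray start --head` entirely, so that the benchmark's internal `ray.init(resources={"custom": 100})` starts its own fresh local cluster instead of attaching to an existing one.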

Issue Severity

Medium: It is a significant difficulty but I can work around it.

martystack commented 1 year ago

It also fails on Mac but at a different point for a different reason.

(ray) mstack@mstack601GQ ~ % ray start --head
Usage stats collection is disabled.

Local node IP: 127.0.0.1


Ray runtime started.

Next steps

To connect to this Ray cluster:
    import ray
    ray.init()

To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html for more information on submitting Ray jobs to the Ray cluster.

To terminate the Ray runtime, run ray stop

To view the status of the cluster, use ray status

To monitor and debug Ray, view the dashboard at 127.0.0.1:8265

If connection to the dashboard fails, check your firewall settings and network configuration.

(ray) mstack@mstack601GQ ~ % ray microbenchmark
Tip: set TESTS_TO_RUN='pattern' to run a subset of benchmarks
2023-08-22 14:33:35,578 INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
2023-08-22 14:33:35,589 INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
single client get calls (Plasma Store) per second 6971.21 +- 223.35
single client put calls (Plasma Store) per second 6897.52 +- 442.34
multi client put calls (Plasma Store) per second 15883.03 +- 83.82
single client put gigabytes per second 56.51 +- 5.2
single client tasks and get batch per second 17.33 +- 0.52
(raylet) Spilled 3760 MiB, 48 objects, write throughput 2633 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 7840 MiB, 99 objects, write throughput 3740 MiB/s.
(raylet) Spilled 12480 MiB, 157 objects, write throughput 4308 MiB/s.
(raylet) Spilled 17280 MiB, 217 objects, write throughput 4536 MiB/s.
multi client put gigabytes per second 15.95 +- 3.2
single client get object containing 10k refs per second 24.89 +- 0.61
single client wait 1k refs per second 9.21 +- 0.23
single client tasks sync per second 1780.81 +- 70.39
single client tasks async per second 17619.83 +- 334.14
multi client tasks async per second 19416.17 +- 110.76
1:1 actor calls sync per second 5488.15 +- 29.32
1:1 actor calls async per second 17483.99 +- 53.74
1:1 actor calls concurrent per second 10960.58 +- 87.3
1:n actor calls async per second 19048.88 +- 127.67
2023-08-22 14:38:30,088 WARNING worker.py:2006 -- WARNING: 42 PYTHON worker processes have been started on node: 0af7e9fd614a2175920c890c0f24bea942ed470c5cb0e1b8753fe327 with address: 127.0.0.1. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
n:n actor calls async per second 37300.6 +- 1014.14
2023-08-22 14:38:49,864 WARNING worker.py:2006 -- WARNING: 50 PYTHON worker processes have been started on node: 0af7e9fd614a2175920c890c0f24bea942ed470c5cb0e1b8753fe327 with address: 127.0.0.1. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(raylet) [2023-08-22 14:38:50,704 E 22974 2742920] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE.
(raylet)   what():  pipe_select_interrupter: Too many open files [system:24]
(raylet) [2023-08-22 14:38:50,710 E 22974 2742920] (raylet) logging.cc:104: Stack trace:
(raylet)  0 raylet 0x00000001027e5264 _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84 ray::operator<<()
(raylet)  1 raylet 0x00000001027e5450 _ZN3ray16TerminateHandlerEv + 228 ray::TerminateHandler()
(raylet)  2 libc++abi.dylib 0x00000001a8c0bf48 _ZSt11terminatePFvvE + 16 std::__terminate()
(raylet)  3 libc++abi.dylib 0x00000001a8c0ed34 __cxa_get_exception_ptr + 0 __cxa_get_exception_ptr
(raylet)  4 libc++abi.dylib 0x00000001a8c0ece0 _ZN10__cxxabiv1L22exception_cleanup_funcE19_Unwind_Reason_CodeP17_Unwind_Exception + 0 __cxxabiv1::exception_cleanup_func()
(raylet)  5 raylet 0x0000000102d7104c _ZN5boost15throw_exceptionINS_6system12system_errorEEEvRKT_RKNS_15source_locationE + 72 boost::throw_exception<>()
(raylet)  6 raylet 0x0000000102d71090 _ZN5boost4asio6detail14do_throw_errorERKNS_6system10error_codeEPKcRKNS_15source_locationE + 48 boost::asio::detail::do_throw_error()
(raylet)  7 raylet 0x0000000102d62164 _ZN5boost4asio6detail23pipe_select_interrupterD2Ev + 0 boost::asio::detail::pipe_select_interrupter::~pipe_select_interrupter()
(raylet)  8 raylet 0x0000000102d5f278 _ZN5boost4asio6detail14kqueue_reactorC2ERNS0_17execution_contextE + 460 boost::asio::detail::kqueue_reactor::kqueue_reactor()
(raylet)  9 raylet 0x0000000102614e18 _ZN5boost4asio6detail16service_registry6createINS1_14kqueue_reactorENS0_17execution_contextEEEPNS5_7serviceEPv + 36 boost::asio::detail::service_registry::create<>()
(raylet) 10 raylet 0x0000000102d6718c _ZN5boost4asio6detail16service_registry14do_use_serviceERKNS0_17execution_context7service3keyEPFPS4_PvES9_ + 172 boost::asio::detail::service_registry::do_use_service()
(raylet) 11 raylet 0x0000000102d636d0 _ZN5boost4asio6detail28reactive_socket_service_baseC2ERNS0_17execution_contextE + 56 boost::asio::detail::reactive_socket_service_base::reactive_socket_service_base()
(raylet) 12 raylet 0x00000001027590c0 _ZN5boost4asio6detail16service_registry6createINS1_23reactive_socket_serviceINS0_2ip3tcpEEENS0_10io_contextEEEPNS0_17execution_context7serviceEPv + 64 boost::asio::detail::service_registry::create<>()
(raylet) 13 raylet 0x0000000102d6718c _ZN5boost4asio6detail16service_registry14do_use_serviceERKNS0_17execution_context7service3keyEPFPS4_PvES9_ + 172 boost::asio::detail::service_registry::do_use_service()
(raylet) 14 raylet 0x0000000102757a18 _Z9CheckFreei + 92 CheckFree()
(raylet) 15 raylet 0x00000001024a2d9c _ZN3ray6raylet10WorkerPool15GetNextFreePortEPi + 148 ray::raylet::WorkerPool::GetNextFreePort()
(raylet) 16 raylet 0x00000001024a466c _ZN3ray6raylet10WorkerPool14RegisterWorkerERKNSt3__110shared_ptrINS0_15WorkerInterfaceEEEixNS2_8functionIFvNS_6StatusEiEEE + 512 ray::raylet::WorkerPool::RegisterWorker()
(raylet) 17 raylet 0x00000001023db648 _ZN3ray6raylet11NodeManager35ProcessRegisterClientRequestMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEEPKh + 952 ray::raylet::NodeManager::ProcessRegisterClientRequestMessage()
(raylet) 18 raylet 0x00000001023daf64 _ZN3ray6raylet11NodeManager20ProcessClientMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEExPKh + 1296 ray::raylet::NodeManager::ProcessClientMessage()
(raylet) 19 raylet 0x0000000102480b30 _ZNSt3__110__function6__funcIZN3ray6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEE3$_2NS_9allocatorISA_EEFvNS_10shared_ptrINS2_16ClientConnectionEEExRKNS_6vectorIhNSB_IhEEEEEEclEOSF_OxSK_ + 56 std::__1::__function::__func<>::operator()()
(raylet) 20 raylet 0x00000001027345d0 _ZN3ray16ClientConnection14ProcessMessageERKN5boost6system10error_codeE + 828 ray::ClientConnection::ProcessMessage()
(raylet) 21 raylet 0x0000000102742184 _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 108 EventTracker::RecordExecution()
(raylet) 22 raylet 0x000000010273e7c4 _ZN5boost4asio6detail7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEENS0_17mutable_buffers_1EPKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EclESG_mi + 568 boost::asio::detail::read_op<>::operator()()
(raylet) 23 raylet 0x000000010273eb40 _ZN5boost4asio6detail23reactive_socket_recv_opINS0_17mutable_buffers_1ENS1_7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEES3_PKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EES8_E11do_completeEPvPNS1_19scheduler_operationESJ_m + 292 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
(raylet) 24 raylet 0x0000000102d65ed8 _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 624 boost::asio::detail::scheduler::do_run_one()
(raylet) 25 raylet 0x0000000102d5aa50 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200 boost::asio::detail::scheduler::run()
(raylet) 26 raylet 0x0000000102d5a938 _ZN5boost4asio10io_context3runEv + 32 boost::asio::io_context::run()
(raylet) 27 raylet 0x0000000102377e64 main + 3400 main
(raylet) 28 dyld 0x00000001a88fbf28 start + 2236 start
(raylet)
(raylet) *** SIGABRT received at time=1692740330 ***
(raylet) PC: @        0x1a8c1c764  (unknown)  __pthread_kill
(raylet) [2023-08-22 14:38:50,710 E 22974 2742920] (raylet) logging.cc:361: *** SIGABRT received at time=1692740330 ***
(raylet) [2023-08-22 14:38:50,710 E 22974 2742920] (raylet) logging.cc:361: PC: @        0x1a8c1c764  (unknown)  __pthread_kill
(raylet) [2023-08-22 14:38:50,710 E 24441 2766905] core_worker.cc:201: Failed to register worker e17551908e3b22f0f854073943f5ba1760d8d7dcd4330180833c3af6 to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
2023-08-22 14:38:52,412 WARNING worker.py:2006 -- Raylet is terminated: ip=127.0.0.1, id=0af7e9fd614a2175920c890c0f24bea942ed470c5cb0e1b8753fe327. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals.
Last 20 lines of the Raylet logs:
12 raylet 0x00000001027590c0 _ZN5boost4asio6detail16service_registry6createINS1_23reactive_socket_serviceINS0_2ip3tcpEEENS0_10io_contextEEEPNS0_17execution_context7serviceEPv + 64 boost::asio::detail::service_registry::create<>()
13 raylet 0x0000000102d6718c _ZN5boost4asio6detail16service_registry14do_use_serviceERKNS0_17execution_context7service3keyEPFPS4_PvES9_ + 172 boost::asio::detail::service_registry::do_use_service()
14 raylet 0x0000000102757a18 _Z9CheckFreei + 92 CheckFree()
15 raylet 0x00000001024a2d9c _ZN3ray6raylet10WorkerPool15GetNextFreePortEPi + 148 ray::raylet::WorkerPool::GetNextFreePort()
16 raylet 0x00000001024a466c _ZN3ray6raylet10WorkerPool14RegisterWorkerERKNSt3__110shared_ptrINS0_15WorkerInterfaceEEEixNS2_8functionIFvNS_6StatusEiEEE + 512 ray::raylet::WorkerPool::RegisterWorker()
17 raylet 0x00000001023db648 _ZN3ray6raylet11NodeManager35ProcessRegisterClientRequestMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEEPKh + 952 ray::raylet::NodeManager::ProcessRegisterClientRequestMessage()
18 raylet 0x00000001023daf64 _ZN3ray6raylet11NodeManager20ProcessClientMessageERKNSt3__110shared_ptrINS_16ClientConnectionEEExPKh + 1296 ray::raylet::NodeManager::ProcessClientMessage()
19 raylet 0x0000000102480b30 _ZNSt3__110__function6__funcIZN3ray6raylet6Raylet12HandleAcceptERKN5boost6system10error_codeEE3$_2NS_9allocatorISA_EEFvNS_10shared_ptrINS2_16ClientConnectionEEExRKNS_6vectorIhNSB_IhEEEEEEclEOSF_OxSK_ + 56 std::__1::__function::__func<>::operator()()
20 raylet 0x00000001027345d0 _ZN3ray16ClientConnection14ProcessMessageERKN5boost6system10error_codeE + 828 ray::ClientConnection::ProcessMessage()
21 raylet 0x0000000102742184 _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 108 EventTracker::RecordExecution()
22 raylet 0x000000010273e7c4 _ZN5boost4asio6detail7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEENS0_17mutable_buffers_1EPKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EclESG_mi + 568 boost::asio::detail::read_op<>::operator()()
23 raylet 0x000000010273eb40 _ZN5boost4asio6detail23reactive_socket_recv_opINS0_17mutable_buffers_1ENS1_7read_opINS0_19basic_stream_socketINS0_7generic15stream_protocolENS0_15any_io_executorEEES3_PKNS0_14mutable_bufferENS1_14transfer_all_tEZN3ray16ClientConnection20ProcessMessageHeaderERKNS_6system10error_codeEE3$_7EES8_E11do_completeEPvPNS1_19scheduler_operationESJ_m + 292 boost::asio::detail::reactive_socket_recv_op<>::do_complete()
24 raylet 0x0000000102d65ed8 _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 624 boost::asio::detail::scheduler::do_run_one()
25 raylet 0x0000000102d5aa50 _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200 boost::asio::detail::scheduler::run()
26 raylet 0x0000000102d5a938 _ZN5boost4asio10io_context3runEv + 32 boost::asio::io_context::run()
27 raylet 0x0000000102377e64 main + 3400 main
28 dyld 0x00000001a88fbf28 start + 2236 start

[2023-08-22 14:38:50,710 E 22974 2742920] (raylet) logging.cc:361: *** SIGABRT received at time=1692740330 ***
[2023-08-22 14:38:50,710 E 22974 2742920] (raylet) logging.cc:361: PC: @        0x1a8c1c764  (unknown)  __pthread_kill

Traceback (most recent call last):
  File "/Users/mstack/venv/ray/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2474, in main
    return cli()
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1822, in microbenchmark
    main()
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/_private/ray_perf.py", line 241, in main
    results += timeit(
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/_private/ray_microbenchmark_helpers.py", line 26, in timeit
    fn()
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/_private/ray_perf.py", line 239, in actor_multi2_direct_arg
    ray.get([c.small_value_batch_arg.remote(n) for c in clients])
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/Users/mstack/venv/ray/lib/python3.8/site-packages/ray/_private/worker.py", line 2495, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

jjyao commented 1 year ago
(raylet) [2023-08-22 14:38:50,704 E 22974 2742920] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE.
  what():  pipe_select_interrupter: Too many open files [system:24]

Can you increase the limit via ulimit?
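For context, the raylet stack trace shows the crash inside boost::asio while constructing a kqueue reactor (which allocates a pipe) during WorkerPool::RegisterWorker, so it is consistent with exhausting the per-process file-descriptor limit; macOS's default soft limit is typically only 256. A minimal sketch for checking and raising the limit from Python before starting Ray (the shell equivalent is `ulimit -n 65536` in the shell that runs `ray start`; the 65536 target is an arbitrary illustration, not a Ray-documented requirement):

```python
# Sketch: inspect and raise the open-file limit (RLIMIT_NOFILE) for this
# process before calling ray.init(). The 65536 target is illustrative.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

target = 65536
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)  # the soft limit cannot exceed the hard limit
if soft < target:
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except ValueError:
        # macOS may cap this below the requested value (kern.maxfilesperproc);
        # in that case, raise the system cap first or pick a smaller target.
        pass
```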