ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] SIGSEGV when I run experimental shuffle command. #25650

Open rkooo567 opened 2 years ago

rkooo567 commented 2 years ago

What happened + What you expected to happen

On my Mac, running RAY_BACKEND_LOG_LEVEL=debug python -m ray.experimental.shuffle --num-partitions=100 --partition-size=100e6 --object-store-memory=1e9 --num-nodes 2 on the master branch triggered a segfault in the raylet.

raylet.out

[2022-06-10 00:49:17,103 D 44686 99451332] (raylet) local_resource_manager.cc:36: local resources: {node:127.0.0.1: [10000]/[10000], CPU: [80000]/[80000], object_store_memory: [10000000000000]/[10000000000000], memory: [46664133640000]/[46664133640000]}
[2022-06-10 00:49:17,103 D 44686 99451332] (raylet) cluster_resource_manager.cc:38: Update node info, node_id: 1678177858206950650, node_resources: {node:127.0.0.1: 10000/10000, CPU: 80000/80000, object_store_memory: 10000000000000/10000000000000, memory: 46664133640000/46664133640000}
[2022-06-10 00:49:17,104 I 44686 99451332] (raylet) grpc_server.cc:105: NodeManager server started, listening on port 60877.
[2022-06-10 00:49:17,113 E 44686 99451332] (raylet) logging.cc:325: *** SIGSEGV received at time=1654847357 ***
[2022-06-10 00:49:17,113 E 44686 99451332] (raylet) logging.cc:325: PC: @     0x7fff2066f552  (unknown)  _platform_strlen

I'm not sure whether this happens only on my MacBook.

Versions / Dependencies

Master

Reproduction script

Run RAY_BACKEND_LOG_LEVEL=debug python -m ray.experimental.shuffle --num-partitions=100 --partition-size=100e6 --object-store-memory=1e9 --num-nodes 2
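
For anyone trying to reproduce this, a small helper like the one below may make it easier to check whether a run hit the same crash. It is only a sketch: the regular expression is inferred from the raylet.out lines quoted above, and the function name find_crash_lines is made up for illustration, not part of Ray.

```python
import re

# Pattern inferred from the raylet.out excerpt in this issue, e.g.:
# [2022-06-10 00:49:17,113 E 44686 99451332] (raylet) logging.cc:325: *** SIGSEGV received ...
# This is an assumption about the log layout, not a documented Ray format.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\S+ \S+) (?P<level>[A-Z]) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"\((?P<component>[^)]+)\) (?P<source>[^:]+:\d+): (?P<message>.*)$"
)


def find_crash_lines(lines):
    """Return parsed error-level entries, starting at the first SIGSEGV report.

    Hypothetical helper: lines that do not match the assumed layout are skipped.
    """
    crash = []
    seen_sigsegv = False
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue
        entry = match.groupdict()
        if "SIGSEGV received" in entry["message"]:
            seen_sigsegv = True
        # Keep only error-level lines once the crash report has started.
        if seen_sigsegv and entry["level"] == "E":
            crash.append(entry)
    return crash
```

To use it, pass an open file handle for raylet.out (on a default setup it is usually under /tmp/ray/session_latest/logs/, but that path is an assumption here) and print each entry's timestamp, source, and message. An empty result means no SIGSEGV report was found in that log.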

Issue Severity

No response

rkooo567 commented 2 years ago

cc @iycheng can you run this command on your local machine and see if it is reproducible?

rkooo567 commented 2 years ago

(if so, I think it is P0)

zhe-thoughts commented 2 years ago

@rkooo567 @iycheng Is this still an issue? Thanks