ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

GCS memory leak #48723

Open csaimd opened 1 week ago

csaimd commented 1 week ago

What happened + What you expected to happen

  1. Operating environment: Python 3.6.5, Ray 2.3.1, Kubernetes 1.18

  2. Bug description: A Ray cluster with 1 head node + 800 worker nodes was manually created in the Kubernetes cluster through YAML files. The head pod was created first, and then 800 worker pods were created through a Kubernetes Deployment; the worker pods joined the cluster through the head pod's IP. Without any tasks being submitted, the head pod's 32G of memory kept growing. After entering the container, I found that the memory used by GCS kept growing until the head pod was OOM-killed.

Head pod configuration: 8C 32G
Head startup command: ulimit -n 65536; ray start --head --block --no-monitor --dashboard-host=0.0.0.0 --metrics-export-port=20001 --dashboard-agent-grpc-port=20002 --num-cpus 0 --memory 33554432 --num-gpus 0

Worker pod configuration: 1C 1G
Worker startup command: ulimit -n 65536; ray start --block --address=$HEAD_IP:6379 --metrics-export-port=20001 --dashboard-agent-grpc-port=20002 --num-cpus 1 --memory 1048576 --num-gpus 0

The memory of the head pod keeps growing during operation: [screenshot: head pod memory usage growing over time]

However, with the same configuration but only 400 workers, the head pod's memory usage is very stable at a little over 5G, as shown in the following figure: [screenshot: head pod memory usage stable at roughly 5G]
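For reference, a minimal sketch of how the GCS memory growth described above could be watched from inside the head pod container; the gcs_server process name and the 60-second sampling interval are assumptions on my part, not part of the original report:

# Hypothetical monitoring loop, run inside the head pod container.
# Assumes the GCS runs as a process named "gcs_server" and that ps/pgrep are available.
while true; do
  # Print a timestamp and the resident set size (in KiB) of the GCS process.
  echo "$(date '+%F %T') gcs_server RSS(kB): $(ps -o rss= -p "$(pgrep -f gcs_server | head -n 1)")"
  sleep 60
done

Logging this alongside the pod-level memory graph would show whether the growth is attributable to the GCS process itself or to other head-node daemons.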

Versions / Dependencies

Python 3.6.5, Ray 2.3.1, Kubernetes 1.18

Reproduction script

none

Issue Severity

None

jjyao commented 1 week ago

Ray 2.3.1 is pretty old. Can you try the latest version and see if it has the leak?
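For example, a quick way to retest on a recent release might look like the sketch below; the pip extra and flags shown are assumptions based on the original startup command, and note that recent Ray releases require a newer Python than 3.6:

# Hypothetical upgrade check (assumes an environment with a Python version that current Ray supports, e.g. 3.9+).
pip install -U "ray[default]"
ray stop
ulimit -n 65536; ray start --head --block --no-monitor --dashboard-host=0.0.0.0 --num-cpus 0 --num-gpus 0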