What happened + What you expected to happen
Bug description:
A Ray cluster with 1 head node + 800 worker nodes was created manually in the k8s cluster through YAML files. The head pod was created first, and the 800 worker pods were then created through a k8s Deployment; the worker pods joined the cluster via the head pod's IP. Without any tasks being submitted, memory usage on the head pod (32G total) kept growing. After entering the container, we found that the memory used by the GCS process kept growing until the head pod was OOM-killed.

Head pod configuration: 8C 32G
Head startup command: ulimit -n 65536; ray start --head --block --no-monitor --dashboard-host=0.0.0.0 --metrics-export-port=20001 --dashboard-agent-grpc-port=20002 --num-cpus 0 --memory 33554432 --num-gpus 0

Worker pod configuration: 1C 1G
Worker startup command: ulimit -n 65536; ray start --block --address=$HEAD_IP:6379 --metrics-export-port=20001 --dashboard-agent-grpc-port=20002 --num-cpus 1 --memory 1048576 --num-gpus 0

The head pod's memory keeps growing while the cluster sits idle, as shown in the first figure.

However, with the same configuration but only 400 workers, head pod memory usage is very stable, holding at just over 5G, as shown in the second figure.
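For anyone trying to confirm the same symptom, here is a minimal sketch of one way to watch GCS memory from inside the head pod container. The process name gcs_server matches standard Ray head nodes; the sampling loop itself is illustrative, not from the original report:

```bash
# Illustrative check (not from the original report): sample the resident
# memory (RSS, in KB) of the gcs_server process once a minute inside the
# head pod container to confirm it grows while the cluster is idle.
while true; do
  date
  ps -o rss= -p "$(pgrep -n gcs_server)"   # -n: newest matching PID
  sleep 60
done
```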
Versions / Dependencies
python 3.6.5
ray 2.3.1
kubernetes 1.18
Reproduction script
None provided.
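Although no script was included, the worker side of the setup described above can be sketched as a single kubectl command. The Deployment name, labels, image tag, and HEAD_IP placeholder below are assumptions; the ray start flags are copied verbatim from the report:

```bash
# Hypothetical worker Deployment matching the report (names, labels, and
# image are illustrative; the head pod must already be running).
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-worker
spec:
  replicas: 800            # with 400 replicas the growth did not reproduce
  selector:
    matchLabels:
      app: ray-worker
  template:
    metadata:
      labels:
        app: ray-worker
    spec:
      containers:
      - name: ray-worker
        image: rayproject/ray:2.3.1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
        command: ["/bin/bash", "-c"]
        args:
        - >-
          ulimit -n 65536;
          ray start --block --address=$HEAD_IP:6379
          --metrics-export-port=20001 --dashboard-agent-grpc-port=20002
          --num-cpus 1 --memory 1048576 --num-gpus 0
        env:
        - name: HEAD_IP
          value: "<HEAD_POD_IP>"   # replace with the actual head pod IP
EOF
```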
Issue Severity
None