ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Worker pool didn't prestart num_cpus workers #29162

Open jjyao opened 1 year ago

jjyao commented 1 year ago

What happened + What you expected to happen

When the first driver registers, the worker pool has code to prestart --num_initial_python_workers_for_first_job (which equals num_cpus) workers. However, it can only successfully start --maximum_startup_concurrency (which equals the number of physical CPUs) workers.

The result is that on a 16-CPU machine, if we do ray.init(num_cpus=32), only 16 workers are prestarted instead of 32.

Versions / Dependencies

master

Reproduction script

On a 16-CPU machine, run ray.init(num_cpus=32).
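
For reference, a minimal script to observe the behavior (a sketch, assuming psutil is installed and that prestarted idle workers show up with the "ray::IDLE" process title, which is how Ray names idle worker processes):

    import time

    import psutil
    import ray

    # Start a local Ray instance on a 16-CPU machine, overcommitting logical CPUs.
    ray.init(num_cpus=32)

    # Give the worker pool a moment to prestart workers.
    time.sleep(5)

    # Count idle prestarted workers by their process title.
    idle = [
        p for p in psutil.process_iter(["cmdline"])
        if p.info["cmdline"] and any("ray::IDLE" in part for part in p.info["cmdline"])
    ]
    print(f"Prestarted idle workers: {len(idle)}")  # 32 expected, 16 observed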

Issue Severity

No response

richardliaw commented 1 year ago

I don't think this is high priority, though we can log a message about this behavior.

yummydsky commented 3 months ago

Hi, I am a newbie, and I ran into a strange situation; I am not sure whether my understanding is correct.

Suppose there are two worker pods: the first on a Kubernetes node with 4 CPUs and the second on a node with 16 CPUs. I set num-cpus: 16 for the workerGroup as below.

    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 2
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams:
          num-cpus: "16"

The issue is that only 4 worker processes are pre-created on the first worker pod, and 0 worker processes are pre-created on the second. The raylet command line below, from the first worker pod, shows --maximum_startup_concurrency=4 even though --num_prestart_python_workers=16.

     24       8 /home/ray/anaconda3/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/raylet --store_socket_name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/plasma_store --object_manager_port=0 --min_worker_port=10002 --max_worker_port=19999 --node_manager_port=0 --node_ip_address=10.244.229.28 --maximum_startup_concurrency=4 --static_resource_list=node:10.244.229.28,1.0,CPU,16,memory,4279215719,object_store_memory,1833949593 --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/workers/default_worker.py --node-ip-address=10.244.229.28 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/raylet --redis-address=None --temp-dir=/tmp/ray --metrics-agent-port=60842 --runtime-env-agent-port=53007 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --runtime-env-agent-port=53007 --gcs-address=rayjob-sample-raycluster-5vq4d-head-svc.kube-ray.svc.cluster.local:6379 --session-name=session_2024-04-02_03-44-52_373416_8 --temp-dir=/tmp/ray --webui=10.244.229.27:8265 --cluster-id=9ad394ebea818ba53f9b7b1a873f61e496fb8d8d2e80bab4947db916 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --java_worker_command= --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/workers/setup_worker.py /home/ray/anaconda3/lib/python3.8/site-packages/ray/cpp/default_worker --ray_plasma_store_socket_name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/plasma_store --ray_raylet_socket_name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/raylet --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --ray_address=rayjob-sample-raycluster-5vq4d-head-svc.kube-ray.svc.cluster.local:6379 --ray_redis_password= --ray_session_dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8 --ray_logs_dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8/logs --ray_node_ip_address=10.244.229.28 RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER --native_library_path=/home/ray/anaconda3/lib/python3.8/site-packages/ray/cpp/lib --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8 --log_dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8/logs --resource_dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8/runtime_resources --metrics-agent-port=60842 --metrics_export_port=8080 --runtime_env_agent_port=53007 --object_store_memory=1833949593 --plasma_directory=/dev/shm --ray-debugger-external=0 --gcs-address=rayjob-sample-raycluster-5vq4d-head-svc.kube-ray.svc.cluster.local:6379 --session-name=session_2024-04-02_03-44-52_373416_8 --labels= --cluster-id=9ad394ebea818ba53f9b7b1a873f61e496fb8d8d2e80bab4947db916 --num_prestart_python_workers=16 --dashboard_agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/agent.py --node-ip-address=10.244.229.28 --metrics-export-port=8080 --dashboard-agent-port=60842 --listen-port=52365 --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER --object-store-name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/plasma_store --raylet-name=/tmp/ray/session_2024-04-02_03-44-52_373416_8/sockets/raylet --temp-dir=/tmp/ray --session-dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8 
--log-dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8/logs --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --session-name=session_2024-04-02_03-44-52_373416_8 --gcs-address=rayjob-sample-raycluster-5vq4d-head-svc.kube-ray.svc.cluster.local:6379 --runtime_env_agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/runtime_env/agent/main.py --node-ip-address=10.244.229.28 --runtime-env-agent-port=53007 --gcs-address=rayjob-sample-raycluster-5vq4d-head-svc.kube-ray.svc.cluster.local:6379 --runtime-env-dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8/runtime_resources --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --log-dir=/tmp/ray/session_2024-04-02_03-44-52_373416_8/logs --temp-dir=/tmp/ray

(Screenshot, 2024-04-02 6:58 PM)
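
For reference, the prestart-related flags can also be read directly off the raylet command line on each pod; a rough sketch, assuming psutil is available in the worker image (the flag names are copied from the listing above):

    import psutil

    # Locate the raylet process on this pod and print the flags that govern
    # worker prestart and startup concurrency.
    for proc in psutil.process_iter(["cmdline"]):
        cmdline = proc.info["cmdline"] or []
        if cmdline and cmdline[0].endswith("raylet"):
            for arg in cmdline:
                if arg.startswith(("--maximum_startup_concurrency", "--num_prestart_python_workers")):
                    print(arg)
            break

On the first worker pod this prints --maximum_startup_concurrency=4 and --num_prestart_python_workers=16, which lines up with the original issue description above.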

I would like to know whether I can run the code with ray.init(num_cpus=16) or not.
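
If I understand correctly, num_cpus passed to ray.init() only takes effect when Ray starts a new local instance; when the driver attaches to an existing KubeRay cluster, the per-node CPU counts come from rayStartParams instead. So would the right call be something like the sketch below (the address value is just the usual default for running inside the cluster)?

    import ray

    # Attach to the existing KubeRay cluster; do not pass num_cpus here,
    # because the logical CPU counts were already fixed by `num-cpus`
    # in rayStartParams when the raylets started.
    ray.init(address="auto")

    # Shows the declared (logical) resources of the cluster,
    # not the physical CPU counts of the nodes.
    print(ray.cluster_resources())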