ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray component: Core : AssertionError: assert ray.experimental.internal_kv._internal_kv_initialized() #47167

Open jm-nab opened 2 months ago

jm-nab commented 2 months ago

What happened + What you expected to happen

I expected Ray to begin processing.

Instead, I got the following error:

  File "/app/src/backend/loaders/portal/run.py", line 102, in start
    processing_task = process_space.remote(space_key, username, api_key)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/remote_function.py", line 139, in _remote_proxy
    return self._remote(args=args, kwargs=kwargs, **self._default_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 20, in auto_init_wrapper
    auto_init_ray()
  File "/usr/local/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 14, in auto_init_ray
    ray.init()
  File "/usr/local/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/_private/worker.py", line 1681, in init
    _global_node = ray._private.node.Node(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ray/_private/node.py", line 361, in __init__
    self._record_stats()
  File "/usr/local/lib/python3.11/site-packages/ray/_private/node.py", line 1782, in _record_stats
    assert ray.experimental.internal_kv._internal_kv_initialized()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Is it possible that I just got lucky in local development on Docker Compose, because port 6379 already pointed at a running Redis instance?
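One way to narrow this question down is a plain TCP probe against the address in use. The following is a minimal sketch, not part of the original report: `port_is_open` is a hypothetical helper, and the host and port in the usage comment are placeholders. A successful connection only shows that something is listening on the port; it does not tell you whether that service is Redis or the Ray GCS.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (placeholder address):
# port_is_open("localhost", 6379)
```

Running the probe from both the Docker Compose container and the cluster pod would show whether the two environments see a listener on 6379 at all.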

Other things I have tried:

Versions / Dependencies

CPython 3.11.9, Ray 2.32.0

Reproduction script

I SSH'd into a worker node and ran the following, but could not reproduce the issue that the webworker is running into:

>>> ray._private.worker.global_worker
<ray._private.worker.Worker object at 0x7c58d067f390>

>>> ray.init()
2024-08-15 17:00:11,933 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: raycluster-kuberay-head-svc:6379...
2024-08-15 17:00:11,989 INFO worker.py:1779 -- Connected to Ray cluster. View the dashboard at http://xxx:8265
RayContext(dashboard_url='xxx:8265', python_version='3.11.9', ray_version='2.32.0', ray_commit='607f2f30f5f21543b6a5568ee77ea779eeba30a8')

>>> ray._private.worker.global_worker.gcs_client.address
'raycluster-kuberay-head-svc:6379'

Head node:

(base) ray@raycluster-kuberay-head-fpqsl:~$ set | grep REDIS
RAYCLUSTER_KUBERAY_HEAD_SVC_SERVICE_PORT_REDIS=6379
RAY_REDIS_ADDRESS=redis.backend-loaders.svc.cluster.local:6379
REDIS_PASSWORD=

ps aux from head node:


/bin/bash -lc -- ulimit -n 65536; ray start 
    --head 
    --object-store-memory 500000000 
    --dashboard-host=0.0.0.0 
    --dashboard-port=8265 
    --port=6379 
    --ray-client-server-port=10001 
    --block

/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/
/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py 
    --logs-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5 
    --gcs-address=172.0.2.23:6379 
    --monitor-ip=172.0.2.23

/home/ray/anaconda3/bin/python -m ray.util.client.server 
    --address=172.0.2.23:6379 
    --host=0.0.0.0 
    --port=10001 
    --mode=proxy 
    --runtime-env-agent-address=http://172.0.2.23:41272

/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/dashboard.py 
    --host=0.0.0.0 
    --port=8265 
    --port-retries=0 
    --temp-dir=/tmp/ray 
    --log-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --session-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5 
    --gcs-address=172.0.2.23:6379 
    --node-ip-address=172.0.2.23

/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet 
    --raylet_socket_name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/raylet 
    --store_socket_name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/plasma_store 
    --object_manager_port=0 
    --min_worker_port=10002 
    --max_worker_port=19999 
    --node_manager_port=0 
    --node_id=9e7ded1f560d4290a5d83b1e47694b22b67905a95cfdb697db20fba2 
    --node_ip_address=172.0.2.23 
    --maximum_startup_concurrency=1 
    --static_resource_list=node:172.0.2.23,1.0,node:__internal_head__,1.0,CPU,1,memory,984609127,object_store_memory,500000000 
    --python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/workers/setup_worker.py

/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/workers/default_worker.py 
    --node-ip-address=172.0.2.23 
    --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER 
    --object-store-name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/plasma_store 
    --raylet-name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/raylet 
    --redis-address=redis.backend-loaders.svc.cluster.local:6379 
    --metrics-agent-port=64664 
    --runtime-env-agent-port=41272 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5 
    --runtime-env-agent-port=41272 
    --gcs-address=172.0.2.23:6379 
    --session-name=session_2024-08-16_10-42-02_475540_8 
    --temp-dir=/tmp/ray 
    --webui=172.0.2.23:8265 
    --cluster-id=fadc43e16272fbb71d4bf31050fefdefc047a89ed12cbe608920d203
RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER 
    --java_worker_command= 
    --cpp_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/workers/setup_worker.py

/home/ray/anaconda3/lib/python3.11/site-packages/ray/cpp/default_worker 
    --ray_plasma_store_socket_name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/plasma_store 
    --ray_raylet_socket_name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/raylet 
    --ray_node_manager_port=RAY_NODE_MANAGER_PORT_PLACEHOLDER 
    --ray_address=172.0.2.23:6379 
    --ray_redis_password= 
    --ray_session_dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8 
    --ray_logs_dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --ray_node_ip_address=172.0.2.23
RAY_WORKER_DYNAMIC_OPTION_PLACEHOLDER 
    --native_library_path=/home/ray/anaconda3/lib/python3.11/site-packages/ray/cpp/lib 
    --temp_dir=/tmp/ray 
    --session_dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8 
    --log_dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --resource_dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/runtime_resources 
    --metrics-agent-port=64664 
    --metrics_export_port=51656 
    --runtime_env_agent_port=41272 
    --object_store_memory=500000000 
    --plasma_directory=/dev/shm 
    --ray-debugger-external=0 
    --gcs-address=172.0.2.23:6379 
    --session-name=session_2024-08-16_10-42-02_475540_8 
    --labels= 
    --cluster-id=fadc43e16272fbb71d4bf31050fefdefc047a89ed12cbe608920d203 
    --head 
    --num_prestart_python_workers=1 
    --dashboard_agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/agent.py 
    --node-ip-address=172.0.2.23 
    --metrics-export-port=51656 
    --dashboard-agent-port=64664 
    --listen-port=52365 
    --node-manager-port=RAY_NODE_MANAGER_PORT_PLACEHOLDER 
    --object-store-name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/plasma_store 
    --raylet-name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/raylet 
    --temp-dir=/tmp/ray 
    --session-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8 
    --log-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5 
    --session-name=session_2024-08-16_10-42-02_475540_8 
    --gcs-address=172.0.2.23:6379 
    --runtime_env_agent_command=/home/ray/anaconda3/bin/python -u

/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/runtime_env/agent/main.py 
    --node-ip-address=172.0.2.23 
    --runtime-env-agent-port=41272 
    --gcs-address=172.0.2.23:6379 
    --runtime-env-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/runtime_resources 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5 
    --log-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --temp-dir=/tmp/

/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/log_monitor.py 
    --session-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8 
    --logs-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --gcs-address=172.0.2.23:6379 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5

/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/agent.py 
    --node-ip-address=172.0.2.23 
    --metrics-export-port=51656 
    --dashboard-agent-port=64664 
    --listen-port=52365 
    --node-manager-port=44001 
    --object-store-name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/plasma_store 
    --raylet-name=/tmp/ray/session_2024-08-16_10-42-02_475540_8/sockets/raylet 
    --temp-dir=/tmp/ray 
    --session-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8 
    --log-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5 
    --session-name=session_2024-08-16_10-42-02_475540_8 
    --gcs-address=172.0.2.23:6379 
    --agent-id 424238335

/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/runtime_env/agent/main.py 
    --node-ip-address=172.0.2.23 
    --runtime-env-agent-port=41272 
    --gcs-address=172.0.2.23:6379 
    --runtime-env-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/runtime_resources 
    --logging-rotate-bytes=536870912 
    --logging-rotate-backup-count=5 
    --log-dir=/tmp/ray/session_2024-08-16_10-42-02_475540_8/logs 
    --temp-dir=/tmp/

Issue Severity

High: It blocks me from completing my task.

Related Resources:

rkooo567 commented 2 months ago

can you tell me what happens if you call ray.init() before

processing_task = process_space.remote(space_key, username, api_key)

?

jm-nab commented 2 months ago

> can you tell me what happens if you call ray.init() before
>
> processing_task = process_space.remote(space_key, username, api_key)
>
> ?

I had a helper function called connect_ray which checked if not ray.is_initialized() and, in that case, ran ray.init(...config) just before process_space.remote.
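For readers following along, a helper of this shape might look like the sketch below. It is an illustration, not the code from the application in question: the `address` parameter, the `ray_module` injection point, and the return value are assumptions added so the logic can be exercised without a running cluster. Only `ray.is_initialized()` and `ray.init()` are taken from the description above.

```python
from typing import Any, Optional


def connect_ray(
    address: Optional[str] = None,
    ray_module: Any = None,
    **init_kwargs: Any,
) -> bool:
    """Initialize Ray only if this process has not done so already.

    Returns True if ray.init() was called, False if Ray was already initialized.
    """
    if ray_module is None:
        # Imported lazily so the helper can be tested with a stand-in object.
        import ray as ray_module

    if ray_module.is_initialized():
        return False

    # `address` and any extra keyword arguments are passed through unchanged.
    ray_module.init(address=address, **init_kwargs)
    return True


# Example call (placeholder address), made before submitting any remote task:
# connect_ray("raycluster-kuberay-head-svc:6379")
```

Calling the helper explicitly before the first `.remote()` call means the implicit auto-init path shown in the traceback should not be the one that starts Ray.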

After looking at this the last few days, I did the following.

1) Ensured that all dependencies were installed on both the workers and the head.
2) Skipped the auto_init_ray logic by setting the environment variable read here: https://github.com/ray-project/ray/blob/88a6c3961db6c5c9e84b9751f8d4ae2e47c7eece/python/ray/_private/auto_init_hook.py#L7
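The second step could be sketched as follows. This rests on an assumption that is not stated in the thread: that the variable read at the linked line is `RAY_ENABLE_AUTO_CONNECT` and that the value `"0"` disables the implicit `ray.init()`. Check the linked source for your Ray version before relying on it. The flag appears to be read when the module is imported, so it has to be set before `import ray` (or in the container environment).

```python
import os

# Assumption: this is the variable checked in ray/_private/auto_init_hook.py,
# and "0" switches the automatic ray.init() off.
AUTO_CONNECT_VAR = "RAY_ENABLE_AUTO_CONNECT"


def disable_ray_auto_connect() -> None:
    """Set the flag in this process's environment before Ray is imported."""
    os.environ[AUTO_CONNECT_VAR] = "0"


disable_ray_auto_connect()
# import ray  # import Ray only after the variable has been set
```

With auto-connect off, a `.remote()` call made before an explicit `ray.init()` should fail with an error rather than silently starting a local instance, which makes a missing connection step easier to spot.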

I'm not sure if I'd be able to help contribute.

The main confusion for me was the AssertionError itself. It asserts on ray.experimental.internal_kv._internal_kv_initialized(), and it took me quite a bit of digging to determine where and how that state was being attached, initialized, and set up.

Is there any kind of helpful message that could be added? I'd be more than happy to open a PR for the various cases on why the assertion is being made, and what the troubleshooting hint would be.

Something like this?:

assert ray.experimental.internal_kv._internal_kv_initialized(), "ray.experimental.internal_kv._internal_kv_initialized(): GCSClient hasn't been initialized: did the head node fail to start, are there multiple instances running, are the dependencies deployed to the head and worker nodes, ..."
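To show how such a message would surface to the user, here is a small self-contained sketch. The functions `_internal_kv_initialized`, `record_stats`, and `try_record_stats` are stand-ins written for this illustration, not Ray's actual code, and the hint text is only an example. The message leaves out a leading "AssertionError:" because Python already prints the exception class name in front of it.

```python
def _internal_kv_initialized() -> bool:
    """Stand-in for ray.experimental.internal_kv._internal_kv_initialized()."""
    return False  # simulate a GCS client that was never set up


HINT = (
    "GCS client hasn't been initialized: did the head node fail to start, "
    "are there multiple instances running, are the dependencies deployed "
    "to the head and worker nodes?"
)


def record_stats() -> None:
    # Hypothetical stand-in for the check inside Node._record_stats().
    assert _internal_kv_initialized(), HINT


def try_record_stats() -> str:
    """Run the check and return the text a user would see at the end of the traceback."""
    try:
        record_stats()
    except AssertionError as exc:
        return f"{type(exc).__name__}: {exc}"
    return "ok"


print(try_record_stats())
```

The last line of the traceback then carries the troubleshooting hint instead of a bare `AssertionError`.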