ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray Cluster Setup on Existing EKS Cluster #42109

Open tppalani opened 10 months ago

tppalani commented 10 months ago

Description

I am running an EKS cluster (version 1.26) with Karpenter autoscaling to manage workloads. Here are the steps I followed to configure the Ray head node and worker nodes:

  1. Configured the Karpenter Provisioner with the configuration in the attached file.
  2. Configured the existing Karpenter AWSNodeTemplate with the configuration in the attached file.
  3. Installed the NVIDIA device plugin from a YAML file; it is running as a DaemonSet.
  4. Installed the KubeRay operator via Helm using Terraform; refer to the attached file.
  5. Installed a Ray cluster via Helm using Terraform; refer to the attached file.
  6. Used port-forwarding to access the Ray Dashboard; refer to the screenshot. (A plain-Helm sketch of steps 4-6 follows this list.)
  7. I am able to submit the xgboost Python job and see the job status in the Ray Dashboard. When I submit the job from the Git Bash terminal, the status response is successful (see the reference output below). But when I check the Ray Dashboard, I see two different submission IDs with different states: one is SUCCEEDED and the other is FAILED.
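For reference, a plain-Helm equivalent of steps 4-6 would look roughly like the sketch below (this is not the Terraform setup used here; the release names and namespace are assumptions, and the head service name follows the KubeRay chart's naming convention):

$ helm repo add kuberay https://ray-project.github.io/kuberay-helm/
$ helm install kuberay-operator kuberay/kuberay-operator -n xgboost
$ helm install raycluster kuberay/ray-cluster -n xgboost
# The chart names the head service <release>-kuberay-head-svc
$ kubectl port-forward -n xgboost svc/raycluster-kuberay-head-svc 8265:8265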

[Screenshots: Ray Dashboard showing two submission IDs with different states, one SUCCEEDED and one FAILED]

$ ray job submit --working-dir ./ -- python xgboost_submit.py
Job submission server address: http://localhost:8265
2023-12-27 22:05:43,689 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_148cfd9114c4b3bb.zip.
2023-12-27 22:05:43,690 INFO packaging.py:518 -- Creating a file package for local directory './'.

-------------------------------------------------------
Job 'raysubmit_gGek2i6T2JeRhCAm' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_gGek2i6T2JeRhCAm
  Query the status of the job:
    ray job status raysubmit_gGek2i6T2JeRhCAm
  Request the job to be stopped:
    ray job stop raysubmit_gGek2i6T2JeRhCAm

Tailing logs until the job exits (disable with --no-wait):
Use the following command to follow this Job's logs:
ray job logs 'raysubmit_LqFarpCpzRxpQNfG' --follow

------------------------------------------
Job 'raysubmit_gGek2i6T2JeRhCAm' succeeded
------------------------------------------

Also, when I submit the job from the CLI, how can we check which Python job is running (i.e., which node it is running on)?

Link

No response

tppalani commented 10 months ago

Attached configuration files

ray-operator.txt ray-clutser.txt provisioner.txt awsnodetemplate.txt

architkulkarni commented 10 months ago

For the failed job, can you check the status with ray job status Lq...? It might have more information.

Also, when I submit the job from the CLI, how can we check which Python job is running (i.e., which node it is running on)?

By default, the job driver script runs on the head node. But child tasks and actors in the job might run on different nodes. The best way to check is using the Ray Dashboard. The Ray State API might also be helpful from the command line, though it will need to be run from a Ray node.
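For example, a sketch of the State API CLI commands (run from the head pod; the submission ID is the failed one from the "Tailing logs" output above, and the exact output columns may differ between Ray versions):

$ ray job status raysubmit_LqFarpCpzRxpQNfG --address http://127.0.0.1:8265
$ ray list nodes                 # Ray nodes with their node IDs and IPs
$ ray list tasks --detail        # per-task state, including the node each task ran on
$ ray list actors --detail       # per-actor state, including the node each actor is placed on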

tppalani commented 10 months ago

Hi @architkulkarni, thanks for checking. For the failed job we don't get any JOB_ID, right?

I am able to check the logs via the job submission logs, and there I can see that the xgboost module is not found. But I installed the Ray cluster on EKS with the help of Karpenter, and I have installed the xgboost Python module on all 4 nodes, yet I'm still getting the error.

Also, can you please tell me how to delete the failed job from the Ray Dashboard or CLI?

$ ray job logs 'raysubmit_sFUs2e7y2C3YhDk9' --follow --address http://127.0.0.1:8265
Job submission server address: http://127.0.0.1:8265
fatal: destination path 'ray' already exists and is not an empty directory.
Traceback (most recent call last):
  File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 11, in <module>
    import xgboost as xgb
ModuleNotFoundError: No module named 'xgboost'

---------------------------------------
Job 'raysubmit_sFUs2e7y2C3YhDk9' failed
---------------------------------------

Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
fatal: destination path 'ray' already exists and is not an empty directory.
Traceback (most recent call last):
  File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 11, in <module>
    import xgboost as xgb
ModuleNotFoundError: No module named 'xgboost'
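(Side note: with KubeRay, the job driver and workers run inside the Ray container image, so xgboost has to be importable inside that image, or installed via runtime_env, not just on the EKS worker nodes' host OS. A quick check, sketched with a placeholder pod name:)

$ kubectl exec -n xgboost <ray-head-pod> -- python -c "import xgboost; print(xgboost.__version__)"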

tppalani commented 10 months ago

@architkulkarni it would be really helpful if you could point me in the right direction, because the application team is planning to deploy AI workloads to the Ray cluster after my POC.

tppalani commented 10 months ago

Hi @architkulkarni, can you please help me with this? I have been stuck for the past two weeks.

architkulkarni commented 10 months ago

~To debug the xgboost import error, you can print sys.path inside a Ray job to get a list of directories where Python is trying to import from, and compare it to the directory where xgboost is installed on the node. Alternatively, you can specify xgboost in the Job's runtime_env field to install it at runtime: https://docs.ray.io/en/latest/ray-core/handling-dependencies.html~ Actually, I'm guessing xgboost_benchmark.py isn't intended to be part of your job. I'm not sure why this file is being run. What's your job entrypoint script?

To delete a job, you can use the ray job delete CLI. You can't delete it from the dashboard.
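A sketch, using the failed submission ID from the logs above (note that only jobs in a terminal state such as FAILED or SUCCEEDED can be deleted):

$ ray job delete raysubmit_sFUs2e7y2C3YhDk9 --address http://127.0.0.1:8265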

Hi @architkulkarni, thanks for checking. For the failed job we don't get any JOB_ID, right?

[Screenshot showing the failed job's submission ID]

You can use the submission ID above.

architkulkarni commented 10 months ago

I'm a bit confused by "fatal: destination path 'ray' already exists and is not an empty directory." Are you git cloning the Ray repository in your job? That shouldn't be necessary.

tppalani commented 10 months ago

Hi @architkulkarni, thanks for helping. I didn't change any of the Python xgboost sample code; I used it as-is.

tppalani commented 10 months ago

Also, the xgboost error comes up after submitting the job, as mentioned in the comments above.

tppalani commented 10 months ago

HI @architkulkarni

I have resolved the above issue. Now I'm getting another error message while submitting the job:

$ ray job logs 'raysubmit_K6fX2B2DGxp6jWzu' --follow --address http://127.0.0.1:8265
Job submission server address: http://localhost:8265
Cloning into 'ray'...
Checking out files: 100% (7358/7358), done.
Traceback (most recent call last):
  File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 16, in <module>
    from ray.train import RunConfig, ScalingConfig
ImportError: cannot import name 'RunConfig' from 'ray.train' (/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py)

---------------------------------------
Job 'raysubmit_K6fX2B2DGxp6jWzu' failed
---------------------------------------

architkulkarni commented 10 months ago

Why are you cloning the Ray repo at runtime? That's not recommended; Ray should already be installed on all nodes.

architkulkarni commented 10 months ago

It's hard to know without more details, but it looks like you're trying to run xgboost_benchmark.py, and one hypothesis is that the script is taken from the master branch of Ray while the installed Ray version is an older stable release. If that's the case, there's a version incompatibility, which can be resolved by using the xgboost_benchmark.py from the release branch of the Ray version you're running.
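One way to check for that mismatch (a sketch; <namespace> and <head-pod> are placeholders, and the 2.0.0 branch below is only an example; pick the release branch matching the printed version):

$ kubectl exec -n <namespace> <head-pod> -- python -c "import ray; print(ray.__version__)"
$ curl -O https://raw.githubusercontent.com/ray-project/ray/releases/2.0.0/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py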

tppalani commented 10 months ago

Why are you cloning the Ray repo at runtime? That's not recommended; Ray should already be installed on all nodes.

Ray is already installed as part of the Helm chart inside my EKS cluster; you can refer to the pod status above. As for the clone: the clone step is pre-defined inside the xgboost Python file, and I didn't change anything.

tppalani commented 10 months ago

Hi @architkulkarni

As you mentioned, I have updated the Ray image version to the latest. Now I'm getting an issue with S3 bucket access when running https://github.com/ray-project/ray/blob/releases/2.0.0/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py


$ kubectl get rayclusters -n xgboost xgboost-kuberay

NAME              AGE
xgboost-kuberay   6h20m

$ kubectl describe rayclusters -n xgboost xgboost-kuberay

Name:         xgboost-kuberay
Namespace:    xgboost
Labels:       app.kubernetes.io/instance=xgboost
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=kuberay
              helm.sh/chart=ray-cluster-0.4.0
Annotations:  meta.helm.sh/release-name: xgboost
              meta.helm.sh/release-namespace: xgboost

$ kubectl get pods --selector=ray.io/cluster=xgboost-kuberay -n xgboost
NAME                         READY   STATUS    RESTARTS   AGE
xgboost-kuberay-head-7ldrl   2/2     Running   0          6h25m

$ 

Name:             xgboost-kuberay-head-7ldrl
Namespace:        xgboost
Priority:         0
Service Account:  xgboost-kuberay
Node:             ip-10.1.2.3.us-east-2.compute.internal/10.1.2.3
Start Time:       Sun, 14 Jan 2024 09:04:06 +0530
Labels:           app.kubernetes.io/created-by=kuberay-operator
                  app.kubernetes.io/instance=xgboost
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=kuberay
                  helm.sh/chart=ray-cluster-0.4.0
                  ray.io/cluster=xgboost-kuberay
                  ray.io/cluster-dashboard=xgboost-kuberay-dashboard
                  ray.io/group=headgroup
                  ray.io/identifier=xgboost-kuberay-head
                  ray.io/is-ray-node=yes
                  ray.io/node-type=head
Annotations:      ray.io/ft-enabled: false
                  ray.io/health-state:
Status:           Running
API Version:  ray.io/v1alpha1

 Limits:
      cpu:                14
      ephemeral-storage:  700Gi
      memory:             54Gi
    Requests:
      cpu:                14
      ephemeral-storage:  700Gi
      memory:             54Gi
Kind:         RayCluster
Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):

Traceback (most recent call last):
  File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 191, in <module>
    main(args)
  File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 150, in main
    training_time = run_xgboost_training(data_path, num_workers, cpus_per_worker)
  File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 74, in wrapper
    raise p.exception
OSError: Failing to read AWS S3 file(s): "air-example-data-2/100G-xgboost-data.parquet". Please check that file exists and has properly configured access. You can also run AWS CLI command to get more detailed error message (e.g., aws s3 ls <file-name>). See https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html and https://docs.ray.io/en/latest/data/creating-datasets.html#reading-from-remote-storage for more information.
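As the error message suggests, one quick check is to list the path with the AWS CLI from inside the head pod, so the same credentials/IAM role are exercised (a sketch, assuming the AWS CLI is available in the Ray image; --no-sign-request can be tried if the pod has no AWS credentials configured):

$ kubectl exec -n xgboost xgboost-kuberay-head-7ldrl -- aws s3 ls s3://air-example-data-2/100G-xgboost-data.parquet/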

tppalani commented 10 months ago

HI @architkulkarni

Just adding a note: when I run curl, I am able to pull the bucket data from Git Bash on my local Windows system. Can you please look into this? See the output file listings below.

$ curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/10G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.2M  100 39.2M    0     0  2184k      0  0:00:18  0:00:18 --:--:-- 3080k

$ ls -lrt
total 40208
-rw-r--r-- 1 user1049089      600 Jan 16 16:49 xgboost_submit.py
-rw-r--r-- 1 user 1049089 41166379 Jan 16 17:03 8034b2644a1d426d9be3bbfa78673dfa_000000.parquet

$ curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/100G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   344    0   344    0     0    124      0 --:--:--  0:00:02 --:--:--   124

$ ls -lrt
total 8
-rw-r--r-- 1 user 1049089 600 Jan 16 16:49 xgboost_submit.py
-rw-r--r-- 1 user 1049089 344 Jan 16 17:04 8034b2644a1d426d9be3bbfa78673dfa_000000.parquet

Path 1 (10G):

curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/10G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure

Path 2 (100G):

curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/100G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure

tppalani commented 10 months ago

Hi @architkulkarni, just one doubt: after submitting the xgboost job, the Ray head node will be using some kind of Kubernetes service account, right? Do you know which service account it will be using?
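(For what it's worth, the head pod description earlier in this thread shows "Service Account: xgboost-kuberay". A quick way to confirm which service account a Ray pod is using, sketched with the head pod name from above:)

$ kubectl get pod xgboost-kuberay-head-7ldrl -n xgboost -o jsonpath='{.spec.serviceAccountName}'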