tppalani opened this issue 10 months ago
Attached configuration files
ray-operator.txt ray-clutser.txt provisioner.txt awsnodetemplate.txt
For the failed job, can you check the status with ray job status Lq...? It might have more information.
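For reference, a full invocation looks something like this (the submission ID placeholder is illustrative; use the ID printed when you submitted the job):
$ ray job status <submission-id> --address http://127.0.0.1:8265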
And also, when I'm submitting the job from CLI, how can we check which Python job is running (i.e., which node is using it)?
By default, the job driver script runs on the head node. But child tasks and actors in the job might run on different nodes. The best way to check is using the Ray Dashboard. The Ray State API might also be helpful from the command line, though it will need to be run from a Ray node.
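For example, with a recent Ray 2.x CLI the state API can be queried like this (a sketch; run from the head node or another Ray node):
$ ray list nodes             # node IDs and IPs in the cluster
$ ray list actors --detail   # per-actor state, including which node each actor runs on
$ ray list tasks --detail    # per-task state, including which node each task ran on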
Hi @architkulkarni, thanks for checking. For the failed job we don't get any JOB_ID, right?
And I can check the logs using the job submission logs. When I do this I can see that the xgboost module is not found, but I have installed the Ray cluster in EKS with the help of Karpenter, we have 4 nodes, and I have installed the xgboost Python module on all of them, yet I'm still getting the error.
And also, can you please tell me how to delete the failed job from the Ray dashboard or CLI?
$ ray job logs 'raysubmit_sFUs2e7y2C3YhDk9' --follow --address http://127.0.0.1:8265
Job submission server address: http://127.0.0.1:8265
fatal: destination path 'ray' already exists and is not an empty directory.
Traceback (most recent call last):
File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 11, in <module>
import xgboost as xgb
ModuleNotFoundError: No module named 'xgboost'
---------------------------------------
Job 'raysubmit_sFUs2e7y2C3YhDk9' failed
---------------------------------------
Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
fatal: destination path 'ray' already exists and is not an empty directory.
Traceback (most recent call last):
File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 11, in <module>
import xgboost as xgb
ModuleNotFoundError: No module named 'xgboost'
@architkulkarni it would be really helpful if you could point me in the right direction, because the application team is planning to deploy AI workloads onto the Ray cluster after my POC.
Hi @architkulkarni, can you please help me with this? I have been stuck for the past two weeks.
To debug the xgboost import error, you can print sys.path inside a Ray job to get the list of directories Python tries to import from, and compare it to the directory where xgboost is installed on the node. Alternatively, you can specify xgboost in the job's runtime_env field to install it at runtime: https://docs.ray.io/en/latest/ray-core/handling-dependencies.html
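A minimal sketch of the runtime_env approach via the Jobs CLI (the entrypoint script name here stands in for whatever your job actually runs):
$ ray job submit --address http://127.0.0.1:8265 \
    --runtime-env-json '{"pip": ["xgboost"]}' \
    -- python xgboost_benchmark.py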
By the way, xgboost_benchmark.py isn't intended to be part of your job; I'm not sure why this file is being run. What's your job entrypoint script?
To delete a job, you can use the ray job delete CLI. You can't delete it from the dashboard.
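For example (using the submission ID from your logs above, and assuming your Ray version ships the ray job delete command):
$ ray job delete raysubmit_sFUs2e7y2C3YhDk9 --address http://127.0.0.1:8265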
Hi @architkulkarni, thanks for checking. For the failed job we don't get any JOB_ID, right?
You can use the submission ID above.
I'm a bit confused by "fatal: destination path 'ray' already exists and is not an empty directory." Are you git cloning the Ray repository in your job? That shouldn't be necessary.
Hi @architkulkarni, thanks for helping. I didn't change any of the Python xgboost sample code; I used it as is. And also, the xgboost module error is coming after submitting the job, as mentioned in the comments above.
Hi @architkulkarni, I have resolved the above issue. Following that, I'm getting another error message while submitting the job:
$ ray job logs 'raysubmit_K6fX2B2DGxp6jWzu' --follow --address http://127.0.0.1:8265
Job submission server address: http://localhost:8265
Cloning into 'ray'...
Checking out files: ...
Checking out files: 100% (7358/7358), done.
Traceback (most recent call last):
File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 16, in <module>
from ray.train import RunConfig, ScalingConfig
ImportError: cannot import name 'RunConfig' from 'ray.train' (/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py)
---------------------------------------
Job 'raysubmit_K6fX2B2DGxp6jWzu' failed
---------------------------------------
Why are you cloning the Ray repo at runtime? That's not recommended; Ray should already be installed on all nodes.
It's hard to know without more details, but it looks like you're trying to run xgboost_benchmark.py, and one hypothesis is that the script is taken from the master branch of Ray, while the installed Ray version is a stable release older than master. If that's the case, there's a version incompatibility, which can be resolved by using the xgboost_benchmark.py from the release branch of the Ray version you're running.
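For example, if your cluster runs Ray 2.0.0, you could fetch the script from the matching release branch (the raw-file URL is assumed from GitHub's usual pattern):
$ curl -O https://raw.githubusercontent.com/ray-project/ray/releases/2.0.0/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py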
Why are you cloning the Ray repo at runtime? That's not recommended; Ray should already be installed on all nodes.
Ray is already installed as part of the Helm chart inside my EKS cluster; you can refer to the pod status above. Coming to the clone: the clone option is defined inside the xgboost.py file, it's pre-defined and I didn't change anything.
Hi @architkulkarni, as you mentioned, I have updated the Ray image version to the latest. Now I'm getting an issue with S3 bucket access: https://github.com/ray-project/ray/blob/releases/2.0.0/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py
$ kubectl get rayclusters -n xgboost xgboost-kuberay
NAME AGE
xgboost-kuberay 6h20m
$ kubectl describe rayclusters -n xgboost xgboost-kuberay
Name: xgboost-kuberay
Namespace: xgboost
Labels: app.kubernetes.io/instance=xgboost
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kuberay
helm.sh/chart=ray-cluster-0.4.0
Annotations: meta.helm.sh/release-name: xgboost
meta.helm.sh/release-namespace: xgboost
$ kubectl get pods --selector=ray.io/cluster=xgboost-kuberay -n xgboost
NAME READY STATUS RESTARTS AGE
xgboost-kuberay-head-7ldrl 2/2 Running 0 6h25m
$ kubectl describe pod xgboost-kuberay-head-7ldrl -n xgboost
Name: xgboost-kuberay-head-7ldrl
Namespace: xgboost
Priority: 0
Service Account: xgboost-kuberay
Node: ip-10.1.2.3.us-east-2.compute.internal/10.1.2.3
Start Time: Sun, 14 Jan 2024 09:04:06 +0530
Labels: app.kubernetes.io/created-by=kuberay-operator
app.kubernetes.io/instance=xgboost
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kuberay
helm.sh/chart=ray-cluster-0.4.0
ray.io/cluster=xgboost-kuberay
ray.io/cluster-dashboard=xgboost-kuberay-dashboard
ray.io/group=headgroup
ray.io/identifier=xgboost-kuberay-head
ray.io/is-ray-node=yes
ray.io/node-type=head
Annotations: ray.io/ft-enabled: false
ray.io/health-state:
Status: Running
API Version: ray.io/v1alpha1
Limits:
cpu: 14
ephemeral-storage: 700Gi
memory: 54Gi
Requests:
cpu: 14
ephemeral-storage: 700Gi
memory: 54Gi
Kind: RayCluster
Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
Traceback (most recent call last):
File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 191, in <module>
main(args)
File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 150, in main
training_time = run_xgboost_training(data_path, num_workers, cpus_per_worker)
File "ray/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py", line 74, in wrapper
raise p.exception
OSError: Failing to read AWS S3 file(s): "air-example-data-2/100G-xgboost-data.parquet". Please check that file exists and has properly configured access. You can also run AWS CLI command to get more detailed error message (e.g., aws s3 ls <file-name>). See https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html and https://docs.ray.io/en/latest/data/creating-datasets.html#reading-from-remote-storage for more information.
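As the error message suggests, one way to verify access from inside the cluster is to run the AWS CLI in the head pod (a sketch, assuming the AWS CLI is available in the Ray container; the pod name is from the kubectl output above, and you may need -c <container> if the pod has multiple containers):
$ kubectl exec -it xgboost-kuberay-head-7ldrl -n xgboost -- aws s3 ls s3://air-example-data-2/100G-xgboost-data.parquet/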
Hi @architkulkarni, just adding a note: when I do a curl, I can pull the bucket data from Git Bash on my local Windows system. Can you please look into this? See the output file lists:
$ curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/10G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 39.2M 100 39.2M 0 0 2184k 0 0:00:18 0:00:18 --:--:-- 3080k
$ ls -lrt
total 40208
-rw-r--r-- 1 user 1049089 600 Jan 16 16:49 xgboost_submit.py
-rw-r--r-- 1 user 1049089 41166379 Jan 16 17:03 8034b2644a1d426d9be3bbfa78673dfa_000000.parquet
$ curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/100G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 344 0 344 0 0 124 0 --:--:-- 0:00:02 --:--:-- 124
$ ls -lrt
total 8
-rw-r--r-- 1 user 1049089 600 Jan 16 16:49 xgboost_submit.py
-rw-r--r-- 1 user 1049089 344 Jan 16 17:04 8034b2644a1d426d9be3bbfa78673dfa_000000.parquet
Path 1 (10G):
curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/10G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure
Path 2 (100G):
curl -O https://air-example-data-2.s3.us-west-2.amazonaws.com/100G-xgboost-data.parquet/8034b2644a1d426d9be3bbfa78673dfa_000000.parquet --insecure
Hi @architkulkarni, just one doubt: after submitting the xgboost job, the Ray head node will be using some kind of Kubernetes service account, right? Do you know which service account it will be using?
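For reference, one way to check this yourself, assuming kubectl access (the pod name is from the describe output above, which already lists Service Account: xgboost-kuberay):
$ kubectl get pod xgboost-kuberay-head-7ldrl -n xgboost -o jsonpath='{.spec.serviceAccountName}'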
Description
I am running an EKS cluster (version 1.26) with Karpenter autoscaling to manage workloads. Here is the list of steps I followed to configure the Ray head node and worker nodes.