What happened + What you expected to happen

When I launch Ray jobs as part of OpenShift AI (RayJobs in K8sJobMode), I observe that the end of the job's logs isn't always correctly captured.
The submit command (part of the Job created out of the RayJob) is the `ray job submit` invocation that KubeRay generates.
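The exact generated command was not preserved in this report; the sketch below is a representative reconstruction of what KubeRay produces for K8sJobMode (the head service address, namespace, and submission ID are hypothetical):

```bash
# Hypothetical reconstruction of the KubeRay-generated submitter command;
# the head service address and submission ID vary per cluster and per run.
ray job submit \
  --address http://rayjob-sample-raycluster-xxxxx-head-svc.<namespace>.svc.cluster.local:8265 \
  --submission-id rayjob-sample-2zcmx \
  -- bash /home/ray/samples/entrypoint.sh
```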
Sometimes, the logs of this submitter Pod do not contain the last lines printed by my entrypoint.sh script:
```console
$ oc logs rayjob-sample-9r7vm | tail -15
│ function_trainable_17836_00193 TERMINATED 0.26546 │
│ function_trainable_17836_00194 TERMINATED 0.268351 │
│ function_trainable_17836_00195 TERMINATED 0.971191 │
│ function_trainable_17836_00196 TERMINATED 0.683966 │
│ function_trainable_17836_00197 TERMINATED 0.509735 │
│ function_trainable_17836_00198 TERMINATED 0.414847 │
│ function_trainable_17836_00199 TERMINATED 0.949224 │
╰─────────────────────────────────────────────────────────╯
The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.
--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
2024-11-12 14:55:08,840 SUCC cli.py:63 -- -----------------------------------
2024-11-12 14:55:08,840 SUCC cli.py:64 -- Job 'rayjob-sample-2zcmx' succeeded
2024-11-12 14:55:08,841 SUCC cli.py:65 -- -----------------------------------
```
However, if I `oc rsh` into Ray's head Pod, I see that the tail of the logs is correctly captured there:
```console
(app-root) sh-5.1$ ray job logs rayjob-sample-2zcmx | tail -10
│ function_trainable_17836_00197 TERMINATED 0.509735 │
│ function_trainable_17836_00198 TERMINATED 0.414847 │
│ function_trainable_17836_00199 TERMINATED 0.949224 │
╰─────────────────────────────────────────────────────────╯
The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.
--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
+ echo 'SCRIPT SUCCEEDED'
SCRIPT SUCCEEDED
```
This issue sits at the boundary between Ray and KubeRay, but I think it should be reproducible outside of the Kubernetes environment, so I chose to file the issue in this repository.
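Before the full Kubernetes reproduction below, a minimal out-of-cluster check could look like the following sketch (my assumption of an equivalent setup, not part of the original report): submit a job whose very last output line is a marker, then verify the marker reached both the client-side stream and the logs stored on the head node.

```bash
# Hypothetical out-of-cluster repro sketch; assumes a local Ray installation.
ray start --head --dashboard-host 0.0.0.0

# The marker is printed immediately before the entrypoint exits, mirroring
# entrypoint.sh from the OpenShift setup.
ray job submit --address http://127.0.0.1:8265 --submission-id repro-1 \
  -- bash -c 'echo "doing work"; echo "SCRIPT SUCCEEDED"' > submitted.log

ray job logs repro-1 > stored.log

# The marker should appear in both copies; in the failing K8s runs it is
# missing from the submitter's copy only.
grep -c 'SCRIPT SUCCEEDED' submitted.log stored.log

ray stop
```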
Reproduction script

Sample job (ray-job-sample.yaml):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  submissionMode: "K8sJobMode"
  entrypoint: bash /home/ray/samples/entrypoint.sh
  runtimeEnvYAML: |
    pip: []
  rayClusterSpec:
    rayVersion: '2.35.0' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      # Pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # Volumes are set at the Pod level, then mounted into containers inside that Pod
            - name: code-sample
              configMap:
                # The name of the ConfigMap to mount
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: test_network_overhead.py
                    path: test_network_overhead.py
                  - key: entrypoint.sh
                    path: entrypoint.sh
    workerGroupSpecs:
      # The Pod replicas in this worker group
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        # Logical group name, here called small-group; can be functional
        groupName: small-group
        rayStartParams: {}
        # Pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lowercase alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "2"
                  requests:
                    cpu: "200m"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  entrypoint.sh: |
    set -o pipefail
    set -o errexit
    set -o nounset
    set -o errtrace
    set -x

    if python /home/ray/samples/test_network_overhead.py; then
      echo "SCRIPT SUCCEEDED"
    else
      echo "SCRIPT FAILED"
      # Don't exit with a return code != 0, otherwise the RayJob->Job retries 3 times ...
    fi
  test_network_overhead.py: |
    import ray
    from ray.tune.utils.release_test_util import timed_tune_run

    def main():
        ray.init(address="auto")

        num_samples = 200
        results_per_second = 0.01
        trial_length_s = 10
        max_runtime = 500

        timed_tune_run(
            name="result network overhead",
            num_samples=num_samples,
            results_per_second=results_per_second,
            trial_length_s=trial_length_s,
            max_runtime=max_runtime,
            # One trial per worker node, none get scheduled on the head node.
            # See the compute config.
            resources_per_trial={"cpu": 2},
        )

    if __name__ == "__main__":
        main()
```
Sample launcher:
```bash
#!/bin/bash

set -o pipefail
set -o errexit
set -o nounset
set -o errtrace
set -x

try_count=0
while true; do
  try_count=$((try_count+1))
  echo "Try #$try_count"

  oc delete -f ray-job-sample.yaml --ignore-not-found
  # Ensure that the Job is gone
  oc delete jobs/rayjob-sample --ignore-not-found

  oc apply -f ray-job-sample.yaml

  set +x
  echo "Waiting for the job to appear ..."
  while ! oc get job/rayjob-sample -oname 2>/dev/null; do
    sleep 1
  done

  echo "Waiting for the job to Complete ..."
  oc wait --for=condition=Complete job/rayjob-sample --timeout=900s

  echo "Checking the job logs ..."
  if ! oc logs job/rayjob-sample | grep -E 'SCRIPT SUCCEEDED|SCRIPT FAILED'; then
    echo "Termination message missing at try #${try_count}!"
    oc logs job/rayjob-sample | tail -25
    exit 1
  fi
done
```
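When an iteration trips the check, it can help to compare the submitter Pod's copy of the logs with the driver log file that the head node keeps for the submission. A sketch, assuming KubeRay's ray.io/node-type=head Pod label and Ray 2.x default log paths (the submission ID below is taken from one failing run and is illustrative):

```bash
# Hypothetical debugging helper, to run after a failed iteration.
HEAD_POD=$(oc get pod -l ray.io/node-type=head -oname | head -1)
oc logs job/rayjob-sample > submitter.log
oc rsh "$HEAD_POD" \
  cat /tmp/ray/session_latest/logs/job-driver-rayjob-sample-2zcmx.log > head.log

# The marker should be present in both copies; in the bad runs it only
# appears in head.log.
grep -c 'SCRIPT SUCCEEDED' submitter.log head.log
```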
Versions / Dependencies

Ray 2.35.0, image quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
Issue Severity
None