ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] `ray job submit` doesn't always catch the last lines of the job logs #48701

Open kpouget opened 3 days ago

kpouget commented 3 days ago

What happened + What you expected to happen

When I launch Ray jobs as part of OpenShift AI (RayJobs with submissionMode: K8sJobMode), I observe that the end of the job's logs is not always captured correctly.

The submit command (part of the Kubernetes Job created from the RayJob) is the following:

        - ray
        - job
        - submit
        - --address
        - http://rayjob-sample-raycluster-25q9n-head-svc.topsail.svc.cluster.local:8265
        - --runtime-env-json
        - '{"pip":[]}'
        - --submission-id
        - rayjob-sample-2zcmx
        - --
        - bash
        - /home/ray/samples/entrypoint.sh
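
For reference, that args array is equivalent to the following single shell invocation (the service hostname and submission ID are of course specific to my cluster):

ray job submit \
    --address http://rayjob-sample-raycluster-25q9n-head-svc.topsail.svc.cluster.local:8265 \
    --runtime-env-json '{"pip":[]}' \
    --submission-id rayjob-sample-2zcmx \
    -- bash /home/ray/samples/entrypoint.sh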

Sometimes, the logs of this submitter Pod do not contain the last lines printed by my entrypoint.sh script:

oc logs rayjob-sample-9r7vm | tail -15
│ function_trainable_17836_00193   TERMINATED   0.26546   │
│ function_trainable_17836_00194   TERMINATED   0.268351  │
│ function_trainable_17836_00195   TERMINATED   0.971191  │
│ function_trainable_17836_00196   TERMINATED   0.683966  │
│ function_trainable_17836_00197   TERMINATED   0.509735  │
│ function_trainable_17836_00198   TERMINATED   0.414847  │
│ function_trainable_17836_00199   TERMINATED   0.949224  │
╰─────────────────────────────────────────────────────────╯

The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.

--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
2024-11-12 14:55:08,840 SUCC cli.py:63 -- -----------------------------------
2024-11-12 14:55:08,840 SUCC cli.py:64 -- Job 'rayjob-sample-2zcmx' succeeded
2024-11-12 14:55:08,841 SUCC cli.py:65 -- -----------------------------------

However, if I oc rsh into Ray's head Pod and run ray job logs there, I see that the full output is captured correctly:

(app-root) sh-5.1$ ray job logs rayjob-sample-2zcmx | tail -10
│ function_trainable_17836_00197   TERMINATED   0.509735  │
│ function_trainable_17836_00198   TERMINATED   0.414847  │
│ function_trainable_17836_00199   TERMINATED   0.949224  │
╰─────────────────────────────────────────────────────────╯

The result network overhead test took 6.04 seconds, which is below the budget of 500.00 seconds. Test successful.

--- PASSED: RESULT NETWORK OVERHEAD ::: 6.04 <= 500.00 ---
+ echo 'SCRIPT SUCCEEDED'
SCRIPT SUCCEEDED
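
A quick way to see the discrepancy (a sketch; the head Pod name is a placeholder, and the submission ID comes from my run) is to diff the two log sources:

# logs as captured by the submitter Job Pod
oc logs job/rayjob-sample > submitter.log

# logs as stored by the Ray head, queried from inside the head Pod
oc exec <head-pod> -- ray job logs rayjob-sample-2zcmx > head.log

# when the bug triggers, head.log contains trailing lines missing from submitter.log
diff submitter.log head.log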

This issue is at the boundary between Ray and KubeRay, but I think it should be reproducible outside of the K8s environment (a sketch of such a reproduction is below), so I chose to file the issue in this repository.
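
A minimal non-K8s reproduction sketch, assuming a local cluster and an entrypoint.sh that ends by printing a marker line (the submission ID and file names here are arbitrary):

# start a local Ray cluster; the Jobs API listens on the dashboard port (8265)
ray start --head

# submit the script and capture everything the submitter streams back
ray job submit --address http://127.0.0.1:8265 \
    --submission-id repro-job -- bash entrypoint.sh | tee submitter.log

# compare against the logs stored by the cluster
ray job logs repro-job > stored.log
diff submitter.log stored.log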

Versions / Dependencies

Ray 2.35.0, image quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26

Reproduction script

Sample job (ray-job-sample.yaml)

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  submissionMode: "K8sJobMode"
  entrypoint: bash /home/ray/samples/entrypoint.sh

  runtimeEnvYAML: |
    pip: []

  rayClusterSpec:
    rayVersion: '2.35.0' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: test_network_overhead.py
                    path: test_network_overhead.py
                  - key: entrypoint.sh
                    path: entrypoint.sh
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 4
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "2"
                  requests:
                    cpu: "200m"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  entrypoint.sh: |
    set -o pipefail;
    set -o errexit;
    set -o nounset;
    set -o errtrace;
    set -x;

    if python /home/ray/samples/test_network_overhead.py ; then
        echo "SCRIPT SUCCEEDED";
    else
        echo "SCRIPT FAILED";
        # don't exit with a return code != 0, otherwise the RayJob->Job retries 3 times ...
    fi

  test_network_overhead.py: |
    import os
    import json

    import ray

    from ray.tune.utils.release_test_util import timed_tune_run

    def main():
        ray.init(address="auto")

        num_samples = 200

        results_per_second = 0.01
        trial_length_s = 10

        max_runtime = 500

        success = timed_tune_run(
            name="result network overhead",
            num_samples=num_samples,
            results_per_second=results_per_second,
            trial_length_s=trial_length_s,
            max_runtime=max_runtime,
            # One trial per worker node, none get scheduled on the head node.
            # See the compute config.
            resources_per_trial={"cpu": 2},
        )

    if __name__ == "__main__":
        main()

Sample launcher:

#!/bin/bash

set -o pipefail
set -o errexit
set -o nounset
set -o errtrace
set -x

try_count=0
while true; do
    try_count=$((try_count+1))
    echo "Try #$try_count"
    oc delete -f ray-job-sample.yaml --ignore-not-found
    # ensure that the job is gone
    oc delete jobs/rayjob-sample --ignore-not-found
    oc apply -f ray-job-sample.yaml

    set +x
    echo "Waiting for the job to appear ..."

    while  ! oc get job/rayjob-sample -oname 2>/dev/null; do
        sleep 1;
    done

    echo "Waiting for the job to Complete ..."
    oc wait --for=condition=Complete job/rayjob-sample --timeout=900s

    echo "Checking the job logs ..."
    if ! oc logs job/rayjob-sample | grep -E 'SCRIPT SUCCEEDED|SCRIPT FAILED'; then
        echo "Termination message missing at try #{try_count}!"
        oc logs job/rayjob-sample | tail -25
        exit 1
    fi
done

Issue Severity

None

MortalHappiness commented 1 day ago

Note for myself: oc is equivalent to kubectl.
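
For example, the log check in the launcher above can be run unchanged with kubectl:

kubectl logs job/rayjob-sample | grep -E 'SCRIPT SUCCEEDED|SCRIPT FAILED'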