skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core] SkyPilot task not turn into succeed/failed state after it finishes #3858

Open Michaelvll opened 2 months ago

Michaelvll commented 2 months ago

A user reported that this happens for tasks that call `docker run` in the `run` section or that set `image_id: docker:xxx`. We need to reproduce it.

Version & Commit info:

landscapepainter commented 2 months ago

I tried this on our current master branch with a simple task YAML, but was not able to reproduce the issue. Is this not the scenario we are looking for, or is there more context on the issue?

  1. with docker run: docker_run_test.yaml:
    run: |
      docker run hello-world
$ sky launch docker_run_test.yaml -c docker_run_test --cloud gcp -y
...
I 08-25 19:15:21 log_lib.py:412] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.128.0.58']
(task, pid=4100) Unable to find image 'hello-world:latest' locally
(task, pid=4100) latest: Pulling from library/hello-world
(task, pid=4100) c1ec31eb5944: Pulling fs layer
(task, pid=4100) c1ec31eb5944: Verifying Checksum
(task, pid=4100) c1ec31eb5944: Download complete
(task, pid=4100) c1ec31eb5944: Pull complete
(task, pid=4100) Digest: sha256:53cc4d415d839c98be39331c948609b659ed725170ad2ca8eb36951288f81b75
(task, pid=4100) Status: Downloaded newer image for hello-world:latest
(task, pid=4100)
(task, pid=4100) Hello from Docker!
(task, pid=4100) This message shows that your installation appears to be working correctly.
(task, pid=4100)
(task, pid=4100) To generate this message, Docker took the following steps:
(task, pid=4100)  1. The Docker client contacted the Docker daemon.
(task, pid=4100)  2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(task, pid=4100)     (amd64)
(task, pid=4100)  3. The Docker daemon created a new container from that image which runs the
(task, pid=4100)     executable that produces the output you are currently reading.
(task, pid=4100)  4. The Docker daemon streamed that output to the Docker client, which sent it
(task, pid=4100)     to your terminal.
(task, pid=4100)
(task, pid=4100) To try something more ambitious, you can run an Ubuntu container with:
(task, pid=4100)  $ docker run -it ubuntu bash
(task, pid=4100)
(task, pid=4100) Share images, automate workflows, and more with a free Docker ID:
(task, pid=4100)  https://hub.docker.com/
(task, pid=4100)
(task, pid=4100) For more examples and ideas, visit:
(task, pid=4100)  https://docs.docker.com/get-started/
(task, pid=4100)
INFO: Job finished (status: SUCCEEDED).
$ sky logs docker_run_test --status
Getting job status...
Job 1: SUCCEEDED

Status successfully shows as SUCCEEDED.
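To repeat this check across many runs, the status output can be parsed rather than read by eye. The following is a minimal sketch that assumes the output keeps the `Job N: STATUS` shape shown above; `parse_job_statuses` is a hypothetical helper written for this thread, not part of SkyPilot.

```python
import re
from typing import Dict

# Matches lines such as "Job 1: SUCCEEDED", as printed by
# `sky logs <cluster> --status` in the transcript above.
_STATUS_LINE = re.compile(r"^Job (\d+): ([A-Z_]+)$")


def parse_job_statuses(output: str) -> Dict[int, str]:
    """Map each job ID to its status string (hypothetical helper)."""
    statuses: Dict[int, str] = {}
    for line in output.splitlines():
        match = _STATUS_LINE.match(line.strip())
        if match:
            statuses[int(match.group(1))] = match.group(2)
    return statuses


sample = "Getting job status...\nJob 1: SUCCEEDED\n"
print(parse_job_statuses(sample))  # {1: 'SUCCEEDED'}
```

Lines that do not match, such as `Getting job status...`, are ignored.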

1.1 Resubmit a job that will fail to see if the status changes to FAILED. failing_test.yaml:

run: |
  docker run nonexistentimage
$ sky launch failing_test.yaml -c docker_run_test -y
...
Task from YAML spec: failing_test.yaml
Running task on cluster docker_run_test...
I 08-25 19:51:20 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2024-08-25-19-51-19-484016/provision.log
I 08-25 19:51:23 provisioner.py:65] Launching on GCP us-central1 (us-central1-a)
I 08-25 19:51:34 provisioner.py:450] Successfully provisioned or found existing instance.
I 08-25 19:51:50 provisioner.py:552] Successfully provisioned cluster: docker_run_test
I 08-25 19:51:55 cloud_vm_ray_backend.py:3276] Job submitted with Job ID: 2
I 08-25 19:51:57 log_lib.py:412] Start streaming logs for job 2.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.128.0.88']
(task, pid=4100) Unable to find image 'nonexistentimage:latest' locally
(task, pid=4100) docker: Error response from daemon: pull access denied for nonexistentimage, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
(task, pid=4100) See 'docker run --help'.
ERROR: Job 2 failed with return code list: [125]
INFO: Job finished (status: FAILED).
$ sky logs docker_run_test --status
Getting job status...
Job 2: FAILED

Status successfully shows as FAILED.
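Return code 125 is what `docker run` exits with when the failure comes from the Docker daemon itself rather than from the container, which matches the pull error above. The sketch below pulls the return code list and the final status out of a streamed job log and checks that the two agree. It assumes the log lines keep the wording shown in this thread. `summarize_job_log` and `is_consistent` are hypothetical helpers, and the consistency rule is a simplified model, not SkyPilot's actual logic.

```python
import ast
import re
from typing import List, Optional, Tuple

# Patterns taken from the log lines quoted in this thread.
_RETURN_CODES = re.compile(r"failed with return code list: (\[[^\]]*\])")
_FINISHED = re.compile(r"Job finished \(status: ([A-Z_]+)\)")


def summarize_job_log(log_text: str) -> Tuple[Optional[str], List[int]]:
    """Return (final status or None, return codes) found in a job log."""
    status: Optional[str] = None
    return_codes: List[int] = []
    for line in log_text.splitlines():
        codes = _RETURN_CODES.search(line)
        if codes:
            # The list is printed as a Python literal, e.g. "[125]".
            return_codes = list(ast.literal_eval(codes.group(1)))
        finished = _FINISHED.search(line)
        if finished:
            status = finished.group(1)
    return status, return_codes


def is_consistent(status: Optional[str], return_codes: List[int]) -> bool:
    """Simplified assumption: nonzero code means FAILED, otherwise SUCCEEDED."""
    if status is None:
        # No "Job finished" line at all: the symptom this issue describes.
        return False
    failed = any(code != 0 for code in return_codes)
    return status == ("FAILED" if failed else "SUCCEEDED")


log = (
    "ERROR: Job 2 failed with return code list: [125]\n"
    "INFO: Job finished (status: FAILED).\n"
)
print(summarize_job_log(log))  # ('FAILED', [125])
```

A log that ends without a `Job finished` line yields a status of `None`, which is the case we are trying to reproduce.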

  2. with image_id: docker:ubuntu:20.04: docker_image_id_test.yaml:

    resources:
      image_id: docker:ubuntu:20.04

    run: |
      echo "Beginning task."

$ sky launch docker_image_id_test.yaml -c docker_image_id_test --cloud gcp -y
...
I 08-25 19:33:48 cloud_vm_ray_backend.py:3276] Job submitted with Job ID: 1
I 08-25 19:33:49 log_lib.py:412] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.128.0.85']
(task, pid=6205) Beginning task.
INFO: Job finished (status: SUCCEEDED).

$ sky logs docker_image_id_test --status
Getting job status...
Job 1: SUCCEEDED

**Status successfully shows as `SUCCEEDED`.**

2.1 Resubmit a job that will fail to see if the status changes to `FAILED`.
`failing_test.yaml`:

resources:
  image_id: docker:ubuntu:20.04

run: |
  docker run nonexistentimage

$ sky launch failing_test.yaml -c docker_image_id_test -y
...
Task from YAML spec: failing_test.yaml
Running task on cluster docker_image_id_test...
I 08-25 19:55:15 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2024-08-25-19-55-14-151872/provision.log
I 08-25 19:55:17 provisioner.py:65] Launching on GCP us-central1 (us-central1-a)
I 08-25 19:55:29 provisioner.py:450] Successfully provisioned or found existing instance.
I 08-25 19:55:59 provisioner.py:552] Successfully provisioned cluster: docker_image_id_test
I 08-25 19:56:04 cloud_vm_ray_backend.py:3276] Job submitted with Job ID: 2
I 08-25 19:56:05 log_lib.py:412] Start streaming logs for job 2.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.128.0.85']
(task, pid=6208) python3: can't open file '/root/sky_workdir/non_existent_script.py': [Errno 2] No such file or directory
ERROR: Job 2 failed with return code list: [2]
INFO: Job finished (status: FAILED).

$ sky logs docker_image_id_test --status
Getting job status...
Job 2: FAILED


**Status successfully shows as `FAILED`.**
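Since the reported symptom is a job that never reaches a terminal state, a reproduction attempt may need to wait with a timeout instead of checking the status once. The sketch below shows one way to do that. `wait_for_terminal_status` and its `get_status` callable are hypothetical, and treating only `SUCCEEDED` and `FAILED` as terminal is an assumption taken from the issue title. The `RUNNING` value in the example is just a stand-in for any non-terminal status.

```python
import time
from typing import Callable, Optional

# Assumption: only the two states named in the issue title count as terminal.
TERMINAL_STATUSES = frozenset({"SUCCEEDED", "FAILED"})


def wait_for_terminal_status(
    get_status: Callable[[], Optional[str]],
    timeout_s: float,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll get_status until it returns a terminal status or time runs out.

    Returns the terminal status, or None if the job is still non-terminal
    when the timeout expires (the behavior this issue describes).
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            return None
        sleep(poll_interval_s)


# Example with a fake status source; no cluster is contacted.
fake_statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
result = wait_for_terminal_status(
    lambda: next(fake_statuses), timeout_s=60, sleep=lambda _: None
)
print(result)  # SUCCEEDED
```

In a real run, `get_status` would wrap whatever reports the job status, for example by parsing the output of `sky logs <cluster> --status`. That wiring is left out here because it depends on the local setup.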