skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Tests] Managed Jobs smoke test failed on latest master #4211

Closed cblmemo closed 3 weeks ago

cblmemo commented 3 weeks ago

Three smoke tests are failing on the latest master: test_job_pipeline, test_managed_jobs and test_managed_jobs_pipeline_failed_setup. The failures seem to have been introduced in #4169. Notice that the job name is not displayed correctly:

$ sky jobs queue --refresh 
Fetching managed job statuses...
Managed jobs
In progress tasks: 1 RUNNING
ID  TASK  NAME                 RESOURCES           SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS        
4   -     t-managed-jobs-2e-2  1x[CPU:1+]          3 mins ago  3m 55s         2m 45s        0            RUNNING       
3   -     t-managed-jobs-2e-1  1x[CPU:1+]          4 mins ago  1m 49s         52s           0            CANCELLED     
2         2                    -                   4 mins ago  8s             -             0            CANCELLED     
 ↳  0     a                    1x[CPU:2.0+][Spot]  4 mins ago  8s             -             0            CANCELLED     
 ↳  1     b                    1x[CPU:2.0]         -           -              -             0            CANCELLED     
 ↳  2     eval1                1x[CPU:2+]          -           -              -             0            CANCELLED     
 ↳  3     eval2                1x[CPU:2+]          -           -              -             0            CANCELLED     

1         1                    -                   5 mins ago  3m 18s         15s           0            FAILED_SETUP  
 ↳  0     a                    2x[CPU:2.0+]        5 mins ago  1m 26s         7s            0            SUCCEEDED     
 ↳  1     b                    1x[CPU:2.0+]        3 mins ago  1m             7s            0            FAILED_SETUP  
 ↳  2     eval1                1x[CPU:2.0]         -           -              -             0            CANCELLED     
 ↳  3     eval2                1x[CPU:2.0]         -           -              -             0            CANCELLED  

Printing out the job info dict shows that the specs field holds what appears to be the job name, while job_name holds 1:

    {
        '_job_id': 4,
        '_task_name': None,
        'resources': '1x[CPU:2.0]',
        'submitted_at': None,
        'status': <ManagedJobStatus.CANCELLED: 'CANCELLED'>,
        'run_timestamp': None,
        'start_at': None,
        'end_at': 1730272252.229629,
        'last_recovered_at': -1.0,
        'recovery_count': 0,
        'job_duration': 0,
        'failure_reason': None,
        'job_id': 1,
        'task_id': 3,
        'task_name': 'eval2',
        '_job_info_job_id': None,
        'job_name': 1,
        'specs': 't-job-pipeline-ec',
        'cluster_resources': '-',
        'region': '-'
    }

This suggests that a database parsing error may have been introduced.
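One way a shift like this can arise is when rows from a `SELECT *` are zipped against a hard-coded column list that no longer matches the table's column order. The sketch below is only an illustration of that failure mode, not SkyPilot's actual code: the table names, column names, and the `stale_columns` list are hypothetical, chosen so the result resembles the dict above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical two-table layout; NOT SkyPilot's real schema.
cur.execute('CREATE TABLE spot (job_id INTEGER, task_name TEXT)')
cur.execute('CREATE TABLE job_info (spot_job_id INTEGER, name TEXT)')
# A later migration appends a column to the end of the first table.
cur.execute('ALTER TABLE spot ADD COLUMN specs TEXT')

cur.execute("INSERT INTO spot VALUES (1, 'eval2', NULL)")
cur.execute("INSERT INTO job_info VALUES (1, 't-job-pipeline-ec')")

cur.execute('SELECT * FROM spot '
            'LEFT JOIN job_info ON spot.job_id = job_info.spot_job_id')
row = cur.fetchone()
# Column names as the database actually reports them for this query.
actual_columns = [d[0] for d in cur.description]

# A hard-coded list that assumes the new column comes last in the join,
# which is not where ALTER TABLE placed it.
stale_columns = ['job_id', 'task_name', '_job_info_job_id', 'job_name',
                 'specs']

misaligned = dict(zip(stale_columns, row))
aligned = dict(zip(actual_columns, row))

print(misaligned)  # job_name and specs hold each other's neighbours' values
print(aligned)     # name holds the job name, specs is NULL
```

Deriving the keys from `cursor.description` (or selecting columns explicitly instead of using `SELECT *`) keeps the mapping correct even after a migration changes the column order.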

Version & Commit info: