skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Bug: `stream_logs` Fails Due to Incorrect Job ID Handling and Duplicate Job Names in Managed Jobs #4273

Closed andylizf closed 2 weeks ago

andylizf commented 2 weeks ago

The stream_logs function for managed jobs fails when fetching logs by job name, because it mishandles job IDs and duplicate job names.

Steps to reproduce:

  1. Start a managed job:
    jobs launch ./tests/test_yamls/pipeline_gcp.yaml --cloud gcp
  2. Attempt to fetch logs for the job:
    jobs logs --controller --name=pipeline

This triggers the following error:

Traceback (most recent call last):
  File "<string>", line 71, in <module>
  File "<string>", line 42, in stream_logs
TypeError: sequence item 0: expected str instance, int found

The command terminates with exit code 1.
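For context, this is the TypeError that `str.join` raises whenever its iterable contains a non-string item. The snippet below is a minimal reproduction under the assumption that job records are dicts carrying an integer `job_id`; the `jobs` list and its keys are hypothetical and may not match what `stream_logs` actually receives.

```python
# Hypothetical job records; the real structure inside stream_logs may differ.
jobs = [{'job_id': 1, 'job_name': 'pipeline'}]

message = None
try:
    # str.join accepts only strings, so integer IDs raise TypeError.
    ', '.join(job['job_id'] for job in jobs)
except TypeError as e:
    message = str(e)

# Converting each ID to str before joining avoids the error.
fixed = ', '.join(str(job['job_id']) for job in jobs)
```

Here `message` holds the same text as the traceback above, and `fixed` is a plain comma-separated string of IDs.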

Converting job['job_id'] to str removes the TypeError but does not resolve the issue. Running the same command again leads to another error:

jobs logs --controller --name=pipeline

With the output:

Traceback (most recent call last):
  File "<string>", line 71, in <module>
  File "<string>", line 43, in stream_logs
ValueError: Multiple managed jobs found with name 'pipeline' (Job IDs: 1, 1, 1, 1). Please specify the job_id instead.

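The command still exits with code 1. All four reported IDs are identical, so the name matches a single job that is counted several times rather than several distinct jobs.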
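One possible way to tolerate repeated rows for the same job is to deduplicate the matching IDs before checking for ambiguity. The sketch below is not SkyPilot's implementation: `resolve_job_id`, the record layout, and the `job_name` key are assumptions made for illustration.

```python
from typing import Dict, List, Union

# Hypothetical record type: one dict per row returned for a managed job.
JobRecord = Dict[str, Union[int, str]]


def resolve_job_id(jobs: List[JobRecord], name: str) -> int:
    """Return the unique job ID whose rows match `name` (illustrative helper)."""
    # A set collapses repeated rows that share the same job ID.
    matching_ids = sorted({job['job_id'] for job in jobs
                           if job['job_name'] == name})
    if not matching_ids:
        raise ValueError(f'No managed job found with name {name!r}.')
    if len(matching_ids) > 1:
        # Convert IDs to str before joining to avoid the TypeError.
        ids = ', '.join(str(job_id) for job_id in matching_ids)
        raise ValueError(
            f'Multiple managed jobs found with name {name!r} '
            f'(Job IDs: {ids}). Please specify the job_id instead.')
    return matching_ids[0]


# Four rows with the same ID, as in the error message above.
rows = [{'job_id': 1, 'job_name': 'pipeline'} for _ in range(4)]
resolved = resolve_job_id(rows, 'pipeline')
```

With this approach the error is raised only when the name matches genuinely different job IDs.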

Expected Behavior

The stream_logs function should handle integer job IDs without raising a TypeError and should resolve a job name to its job even when that job is listed more than once, so that users do not have to look up and pass the job ID manually.

Thanks to @euclidgame for identifying this issue.