skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Handling error from step.run() #2

Closed: infwinston closed this issue 3 years ago

infwinston commented 3 years ago

We need to catch errors from each step.run(). https://github.com/concretevitamin/sky-experiments/blob/3e9bac359da41187060b348be48a6400704f25aa/prototype/sky/execution.py#L169

Apparently ray up failed, but sky still reports that execution finished:

  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/gcp/node.py", line 267, in <listcomp>
    self.create_instance(
  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/gcp/node.py", line 440, in create_instance
    operation = self.resource.instances().insert(
  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/googleapiclient/http.py", line 937, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-west1-a/instances?alt=json returned "The resource 'projects/intercloud-320520/zones/us-west1-a/acceleratorTypes/nvidia-tesla-tpu-v3-8' was not found". Details: "[{'message': "The resource 'projects/intercloud-320520/zones/us-west1-a/acceleratorTypes/nvidia-tesla-tpu-v3-8' was not found", 'domain': 'global', 'reason': 'notFound'}]">
Step 000_provision finished

---------------------------
  Sky execution finished
---------------------------
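
One way to surface this kind of failure, as a minimal sketch: have the runner check each step's exit code and stop at the first non-zero one, instead of printing "finished" unconditionally. The `Step` class, `run_steps`, and `StepFailedError` below are hypothetical names for illustration and are not the code in execution.py; only the idea of calling `step.run()` on each step comes from the issue.

```python
import subprocess
import sys
from typing import List


class StepFailedError(RuntimeError):
    """Raised when a step exits with a non-zero code (hypothetical name)."""


class Step:
    """Hypothetical stand-in for a step: a name plus a command to run."""

    def __init__(self, name: str, argv: List[str]):
        self.name = name
        self.argv = argv

    def run(self) -> int:
        # subprocess.run waits for the child and returns a CompletedProcess;
        # its returncode attribute is the child's exit status.
        return subprocess.run(self.argv).returncode


def run_steps(steps: List[Step]) -> None:
    """Run steps in order and stop at the first one that fails."""
    for step in steps:
        returncode = step.run()
        if returncode != 0:
            raise StepFailedError(
                f'Step {step.name} failed with exit code {returncode}')
        print(f'Step {step.name} finished')
    print('Sky execution finished')
```

With this shape, a failing provision step raises before the final "finished" banner is printed, provided the exit code that `run()` returns really is the command's own.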
concretevitamin commented 3 years ago

+1

franklsf95 commented 3 years ago

Yeah, I'll take care of this. It used to work, but piping through tee messed it up.

concretevitamin commented 3 years ago

We can preserve the exit code, but we need to change "2>&1 | tee" to something a bit more complex: https://stackoverflow.com/questions/692000/how-do-i-write-stderr-to-a-file-while-using-tee-with-a-pipe
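
For context: in a shell pipeline such as `cmd 2>&1 | tee log`, the pipeline's exit status is by default that of the last command (`tee`), so a failure in `cmd` is hidden. As a rough sketch of an alternative to changing the shell command, the tee behavior can be done in Python so the child's own exit code is returned. The function name `run_with_log` and the `log_path` parameter are hypothetical, and the thread does not say which approach was eventually adopted.

```python
import subprocess
import sys
from typing import List


def run_with_log(argv: List[str], log_path: str) -> int:
    """Run argv, echo its output to the terminal and to log_path.

    Returns the child's exit code, which is not masked by the logging.
    """
    with open(log_path, 'w') as log_file:
        proc = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout, like 2>&1
            text=True,
        )
        assert proc.stdout is not None
        # Copy each output line to both destinations, like tee.
        for line in proc.stdout:
            sys.stdout.write(line)
            log_file.write(line)
        # wait() returns the exit status of the command itself.
        return proc.wait()
```

If the command has to stay a shell pipeline, bash's `set -o pipefail` or `${PIPESTATUS[0]}` are other ways to recover the first command's status.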

franklsf95 commented 3 years ago

I'll work on this today.