skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

FAILED_CONTROLLER after a preemption, no error in logs #3300

Closed: Hubert-Bonisseur closed this issue 1 month ago

Hubert-Bonisseur commented 5 months ago

My job failed due to FAILED_CONTROLLER after a preemption.

sky spot logs --controller 5 doesn't show any error:

(small-yt, pid=45563) I 03-12 15:33:03 spot_utils.py:92] ==================================
(small-yt, pid=45563) I 03-12 15:33:23 spot_utils.py:83] === Checking the job status... ===
(small-yt, pid=45563) I 03-12 15:33:26 spot_utils.py:89] Job status: JobStatus.RUNNING
(small-yt, pid=45563) I 03-12 15:33:26 spot_utils.py:92] ==================================
(small-yt, pid=45563) I 03-12 15:33:46 spot_utils.py:83] === Checking the job status... ===
(small-yt, pid=45563) I 03-12 15:33:49 spot_utils.py:89] Job status: JobStatus.RUNNING
(small-yt, pid=45563) I 03-12 15:33:49 spot_utils.py:92] ==================================
Shared connection to 35.204.42.245 closed

Please tell me if there is any other info I can share to help understand what may have caused this. I am using GCS, and SkyPilot version 0.5.0.
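In case it is useful, here is what else I could check from the client side (a sketch; the commands are from the SkyPilot 0.5.x CLI, and job ID 5 matches the one above):

sky spot queue        # list managed spot jobs and the state recorded for each (e.g. FAILED_CONTROLLER)
sky status --refresh  # check whether the spot controller cluster itself is still UP
sky spot logs 5       # the job's own logs, as opposed to the --controller logs shown above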

Michaelvll commented 5 months ago

Thanks for reporting this @Hubert-Bonisseur! This is quite weird. It is possible that the controller process was somehow killed.

Could you share how many spot jobs you were running concurrently, and whether you have seen any issues with the other spot jobs?

It would also be nice if you could share the task YAML you were running :)
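To help narrow down whether the controller process died, one thing worth checking (a sketch; the controller cluster name pattern and the process name to grep for are assumptions):

sky status                       # the spot controller cluster is typically named sky-spot-controller-<hash>
ssh sky-spot-controller-<hash>   # SkyPilot writes an SSH config entry for it
ps aux | grep -i spot            # on the controller VM: is the per-job controller process still alive?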

Hubert-Bonisseur commented 5 months ago

I was running only one job at that time. I have since launched 2 concurrent spot jobs and they are working fine so far, but there hasn't been a preemption yet. I will update this thread once one occurs.

Here is the task.yml:

name: small-yt

resources:
  cloud: gcp
  region: europe-west4
  cpus: 12+
  accelerators: A100
  memory: 6+

  disk_size: 500
  disk_tier: 'medium'

file_mounts:
  ~/secret/service_account.json: /Users/datalab/épellations/STT/finetune/secrets/finetuning-414911-4d293f61509f.json

envs:
  COMMIT: b11a10fb86feef059d4798ff883ea719e4169218
  MODEL_ID: small-yt-V2
  NUM_WORKERS: 12

setup: |
  echo "Begin setup."
  sudo apt-get update
  sudo apt-get -y install ffmpeg
  cd ~/sky_workdir
  git clone git@gitlab.company.tech:data-science/speech/finetune.git
  cd finetune
  git checkout $COMMIT
  pip install -r requirements.txt
  pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
  mkdir ~/checkpoints/
  if gsutil ls gs://finetuning-checkpoints/$MODEL_ID; then
    gsutil -m cp -r gs://finetuning-checkpoints/$MODEL_ID/* ~/checkpoints/
  else
    echo "Remote folder does not exist. Starting a new training run"
  fi  
  echo "Setup complete."

run: |
  echo "Beginning task."
  cd finetune
  export GOOGLE_APPLICATION_CREDENTIALS=$(realpath ~/secret/service_account.json)
  export PYTHONPATH=$PWD
  python finetune run configs/training_config_mosaicML.yml

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 120 days with no activity. Remove the stale label or comment, or this will be closed in 10 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been stalled for 10 days with no activity.