skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.72k stars 496 forks

Managed Spot Job Example Not Working #2127

Closed (xzrderek closed this issue 1 year ago)

xzrderek commented 1 year ago

I was following the end-to-end example in the documentation for fine-tuning a BERT model, and ran into an issue when attempting to launch a spot job.

Here's the error I see when launching a spot instance with this command: sky spot launch -n bert-qa bert_qa.yaml

(sky) xzrderek@xzrmbpro transformers % sky spot launch -n bert-qa bert_qa.yaml
Task from YAML spec: bert_qa.yaml
Launching a new spot task 'bert-qa'. Proceed? [Y/n]: Y
I 06-23 16:10:40 execution.py:720] Translating workdir to SkyPilot Storage...
I 06-23 16:10:40 execution.py:745] Workdir '~/transformers' will be synced to cloud storage 'skypilot-workdir-xzrderek-11cc2018'.
I 06-23 16:10:40 execution.py:817] Uploading sources to cloud storage. See: sky storage ls
I 06-23 16:10:43 storage.py:1644] Created GCS bucket skypilot-workdir-xzrderek-11cc2018 in US-CENTRAL1 with storage class STANDARD
⠴ Syncing ~/transformers to gs://skypilot-workdir-xzrderek-11cc2018/
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] WARNING: gsutil rsync uses hashes when modification time is not available at
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] both the source and destination. Your crcmod installation isn't using the
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] module's C extension, so checksumming will run very slowly. If this is your
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] first rsync since updating gsutil, this rsync can take significantly longer than
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] usual. For help installing the extension, please see "gsutil help crcmod".
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] Building synchronization state...
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] If you experience problems with multiprocessing on MacOS, they might be related to https://bugs.python.org/issue33725. You can disable multiprocessing by editing your .boto config or by adding the following flag to your command: `-o "GSUtil:parallel_process_count=1"`. Note that multithreading is still available even if you disable multiprocessing.
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] Caught non-retryable exception while listing file:///Users/xzrderek/transformers: [Errno 2] No such file or directory: '/Users/xzrderek/transformers/examples/legacy/seq2seq/test_data/test_data'
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 data_utils.py:224] CommandException: Caught non-retryable exception - aborting rsync
E 06-23 16:10:45 data_utils.py:224]
E 06-23 16:10:45 storage.py:820] Could not upload ~/transformers to store name skypilot-workdir-xzrderek-11cc2018.
sky.exceptions.StorageUploadError: Upload to bucket failed for store skypilot-workdir-xzrderek-11cc2018. Please check the logs.

When I looked for the file in the repo the example clones from, transformers/examples/legacy/seq2seq/test_data/test_data does not seem to exist, so this might not be a SkyPilot issue. However, I wanted to flag it to avoid confusion for future users. For reference, here is my bert_qa.yaml:

# bert_qa.yaml
name: bert_qa

resources:
  accelerators: V100:1

# Assume your working directory is under `~/transformers`.
# To make this example work, please run the following command:
# git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.18.0
workdir: ~/transformers

file_mounts:
  /checkpoint:
    name: derektest
    mode: MOUNT

setup: |
  # Fill in your wandb key: copy from https://wandb.ai/authorize
  # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
  # to pass the key in the command line, during `sky spot launch`.
  echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc

  pip install -e .
  cd examples/pytorch/question-answering/
  pip install -r requirements.txt
  pip install wandb

run: |
  cd ./examples/pytorch/question-answering/
  python run_qa.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 50 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --report_to wandb \
  --run_name $SKYPILOT_TASK_ID \
  --output_dir /checkpoint/bert_qa/ \
  --save_total_limit 10 \
  --save_steps 1000

Michaelvll commented 1 year ago

Thanks for the question @xzrderek! Unlike a normal sky launch, sky spot launch first uploads the working directory to a cloud bucket so that the folder persists across spot preemptions and recoveries. According to the log, the failure occurs while uploading the working directory to the GCS bucket, which might be related to the fix in #2125. cc'ing @romilbhardwaj.

romilbhardwaj commented 1 year ago

Thanks for catching this @xzrderek!

This seems to be related to gsutil failing for symlinks (test_data is a symlink). See snippet below:

(base) ➜  test_data git:(v4.18.0) ls -la
total 0
drwxr-xr-x   5 romilb  staff   160 Jun 23 16:29 .
drwxr-xr-x  35 romilb  staff  1120 Jun 23 16:29 ..
drwxr-xr-x   4 romilb  staff   128 Jun 23 16:29 fsmt
lrwxr-xr-x   1 romilb  staff    17 Jun 23 16:29 test_data -> seq2seq/test_data
drwxr-xr-x  10 romilb  staff   320 Jun 23 16:29 wmt_en_ro
(base) ➜  test_data git:(v4.18.0) gsutil -m -o "GSUtil:parallel_process_count=1" rsync -r . gs://romilb-bench
Building synchronization state...
Caught non-retryable exception while listing file://.: [Errno 2] No such file or directory: './test_data'
CommandException: Caught non-retryable exception - aborting rsync
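
To illustrate why a tool that lists and then stats each entry would report [Errno 2] here, the following is a minimal Python sketch. It assumes the layout in the listing above (a relative link test_data -> seq2seq/test_data inside seq2seq/test_data) and recreates it in a scratch directory; it does not call gsutil, and the variable names are only illustrative.

```python
import os
import tempfile

# Recreate the layout from the listing in a throwaway directory (assumed layout).
with tempfile.TemporaryDirectory() as root:
    test_data = os.path.join(root, "seq2seq", "test_data")
    os.makedirs(os.path.join(test_data, "fsmt"))
    os.makedirs(os.path.join(test_data, "wmt_en_ro"))

    # Same link as in the listing: test_data -> seq2seq/test_data.
    # A relative target is resolved from the directory that contains the link,
    # so it points at <...>/seq2seq/test_data/seq2seq/test_data, which is absent.
    link = os.path.join(test_data, "test_data")
    os.symlink("seq2seq/test_data", link)

    listed = "test_data" in os.listdir(test_data)  # the directory entry exists
    is_link = os.path.islink(link)                 # and it is a symlink
    resolves = os.path.exists(link)                # but following it fails

    try:
        os.stat(link)  # stat follows the link to its missing target
        stat_errno = None
    except FileNotFoundError as exc:
        stat_errno = exc.errno  # ENOENT, reported as [Errno 2]

print(listed, is_link, resolves, stat_errno)
```

The entry shows up in the directory listing, yet any call that follows the link fails with "No such file or directory", which is consistent with the error gsutil reports.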

@landscapepainter will you be able to take a look at this?

landscapepainter commented 1 year ago

@xzrderek Thanks for taking the time to report this! Will look into this shortly.

landscapepainter commented 1 year ago

@xzrderek It seems the version of the transformers repository that our example says to git clone contains a dangling symlink at the path you mentioned, transformers/examples/legacy/seq2seq/test_data/test_data. As a temporary workaround, I suggest either removing that file from your local repository before running sky spot launch -n bert-qa bert_qa.yaml, or cloning the newest version of the transformers repo :). We will fix this by adding error handling for dangling symlinks and by updating the example bert_qa.yaml to point to a newer version of the transformers repo.
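
For anyone applying the workaround by hand, here is a minimal Python sketch for locating dangling symlinks in a working directory before launching. The helper name find_dangling_symlinks is hypothetical and not part of SkyPilot; it only uses the standard library and does not delete anything.

```python
import os


def find_dangling_symlinks(root):
    """Return sorted paths of symlinks under root whose targets do not exist.

    Hypothetical helper for illustration; not a SkyPilot API.
    """
    dangling = []
    # os.walk does not follow symlinked directories by default, and a dangling
    # link is reported among the file names of its parent directory.
    for dirpath, dirnames, filenames in os.walk(os.path.expanduser(root)):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)


# Example use (path taken from the YAML's workdir):
#   for path in find_dangling_symlinks("~/transformers"):
#       print(path)  # review, then remove with os.unlink(path) if appropriate
```

Printing the paths first and removing them only after review keeps the workaround from deleting links that are merely missing an optional target.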