Closed: xzrderek closed 1 year ago
Thanks for the question @xzrderek! Compared to the normal sky launch, sky spot launch first uploads the working directory to a cloud bucket so that the folder persists across spot preemptions and recoveries.
According to the log, the problem seems to occur while uploading the working directory to the GCS bucket, which might be related to the fix in #2125, cc'ing @romilbhardwaj.
Thanks for catching this @xzrderek!
This seems to be related to gsutil failing for symlinks (test_data is a symlink). See snippet below:
(base) ➜ test_data git:(v4.18.0) ls -la
total 0
drwxr-xr-x 5 romilb staff 160 Jun 23 16:29 .
drwxr-xr-x 35 romilb staff 1120 Jun 23 16:29 ..
drwxr-xr-x 4 romilb staff 128 Jun 23 16:29 fsmt
lrwxr-xr-x 1 romilb staff 17 Jun 23 16:29 test_data -> seq2seq/test_data
drwxr-xr-x 10 romilb staff 320 Jun 23 16:29 wmt_en_ro
(base) ➜ test_data git:(v4.18.0) gsutil -m -o "GSUtil:parallel_process_count=1" rsync -r . gs://romilb-bench
Building synchronization state...
Caught non-retryable exception while listing file://.: [Errno 2] No such file or directory: './test_data'
CommandException: Caught non-retryable exception - aborting rsync
@landscapepainter will you be able to take a look at this?
@xzrderek Thanks for taking the time to report this! Will look into this shortly.
@xzrderek It seems the transformers repository that the example tells you to git clone contains a dangling symlink, just as you mentioned: transformers/examples/legacy/seq2seq/test_data/test_data. As a temporary workaround, I suggest either removing that file from your local repository before running sky spot launch -n bert-qa bert_qa.yaml, or cloning the newest version of the transformers repo :). We will fix this by handling dangling symlinks gracefully and by updating the example bert_qa.yaml to point to a newer version of the transformers repo.
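To check whether a working directory contains other links like this before launching, something along these lines may help. It is only a minimal sketch, assuming Python 3 is available locally; find_dangling_symlinks is a hypothetical helper written for illustration and is not part of SkyPilot or gsutil.

```python
import os


def find_dangling_symlinks(root):
    """Return the sorted paths under root that are symlinks with missing targets."""
    dangling = []
    # os.walk does not follow symlinks by default; a symlink whose target is
    # missing is reported among the file names of its parent directory.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself, while exists() follows it,
            # so this pair is true only for links pointing at nothing.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)
```

Calling find_dangling_symlinks("transformers") from the directory that holds the clone should list any such links, which can then be removed by hand before running the spot launch.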
I was following the end-to-end example of fine-tuning a BERT model in the documentation here, and ran into an issue when attempting to launch a spot instance.
Here's the error I see when launching a spot instance with this command:
sky spot launch -n bert-qa bert_qa.yaml
When I looked for the file in the repo our data is cloned from, transformers/examples/legacy/seq2seq/test_data/test_data does not seem to exist, so this might not be a SkyPilot issue. However, I wanted to flag it to avoid confusion for future users. For reference, here is my bert_qa.yaml: