spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

GCP: 30 minutes for runtimeconfig.v1beta1.waiter Timeout expired #108

Closed: turian closed this issue 2 years ago

turian commented 2 years ago

On GCP, I am using a spotty.yaml that previously worked but no longer does. I suspect it's because the volume is large (2 TB) and some kind of timeout is being hit.

When I run spotty start, it takes about 31 minutes and then fails with the following error:

Waiting for the stack to be created...
  - launching the instance...
  - running the Docker container...
  Error:
  ------
  Deployment "spotty-instance-hearpreprocess-hearpreprocess-i2-joseph" failed.
  Error: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"504","ResourceErrorMessage":"Timeout expired."}

Here is my config:


project:
  name: hearpreprocess
  syncFilters:
    - exclude:
        - '*/__pycache__/*'
        - .git/*
        - .idea/*
        - .mypy_cache/*
        - _workdir/*
        - hear-2021*.tar.gz
        - hear-2021*/*
        - hearpreprocess.egg-info/*
        - tasks/*

containers:
  - projectDir: /workspace/project
    image: turian/hearpreprocess
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: hearpreprocess-i2-joseph
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: n1-standard-8
      preemptibleInstance: False
      gpu:
        type: nvidia-tesla-v100
        count: 1
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      volumes:
        - name: workspace
          parameters:
            size: 2000
turian commented 2 years ago

Hmmm I'm even getting this now with a spotty.yaml that used to fire up very quickly for me. Please help!

# You must delete disks manually on GCP :\
# https://console.cloud.google.com/compute/disks?project=hear2021-evaluation

project:
  name: heareval
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - embeddings/*
        - .mypy_cache/*
        - lightning_logs/*
        - logs/*
        - heareval.egg-info/*
        - pretrained/*
        - wandb/*

containers:
  - projectDir: /workspace/project
    #file: docker/Dockerfile-cuda11.2
    image: turian/heareval
    #image: turian/heareval:cuda11.2
    ports:
      # TensorBoard
      - containerPort: 6006
        hostPort: 6006
      # Jupyter
      - containerPort: 8888
        hostPort: 8888
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']
instances:
#          parameters:
  - name: heareval-i1-joseph
    provider: gcp
    parameters:
      # https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
      zone: europe-west4-a
      # V100
      machineType: n1-highmem-16
      #machineType: n1-highmem-4
      #machineType: n1-highmem-8
      #machineType: n1-standard-4
      #machineType: n1-standard-16
      #machineType: n1-standard-8
      # One CPU seems to exhaust memory and crash
      #machineType: n1-standard-1
      ## A100. only west3 or maybe west4? :\
      #machineType: a2-highgpu-1g
      preemptibleInstance: False
#      spotInstance: True
      gpu:
        type: nvidia-tesla-v100
        #type: nvidia-tesla-a100
        #count: 1
        count: 2
        #count: 4
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      # https://github.com/spotty-cloud/spotty/issues/102
      #imageUri: projects/ml-images/global/images/family/common-dl-gpu-debian-10
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      ports: [6006, 8888]
#      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 250
# Not implemented for GCP, all volumes will be retained
#            deletionPolicy: retain
            mountDir: /workspace
Operation completed over 2 objects/7.2 KiB.

Creating disks...
  - disk "heareval-heareval-i1-joseph-workspace" will be attached

Preparing the deployment template...
  - image URL: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
  - zone: europe-west4-a
  - on-demand VM
  - GPUs: 2 x nvidia-tesla-v100

Volumes:
+-----------+------------+------+-----------------+
| Name      | Mount Path | Type | Deletion Policy |
+===========+============+======+=================+
| workspace | /workspace | Disk | Retain Volume   |
+-----------+------------+------+-----------------+

Waiting for the stack to be created...
  - launching the instance...
  - running the Docker container...
  Error:
  ------
  Deployment "spotty-instance-heareval-heareval-i1-joseph" failed.
  Error: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"412","ResourceErrorMessage":"Failure condition satisfied."}

Why?

turian commented 2 years ago

@apls777 I have referred to https://github.com/spotty-cloud/spotty/issues/103

Are there other ways of removing the docker cache? Perhaps it's because the docker image was created a while ago?

apls777 commented 2 years ago

@turian I was able to reproduce this problem a couple of times, but not anymore for some reason. I added logging for the startup script to check the error next time it happens.

Please check out the dev branch and install Spotty from there: pip install -e /path/to/spotty. Check the version with spotty -V (it should say 1.3.3) and try to start an instance again.

If you get the same error, connect to the host OS with spotty sh -H, then run cat /var/log/startup-script.log on the instance to check the logs.

turian commented 2 years ago

@apls777 So I tried it with master (1.3.3) and I can spotty sh in, but then I get:

Use the "spotty start -C" command to start it.

Pane is dead (status 0, Tue Oct 19 22:42:32 2021)

I tried spotty start -C and it seems to pull the container, but then when I spotty sh I get the same message.

If I do spotty sh -H and cat /var/log/startup-script.log I have:

..........................................................................................................................................................................................................................................................................................................................

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.

+ mkdir -pm 777 /tmp/spotty
+ mkdir -pm 777 /tmp/spotty/containers
+ /tmp/spotty/instance/scripts/startup/02_mount_volumes.sh
+ DEVICE_NAMES=("disk-1")
+ MOUNT_DIRS=("/workspace")
+ for i in ${!DEVICE_NAMES[*]}

7cd228bee5c5: Waiting
7efa25062fe9: Waiting
ec4c842b9cbf: Waiting
4f4fb700ef54: Pulling fs layer
28bb32c16dbe: Waiting
613e8fb47b11: Waiting
22a020d91470: Pulling fs layer
9956b4631a70: Waiting
f2cdd5ac7c27: Waiting
b49768bdc151: Pulling fs layer
a66aaf282e0f: Pulling fs layer
68297c525da0: Waiting
ffd8d2e41dd5: Waiting
0d29c87088e8: Pulling fs layer
31ba2cb05766: Waiting
fd718af0a522: Waiting
9e61b31593ca: Waiting
c37d796e6d5f: Pulling fs layer
b4ebd545da9e: Waiting
b4eeafcafe38: Waiting
6ff9d09e1219: Pulling fs layer
1ecb
0d29c87088e8: Verifying Checksum
0d29c87088e8: Download complete
c37d796e6d5f: Verifying Checksum
c37d796e6d5f: Download complete
6ff9d09e1219: Download complete
bbd96c0ec1ae: Download complete
32680e958fcb: Pull complete
1585ca522578: Pull complete
31df848d3cfc: Verifying Checksum
31df848d3cfc: Download complete
ecb2a50cdff9: Pull complete
17f1afbc84d8: Pull complete
70567fad67e7: Pull complete
be63cf966a71: Verifying Checksum
be63cf966a71: Download complete
d770d5f8ec7d: Pull complete
142d1f51202b: Pull complete
be63cf966a71: Pull complete
fcf8dbce2ffa: Pull complete
f249b9a57a97: Pull complete
c9dba10588ba: Pull complete
6327f62e21a4: Pull
ec4c842b9cbf: Pull complete
c7b14aad8447: Pull complete
28bb32c16dbe: Pull complete
ffd8d2e41dd5: Pull complete
fd718af0a522: Pull complete
cb1b5a748fca: Pull complete
1698eb990c48: Pull complete
b4eeafcafe38: Pull complete
82e225063d60: Pull complete
de66f7b3f8cb: Pull complete
1ecb75760dad: Pull complete
9055f1b67372: Pull complete
31ba2cb05766: Pull complete
b4ebd545da9e: Pull complete
b96b1f920bc2: Pull complete
unexpected EOF
+ PULL_EXIT_CODE=1
+ '[' 1 -ne 125 ']'
+ break
+ '[' 1 -ne 0 ']'
+ exit 1

How do I resolve this?
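As an aside, the trace at the end of this log shows the shape of the pull loop: the startup script appears to retry docker pull only while it exits with code 125 (a Docker daemon-side failure) and bails out on anything else, which is why the exit-1 "unexpected EOF" aborts immediately. A self-contained sketch of that pattern (fake_pull is a hypothetical stand-in for docker pull, so the sketch runs anywhere):

```shell
# Sketch of the retry pattern visible in the startup trace above: retry
# only while the pull exits with 125, break on success or any other
# failure. "fake_pull" is a hypothetical stand-in for "docker pull"
# that fails twice with 125 and then succeeds.
ATTEMPTS=0
fake_pull() {
  ATTEMPTS=$((ATTEMPTS + 1))
  if [ "$ATTEMPTS" -lt 3 ]; then
    return 125            # transient daemon-side error: worth retrying
  fi
  return 0                # third attempt succeeds
}

while true; do
  if fake_pull; then
    PULL_EXIT_CODE=0
  else
    PULL_EXIT_CODE=$?
  fi
  if [ "$PULL_EXIT_CODE" -ne 125 ]; then
    break                 # success, or a hard error such as "unexpected EOF"
  fi
done

echo "attempts=$ATTEMPTS exit=$PULL_EXIT_CODE"   # attempts=3 exit=0
```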

apls777 commented 2 years ago

Can you share your spotty.yaml? Is that the one you posted in the previous message?

apls777 commented 2 years ago

@turian The last config worked for me when I removed the instances[]...volumes[]...mountDir parameter (see #106). The container crashed the first time for some reason, but I was able to start it again with spotty start -C. Could you please test it with your latest image?
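For reference, here is roughly what the volumes section looks like with that parameter removed (a sketch based on the config posted above; without mountDir, Spotty chooses the host mount directory itself, e.g. /mnt/heareval-heareval-i1-joseph-workspace as seen in the later logs):

```yaml
instances:
  - name: heareval-i1-joseph
    provider: gcp
    parameters:
      # ...zone, machineType, gpu, imageUri as above...
      volumes:
        - name: workspace
          parameters:
            size: 250
            # no mountDir here; the container still sees the disk at
            # /workspace via containers[].volumeMounts
```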

turian commented 2 years ago

I removed instances[]...volumes[]...mountDir but left containers[]...volumeMounts. I am still getting errors even after spotty start -C.

Here is my log:


Downloading driver from GCS location and install: gs://nvidia-drivers-us-public/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.73.01...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.

+ mkdir -pm 777 /tmp/spotty
+ mkdir -pm 777 /tmp/spotty/containers
+ /tmp/spotty/instance/scripts/startup/02_mount_volumes.sh
+ DEVICE_NAMES=("disk-1")
+ MOUNT_DIRS=("/mnt/heareval-heareval-i1-joseph-workspace")
+ for i in ${!DEVICE_NAMES[*]}
+ DEVICE=/dev/disk/by-id/google-disk-1
+ MOUNT_DIR=/mnt/heareval-heareval-i1-joseph-workspace
+ blkid -o value -s TYPE /dev/disk/by-id/google-disk-1
ext4

At destination listing 640000...
...
At destination listing 880000...
Starting synchronization...
Copying gs://spotty-heareval-d39m1cyg8zdm-europe-west4/projec
fd718af0a522: Waiting
ffd8d2e41dd5: Waiting
31df848d3cfc: Pulling fs layer
cb1b5a748fca: Waiting
b4eeafcafe38: Waiting
4f4fb700ef54: Pulling fs layer
22a020d91470: Pulling fs layer
b96b1f920bc2: Waiting
b49768bdc151: Pulling fs layer
9055f1b67372: Waiting
b4ebd545da9e: Waiting
1ecb75760dad: Waiting
a66aaf282e0f: Pulling fs layer
31df848d3cfc: Waiting
0d29c87088e8: Pulling fs layer
cb4601dc80f0: Waiting
b49768bdc151: Waiting
a66aaf282e0f: Waiting
0d29c87088e8: Waiting
c37d796e6d5f: Pulling fs layer
31ba2cb05766: Waiting
6ff9d09e1219: Pulling fs layer
bbd96c0ec1ae: Pulling fs layer
6ff9d09e1219: Waiting
c37d796e6d5f: Waiting
47061572d235: Verifying Checksum
47061572d235: Download
142d1f51202b: Pull complete
be63cf966a71: Pull complete
fcf8dbce2ffa: Pull complete
f249b9a57a97: Pull complete
c9dba10588ba: Pull complete
6327f62e21a4: Pull complete
8fa100cc0e63: Pull complete
b8941442cede: Pull complete
788df75efffd: Pull complete
9956b4631a70: Pull complete
613e8fb47b11: Pull complete
68297c525da0: Pull complete
ec4c842b9cbf: Pull complete
c7b14aad8447: Pull complete
28bb32c16dbe: Pull complete
ffd8d2e41dd5: Pull complete
fd718af0a522: Pull complete
cb1b5a748fca: Pull complete
1698eb990c48: Pull complete
b4eeafcafe38: Pull complete
82e225063d60: Pull complete
de66f7b3f8cb: Pull complete
1ecb75760dad: Pull complete
9055f1b67372: Pull complete
31ba2cb05766: Pull complete
b4ebd545da9e: Pull complete
b96b1f920bc2: Pull complete
cb4601dc80f0: Pull complete
31df848d3cfc: Pull complete
4f4fb700ef54: Pull complete
22a020d91470: Pull complete
b49768bdc151: Pull complete
a66aaf282e0f: Pull complete
0d29c87088e8: Pull complete
c37d796e6d5f: Pull complete
6ff9d09e1219: Pull complete
bbd96c0ec1ae: Pull complete
Digest: sha256:d49aa3061dee04a36fb84701e5f913d8c7d61964315153bcf08b5a77dc7bb564
Status: Downloaded newer image for turian/heareval:latest
docker.io/turian/heareval:latest
+ '[' 0 -ne 125 ']'
+ break
+ '[' 0 -ne 0 ']'
+ printf 'Starting container... '
Starting container... ++ nvidia-smi
++ echo '--gpus all'
+ docker run --gpus all -td --shm-size 20G -p 6006:6006 -p 8888:8888 -v /mnt/heareval-heareval-i1-joseph-workspace:/workspace:rw --name spotty-heareval-heareval-i1-joseph-default turian/heareval /bin/sh
docker: error during connect: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/create?name=spotty-heareval-heareval-i1-joseph-default": EOF.
See 'docker run --help'.
apls777 commented 2 years ago

Does it work if you run this image on a VM without GPUs? Or could you please reproduce this error on a machine with only 1 GPU and send me your spotty.yaml? (I have a limit of 1 GPU on my GCP account.)

turian commented 2 years ago

@apls777 Okay, so I tried this again with one GPU.

I did spotty start, then tried spotty start -C, but the pane is still dead.

Here is the one GPU spotty.yaml I am trying to use:

project:
  name: heareval
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - embeddings/*
        - .mypy_cache/*
        - lightning_logs/*
        - logs/*
        - heareval.egg-info/*
        - pretrained/*
        - wandb/*

containers:
  - projectDir: /workspace/project
    #file: docker/Dockerfile-cuda11.2
    image: turian/heareval
    #image: turian/heareval:cuda11.2
    ports:
      # TensorBoard
      - containerPort: 6006
        hostPort: 6006
      # Jupyter
      - containerPort: 8888
        hostPort: 8888
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']
instances:
  - name: heareval-i1-joseph
    provider: gcp
    parameters:
      # https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
      zone: europe-west4-a
      machineType: n1-highmem-4
      #machineType: n1-highmem-8
      #machineType: a2-highgpu-1g
      preemptibleInstance: False
      gpu:
        type: nvidia-tesla-v100
        count: 1
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      ports: [6006, 8888]
      volumes:
        - name: workspace
          parameters:
            size: 250

Here's the end of the log from spotty sh -H:

+ /tmp/spotty/instance/scripts/startup/01_prepare_instance.sh
+ apt-get install -y jq
Reading package lists...
Building dependency tree...
Reading state information...
jq is already the newest version (1.5+dfsg-2+b1).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
+ echo 'bind-key x kill-pane'
++ dirname /tmp/spotty/instance/scripts/container_bash.sh
+ mkdir -p /tmp/spotty/instance/scripts
+ cat
+ chmod +x /tmp/spotty/instance/scripts/container_bash.sh
+ CONTAINER_BASH_ALIAS=container
+ echo 'alias container="/tmp/spotty/instance/scripts/container_bash.sh"'
+ echo 'alias container="/tmp/spotty/instance/scripts/container_bash.sh"'
+ command -v nvidia-smi
+ DRIVER_INSTALLER_PATH=/opt/deeplearning/install-driver.sh
+ '[' -f /opt/deeplearning/install-driver.sh ']'
+ /opt/deeplearning/install-driver.sh
install linux headers: linux-headers-4.19.0-17-cloud-amd64

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
Building dependency tree...
Reading state information...
linux-headers-4.19.0-17-cloud-amd64 is already the newest version (4.19.194-3).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
DRIVER_VERSION: 460.73.01
Downloading driver from GCS location and install: gs://nvidia-drivers-us-public/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.73.01...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.

+ mkdir -pm 777 /tmp/spotty
+ mkdir -pm 777 /tmp/spotty/containers
+ /tmp/spotty/instance/scripts/startup/02_mount_volumes.sh
+ DEVICE_NAMES=("disk-1")
+ MOUNT_DIRS=("/mnt/heareval-heareval-i1-joseph-workspace")
+ for i in ${!DEVICE_NAMES[*]}
+ DEVICE=/dev/disk/by-id/google-disk-1
+ MOUNT_DIR=/mnt/heareval-heareval-i1-joseph-workspace
+ blkid -o value -s TYPE /dev/disk/by-id/google-disk-1
ext4
+ mkdir -p /mnt/heareval-heareval-i1-joseph-workspace
+ mount /dev/disk/by-id/google-disk-1 /mnt/heareval-heareval-i1-joseph-workspace
+ chmod 777 /mnt/heareval-heareval-i1-joseph-workspace
+ resize2fs /dev/disk/by-id/google-disk-1
resize2fs 1.44.5 (15-Dec-2018)
The filesystem is already 65536000 (4k) blocks long.  Nothing to do!

+ /tmp/spotty/instance/scripts/startup/03_set_docker_root.sh
+ '[' -n '' ']'
+ /tmp/spotty/instance/scripts/startup/04_sync_project.sh
+ '[' -n /mnt/heareval-heareval-i1-joseph-workspace/project ']'
+ mkdir -p /mnt/heareval-heareval-i1-joseph-workspace/project
+ chmod 777 /mnt/heareval-heareval-i1-joseph-workspace/project
+ gsutil -m rsync -r -x '^(\.idea/.*|\.git/.*|(?=(?P<g1>.*?/__pycache__/))(?P=g1).*|_workdir/.*|tasks/.*|embeddings/.*|\.mypy_cache/.*|lightning_logs/.*|logs/.*|heareval\.egg\-info/.*|pretrained/.*|wandb/.*)$' gs://spotty-heareval-d39m1cyg8zdm-europe-west4/project /mnt/heareval-heareval-i1-joseph-workspace/project
Building synchronization state...
At destination listing 10000...
At destination listing 20000...
At destination listing 30000...
At destination listing 40000...
At destination listing 50000...
At destination listing 60000...
At destination listing 70000...
...
At destination listing 670000...
At destination listing 680000...
At destination listing 690000...
At destination listing 700000...
At destination listing 710000...
At destination listing 720000...
At destination listing 730000...
At destination listing 740000...
At destination listing 750000...
At destination listing 760000...
At destination listing 770000...
At destination listing 780000...
At destination listing 790000...
At destination listing 800000...
At destination listing 810000...
At destination listing 820000...
At destination listing 830000...
At destination listing 840000...
At destination listing 850000...
At destination listing 860000...
At destination listing 870000...
At d
1698eb990c48: Waiting
4f4fb700ef54: Waiting
b4eeafcafe38: Waiting
82e225063d60: Waiting
22a020d91470: Waiting
de66f7b3f8cb: Waiting
b49768bdc151: Waiting
a66aaf282e0f: Waiting
0d29c87088e8: Waiting
c37d796e6d5f: Waiting
bbd96c0ec1ae: Waiting
7cd228bee5c5: Waiting
47061572d235: Verifying Checksum
47061572d235: Download complete
78b5b046c0b0: Verifying Checksum
78b5b046c0b0: Download complete
e4ca327ec0e7: Verifying Checksum
e4ca327ec0e7: Download complete
7cd228bee5c5: Verifying Checksum
7cd228bee5c5: Download complete
a1148b476581: Verifying Checksum
a1148b476581: D
142d1f51202b: Pull complete
be63cf966a71: Pull complete
fcf8dbce2ffa: Pull complete
f249b9a57a97: Pull complete
c9dba10588ba: Pull complete
6327f62e21a4: Pull complete
8fa100cc0e63: Pull complete
b8941442cede: Pull complete
788df75efffd: Pull complete
9956b4631a70: Pull complete
613e8fb47b11: Pull complete
68297c525da0: Pull complete
ec4c842b9cbf: Pull complete
c7b14aad8447: Pull complete
28bb32c16dbe: Pull complete
unexpected EOF
+ PULL_EXIT_CODE=1
+ '[' 1 -ne 125 ']'
+ break
+ '[' 1 -ne 0 ']'
+ exit 1
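As an aside, the long gsutil rsync command in the log above builds its -x exclusion regex from the syncFilters globs in spotty.yaml. A rough, purely illustrative sketch of that glob-to-regex translation (not Spotty's actual code, which also handles `*/...` patterns with lookaheads):

```python
import re

def glob_to_rsync_regex(patterns):
    """Hypothetical sketch: turn exclude globs like ".git/*" into a
    single anchored regex suitable for gsutil rsync -x, similar in
    spirit to the one visible in the log above."""
    parts = []
    for p in patterns:
        # escape regex metacharacters, then turn the glob "*" into ".*"
        escaped = re.escape(p).replace(r"\*", ".*")
        parts.append(escaped)
    return "^(%s)$" % "|".join(parts)

regex = glob_to_rsync_regex([".git/*", "tasks/*", "logs/*"])
print(regex)                          # ^(\.git/.*|tasks/.*|logs/.*)$
assert re.match(regex, ".git/config")
assert not re.match(regex, "setup.py")
```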
apls777 commented 2 years ago

It looks like there may not be enough memory to start this Docker image, or something similar. I get the same error, but if I run spotty start -C afterward, it works. I'll take another look tomorrow.

apls777 commented 2 years ago

@turian I used an n1-highmem-8 machine and increased --shm-size to 50G, and it worked.

turian commented 2 years ago

@apls777 Interesting. Is there a way for spotty to give more detailed errors in this case?

turian commented 2 years ago

@apls777 I followed your instructions, but I still have the same issue. Here is my updated spotty.yaml file; is this what yours looks like?

project:
  name: heareval
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - embeddings/*
        - .mypy_cache/*
        - lightning_logs/*
        - logs/*
        - heareval.egg-info/*
        - pretrained/*
        - wandb/*

containers:
  - projectDir: /workspace/project
    #file: docker/Dockerfile-cuda11.2
    image: turian/heareval
    #image: turian/heareval:cuda11.2
    ports:
      # TensorBoard
      - containerPort: 6006
        hostPort: 6006
      # Jupyter
      - containerPort: 8888
        hostPort: 8888
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '50G']
instances:
  - name: heareval-i1-joseph
    provider: gcp
    parameters:
      # https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
      zone: europe-west4-a
      #machineType: n1-highmem-4
      machineType: n1-highmem-8
      #machineType: a2-highgpu-1g
      preemptibleInstance: False
      gpu:
        type: nvidia-tesla-v100
        count: 1
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      ports: [6006, 8888]
      volumes:
        - name: workspace
          parameters:
            size: 250

and the end of the log:

1ecb75760dad: Download complete
9055f1b67372: Verifying Checksum
9055f1b67372: Download complete
31ba2cb05766: Verifying Checksum
31ba2cb05766: Download complete
9e61b31593ca: Pull complete
b4ebd545da9e: Verifying Checksum
b4ebd545da9e: Download complete
b96b1f920bc2: Download complete
82e225063d60: Verifying Checksum
82e225063d60: Download complete
cb4601dc80f0: Verifying Checksum
cb4601dc80f0: Download complete
4f4fb700ef54: Verifying Checksum
22a020d91470: Verifying Checksum
22a020d91470: Download complete
b49768bdc151: Verifying Checksum
b49768bdc151: Download complete
a66aaf282e0f: Verifying Checksum
a66aaf282e0f: Download complete
0d29c87088e8: Verifying Checksum
0d29c87088e8: Download complete
c37d796e6d5f: Verifying Checksum
c37d796e6d5f: Download complete
6ff9d09e1219: Verifying Checksum
6ff9d09e1219: Download complete
bbd96c0ec1ae: Download complete
32680e958fcb: Pull complete
1585ca522578: Pull complete
ecb2a50cdff9: Pull complete
17f1afbc84d8: Pull complete
31df848d3cfc: Verifying Checksum
31df848d3cfc: Download complete
70567fad67e7: Pull complete
d770d5f8ec7d: Pull complete
142d1f51202b: Pull complete
be63cf966a71: Verifying Checksum
be63cf966a71: Download complete
be63cf966a71: Pull complete
unexpected EOF
+ PULL_EXIT_CODE=1
+ '[' 1 -ne 125 ']'
+ break
+ '[' 1 -ne 0 ']'
+ exit 1
turian commented 2 years ago

I notice that when I change the instance name, it works. Maybe this is because it creates a new disk? Why would that help?

[the old instance disk is pretty old and is probably a few docker images old]

apls777 commented 2 years ago

Here is my updated spotty.yaml file, is this what yours looks like?

Yes, mine was exactly the same

I notice that when I change the instance name, it works. Maybe this is because it creates a new disk? Why would that help?

Not sure why that would help, to be honest. Does it work for you every time now, or do you need to rename the instance each time you start it?

apls777 commented 2 years ago

@turian Now it has started failing for me every time; I'm trying to find the issue.

apls777 commented 2 years ago

@turian Apparently, there is an issue with the GCP image. I tried the latest one, projects/ml-images/global/images/c0-deeplearning-common-cu113-v20211105-debian-10, and now it works for me again. You can also revert to the smaller VM type and --shm-size value. Please test it from your side.
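Concretely, the fix amounts to a one-line change of imageUri in the instance parameters (both image URIs appear earlier in this thread; the cu113 image presumably ships a newer containerd):

```yaml
      # old image, where docker pull died with "unexpected EOF":
      #imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      # newer image that works:
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu113-v20211105-debian-10
```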

turian commented 2 years ago

@apls777 Huh! That worked!

I'm curious: how did you suspect there was an issue with the GCP image? How did you diagnose this as a potential cause?

apls777 commented 2 years ago

Well, using the docker ps -a command I checked that the container was exiting on its own with error code 137, which usually means OOM. At first, I suspected it was a heavy Docker image that requires a lot of memory, but then I tried the tensorflow/tensorflow image instead and it gave me the same error. At that point, it was clear it wasn't OOM. I then googled what else it could be and found this issue, where people had the same problem a couple of years ago. They solved it by updating containerd to the latest version, so I tried a newer version as well and it worked :).
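For context, exit code 137 is 128 + 9: the container's main process was killed with SIGKILL, which is what the kernel OOM killer sends, hence the usual OOM reading. This is generic shell behavior, easy to verify outside Docker:

```shell
# Any process killed by signal N is reported with exit code 128 + N.
# SIGKILL is signal 9, so an OOM-killed container exits with 137.
CODE=0
bash -c 'kill -9 $$' || CODE=$?   # the child shell SIGKILLs itself
echo "exit code: $CODE"           # prints: exit code: 137
```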

Glad it worked for you, I'm closing the issue then. Feel free to reopen if it happens again.