spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License
491 stars, 43 forks

"ResourceErrorMessage":"Failure condition satisfied." #103

Closed turian closed 3 years ago

turian commented 3 years ago

What causes this error?

Waiting for the stack to be created...

project:
  name: spotty-heareval
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir
        - embeddings

containers:
  - projectDir: /workspace/project
    image: turian/heareval-v100-cu110
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: spotty-heareval-i1
    provider: gcp
    parameters:
      # https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
      zone: europe-west4-a
      machineType: n1-standard-1
      gpu:
        type: nvidia-tesla-v100
        count: 1
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      # https://github.com/spotty-cloud/spotty/issues/102
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 250
# Not implemented for GCP, all volumes will be retained
#            deletionPolicy: retain
            mountDir: /workspace
        - name: docker
          parameters:
            size: 200
            mountDir: /docker

# Pick SR!
apls777 commented 3 years ago

Looks like the Docker image failed to run. Are you sure the turian/heareval-v100-cu110 image is publicly available? I cannot find it on Docker Hub. If it is private, you need to log in to your Docker account first using the commands parameter (see how I did it for AWS ECR here).
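For readers following along, a login step for a private Docker Hub image might look roughly like the sketch below. It assumes that the commands parameter sits under the instance parameters and runs on the host before the container starts, and that the GCP provider accepts it the same way as in the AWS ECR example mentioned above. The user name and access token are hypothetical placeholders, not values from this thread.

```yaml
instances:
  - name: spotty-heareval-i1
    provider: gcp
    parameters:
      zone: europe-west4-a
      # Assumption: "commands" is executed on the host OS before the
      # container is started, so the private image can then be pulled.
      # YOUR_DOCKERHUB_USER and YOUR_ACCESS_TOKEN are placeholders.
      commands: |
        echo "YOUR_ACCESS_TOKEN" | docker login --username YOUR_DOCKERHUB_USER --password-stdin
```

Keeping a token in a config file is only reasonable for a throwaway access token. Making the image public, as was done in this thread, avoids the login step entirely.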

turian commented 3 years ago

Thanks. I've been experimenting with different Docker images, and I forgot to make these ones public on Docker Hub.

Could we make the error message more descriptive somehow?

turian commented 3 years ago

I am reopening because a spotty.yaml that was working all day has mysteriously stopped working. The Docker image is public:

Waiting for the stack to be created...
  - launching the instance...
  - running the Docker container...
  Error:
  ------
  Deployment "spotty-instance-spotty-heareval-spotty-heareval-i1" failed.
  Error: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"412","ResourceErrorMessage":"Failure condition satisfied."}

project:
  name: spotty-heareval
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - tasks/*
        - embeddings/*
        - .mypy_cache/*
        - lightning_logs/*
        - heareval.egg-info/*
        - wandb/*

containers:
  - projectDir: /workspace/project
    #file: Dockerfile
    image: turian/heareval
    #image: turian/heareval-v100-cu110
    #image: turian/heareval-v100
    #image: turian/heareval-a100
#    ports:
#      # TensorBoard
#      - containerPort: 6006
#        hostPort: 6006
#      # Jupyter
#      - containerPort: 8888
#        hostPort: 8888
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: spotty-heareval-i1
    provider: gcp
    parameters:
      # https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
      zone: europe-west4-a
      # V100
      machineType: n1-standard-4
      # One CPU seems to exhaust memory and crash
      #machineType: n1-standard-1
      ## A100. only west3 or maybe west4? :\
      #machineType: a2-highgpu-1g
      gpu:
        type: nvidia-tesla-v100
        #type: nvidia-tesla-a100
        count: 1
      # https://console.cloud.google.com/compute/images?project=hear2021-evaluation
      # https://github.com/spotty-cloud/spotty/issues/102
      #imageUri: projects/ml-images/global/images/family/common-dl-gpu-debian-10
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
#      spotInstance: True
#      ports: [6006, 8888]
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 250
# Not implemented for GCP, all volumes will be retained
#            deletionPolicy: retain
            mountDir: /workspace
        - name: docker
          parameters:
            size: 200
            mountDir: /docker

turian commented 3 years ago

I also get this error if I build from my local Dockerfile instead of using the Docker Hub image.

apls777 commented 3 years ago

It could be something related to the Docker cache. Try removing "dockerDataRoot: /docker" and the "docker" volume parameters from the config. Also delete the docker volume (disk) itself in the GCP console, as Spotty will no longer use it.

turian commented 3 years ago

Okay. I deleted the docker disk on GCP.

"Try to remove 'dockerDataRoot: /docker' and the 'docker' volume parameters." Do you mean remove them and then add them back? Or, if I remove them, how do I get a docker volume?

apls777 commented 3 years ago

No, just remove those parameters from the config. They are optional. They sometimes help to speed up instance launching if you're building a Docker image from a Dockerfile on the instance.

apls777 commented 3 years ago

"if I remove it, how do I get a docker volume?"

You don't need a docker volume. By default, Docker keeps its files on the root volume, which is deleted every time you restart an instance.
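As a sketch of the suggested change, the instances section from the config above would look like this once "dockerDataRoot" and the "docker" volume are removed. All values are copied from the config posted earlier in this thread, and the commented-out alternatives are omitted for brevity.

```yaml
instances:
  - name: spotty-heareval-i1
    provider: gcp
    parameters:
      zone: europe-west4-a
      machineType: n1-standard-4
      gpu:
        type: nvidia-tesla-v100
        count: 1
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20210818-debian-10
      # dockerDataRoot removed: Docker now keeps its files on the root volume
      volumes:
        - name: workspace
          parameters:
            size: 250
            mountDir: /workspace
        # the "docker" volume entry is removed as well
```

The trade-off is that the Docker cache no longer survives an instance restart, so the image is pulled or built again on each launch.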

turian commented 3 years ago

Great, that worked!