spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

"ResourceErrorMessage":"Failure condition satisfied." #120

Open turian opened 2 years ago

turian commented 2 years ago

Ugh, #103 and #108 appear to be back with the latest images. Can you please help? I believe I have adhered to all the improvements we learned in the previous threads.

spotty.yaml:

project:
  name: sss
  syncFilters:
    - exclude:
        - '*.ipynb'
        - '*.log'
        - '*.sw*'
        - '*/__pycache__/*'
        - '.ipynb_checkpoints/*'
        - '__pycache__/*'
        - .git/*
        - .idea/*
        - .mypy_cache/*
        - lightning_logs/*
        - local.py
        - wandb/*

containers:
  - projectDir: /workspace/project
    image: turian/heareval
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '50G']

instances:
  - name: spotty-sss-i1
    provider: gcp
    parameters:
      # https://cloud.google.com/compute/docs/gpus/gpu-regions-zones
      zone: europe-west4-b
      machineType: n1-standard-4
      preemptibleInstance: True
      gpu:
        type: nvidia-tesla-t4
        count: 1
      imageUri: projects/ml-images/global/images/c0-deeplearning-common-cu110-v20221107-debian-10
      volumes:
        - name: workspace
          parameters:
            size: 250
            mountDir: /workspace

This config gives:

  Error:
  ------
  Deployment "spotty-instance-sss-spotty-sss-i1" failed.
  Error: {"ResourceType":"runtimeconfig.v1beta1.waiter","ResourceErrorCode":"412","ResourceErrorMessage":"Failure condition satisfied."}
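
For anyone triaging this, the error payload is plain JSON and can be pulled apart as in the minimal sketch below. The helper name summarize_deployment_error is hypothetical and not part of Spotty. The resource type suggests the failure was reported by a Runtime Configurator waiter rather than by the instance creation itself, but that reading is an assumption and is not confirmed anywhere in this thread.

```python
import json

# Error payload copied verbatim from the deployment failure above.
RAW_ERROR = (
    '{"ResourceType":"runtimeconfig.v1beta1.waiter",'
    '"ResourceErrorCode":"412",'
    '"ResourceErrorMessage":"Failure condition satisfied."}'
)


def summarize_deployment_error(payload: str) -> str:
    """Return a one-line summary of a deployment error payload.

    Hypothetical helper for illustration only; it just reads the three
    fields that appear in the error shown in this issue.
    """
    err = json.loads(payload)
    return (
        f'{err["ResourceType"]} '
        f'[{err["ResourceErrorCode"]}]: '
        f'{err["ResourceErrorMessage"]}'
    )


print(summarize_deployment_error(RAW_ERROR))
```

A one-line summary like this is easier to search for across the earlier threads (#103, #108) than the raw JSON.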

turian commented 2 years ago

I tried again with the older imageUri projects/ml-images/global/images/c0-deeplearning-common-cu113-v20211105-debian-10, but the same error occurs.

On this VM, I ran spotty sh -H, but /var/log/startup-script.log is no longer there.