spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License
491 stars 43 forks

Stuck at "- launching the instance..." #90

Closed by aeon0 3 years ago

aeon0 commented 3 years ago

I have no issues creating t2 spot instances, e.g. t2.medium, but for p3.2xlarge spotty gets stuck at "- launching the instance...".

Here is my config:

project:
  name: cvms
  syncFilters:
    - exclude:
      - .git/*
      - .idea/*
      - '*/__pycache__/*'
      - tmp/*
      - dependencies/*

containers:
  - projectDir: /workspace/computer-vision-models
    hostNetwork: true
    file: Dockerfile
    env:
      PYTHONPATH: /workspace/computer-vision-models
    volumeMounts:
      - name: workspace
        mountPath: /workspace

instances:
  - name: train
    provider: aws
    parameters:
      region: eu-west-2
      instanceType: p3.2xlarge
      availabilityZone: eu-west-2a
      spotInstance: true
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 30
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 30
            mountDir: /docker
            deletionPolicy: retain

scripts:
  train: |
    python models/semseg/train.py
  tensorboard: |
    tensorboard --bind_all --port 6006 --logdir /workspace/computer-vision-models/trained_models

Any idea what could go wrong here?

apls777 commented 3 years ago

This most likely means that there are no available spot instances at the moment. You can try to change the availability zone, region, or run an on-demand instance.

aeon0 commented 3 years ago

On-demand works, but it's a bit pricey ;)

What confuses me is that it does not even get to the point where the actual spot request is created. Also, when I request a p3.2xlarge spot instance manually, the instance is created without problems.

aeon0 commented 3 years ago

Found the issue. My current vCPU limit for "P Spot Instances" is 4, and p3.2xlarge has 8 vCPUs. p2.xlarge, with 4 vCPUs, works.
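To make the arithmetic explicit, here is a minimal sketch of the check that seems to be failing. It only uses the numbers quoted in this thread. The function name `fits_spot_quota` and the `vcpus_in_use` parameter are made up for illustration and are not part of spotty or any AWS SDK. It also assumes the limit counts vCPUs across all running and requested spot instances of the family.

```python
# vCPU counts as quoted in this thread.
INSTANCE_VCPUS = {"p2.xlarge": 4, "p3.2xlarge": 8}


def fits_spot_quota(instance_type, quota_vcpus, vcpus_in_use=0):
    """Return True if one more instance of this type stays within the vCPU quota.

    Hypothetical helper: `vcpus_in_use` is the number of vCPUs already taken
    by other spot instances that count against the same quota.
    """
    needed = INSTANCE_VCPUS[instance_type]
    return vcpus_in_use + needed <= quota_vcpus


P_SPOT_QUOTA = 4  # the "P Spot Instances" limit reported in this thread

for name in sorted(INSTANCE_VCPUS):
    verdict = "fits" if fits_spot_quota(name, P_SPOT_QUOTA) else "exceeds the limit"
    print(f"{name}: {INSTANCE_VCPUS[name]} vCPUs, {verdict}")
```

With a limit of 4, only p2.xlarge passes the check. p3.2xlarge would need the limit raised to at least 8.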

I requested a limit increase and hope that will resolve it.

Edit: actually, I am not sure that was the reason on eu-west-2. But after switching to eu-west-1, it seems that the limit is the issue there. For the time being I can at least use p2.xlarge.