spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

"ValidationError: Stack with id spotty-instance-my-project-i1 does not exist or has been deleted" #84

Closed turian closed 3 years ago

turian commented 3 years ago

On a new, simple spotty project, I get the following error when I run spotty start. It's very cryptic, and I can't tell what is wrong:

2020-12-21 19:22:47,946 P3569 [INFO] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2020-12-21 19:22:47,946 P3569 [INFO] Config mount_volumes
2020-12-21 19:22:47,947 P3569 [INFO] ============================================================
2020-12-21 19:22:47,948 P3569 [INFO] Command mount_volumes
2020-12-21 19:22:49,011 P3569 [INFO] -----------------------Command Output-----------------------
2020-12-21 19:22:49,011 P3569 [INFO]    + cfn-signal -e 0 --stack spotty-instance-renderman-dexed-imain --region eu-central-1 --resource MountingVolumesSignal
2020-12-21 19:22:49,011 P3569 [INFO]    + DEVICE_LETTERS=(f g h i j k l m n o p)
2020-12-21 19:22:49,011 P3569 [INFO]    + MOUNT_DIRS=("/workspace" "/docker")
2020-12-21 19:22:49,011 P3569 [INFO]    + for i in '${!MOUNT_DIRS[*]}'
2020-12-21 19:22:49,011 P3569 [INFO]    + MOUNT_DIR=/workspace
2020-12-21 19:22:49,011 P3569 [INFO]    + DEVICE=/dev/xvdf
2020-12-21 19:22:49,011 P3569 [INFO]    + '[' '!' -b /dev/xvdf ']'
2020-12-21 19:22:49,012 P3569 [INFO]    ++ cfn-get-metadata --stack spotty-instance-my-project-i1 --region us-east-2 --resource VolumeAttachmentF -k VolumeId
2020-12-21 19:22:49,012 P3569 [INFO]    ValidationError: Stack with id spotty-instance-my-project-i1 does not exist or has been deleted
2020-12-21 19:22:49,012 P3569 [INFO]    + VOLUME_ID=
2020-12-21 19:22:49,012 P3569 [INFO] ------------------------------------------------------------
2020-12-21 19:22:49,012 P3569 [ERROR] Exited with error code 1

And this is my spotty.yaml:

project:
  name: renderman-dexed
  syncFilters:
    - exclude:
        - '*.sw*'
        - '*.ipynb'
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - '.ipynb_checkpoints/*'
        - '__pycache__/*'
        - 'data/*'
        - 'preprocessed*.tgz*'
        - '*.log'

containers:
  - projectDir: /workspace/project
    file: docker/Dockerfile.spotty
#    ports:
#      # TensorBoard
#      - containerPort: 6006
#        hostPort: 6006
#      # Jupyter
#      - containerPort: 8888
#        hostPort: 8888
#      # Luigi
#      - containerPort: 8125
#        hostPort: 8125
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: imain
    provider: aws
    parameters:
      region: eu-central-1
      instanceType: m5ad.24xlarge
      spotInstance: True
#      ports: [6006, 8888]
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 1000
            deletionPolicy: retain
            mountDir: /workspace
        - name: docker
          parameters:
            size: 20
            mountDir: /docker

What does the error mean?

I connected to the instance over tmux and ran spotty start -C, which seemed to work, but when I connect again it tells me to run spotty start -C again.

turian commented 3 years ago

This is what I see after I run spotty start -C:

Container was successfully started.
Use the "spotty sh" command to connect to the container.

but when I run spotty sh I see:

Use the "spotty start -C" command to start it.

...

Pane is dead
[spotty-sh0:bash*

snowsky commented 3 years ago

It seems this command has a hardcoded stack name and region and needs an update: https://github.com/spotty-cloud/spotty/blob/2cc26389f8d3e7d901814d79ab5e1fce02871b3a/spotty/providers/aws/cfn_templates/instance/data/startup_scripts/02_mount_volumes.sh#L16
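For illustration only, here is a rough sketch of the idea, not the actual patch. In the log above, the cfn-signal call uses the stack spotty-instance-renderman-dexed-imain in eu-central-1, while the cfn-get-metadata call uses spotty-instance-my-project-i1 in us-east-2. The variable names STACK_NAME and REGION and the helper build_metadata_cmd are hypothetical. How the real script obtains these values is not shown in this thread, so they are simply set by hand here. The helper only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical values, set by hand for this sketch. They match the stack and
# region that the cfn-signal line in the log already uses.
STACK_NAME="spotty-instance-renderman-dexed-imain"
REGION="eu-central-1"

# Hypothetical helper: builds the metadata lookup from the variables above
# rather than from literal values, so it cannot drift from the instance's stack.
build_metadata_cmd() {
  printf 'cfn-get-metadata --stack %s --region %s --resource %s -k VolumeId\n' \
    "$STACK_NAME" "$REGION" "$1"
}

# Print the command that would be run for the first volume attachment.
build_metadata_cmd VolumeAttachmentF
```

With both calls reading the same two variables, the lookup would target the stack the instance belongs to, whatever the project and instance are named.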

apls777 commented 3 years ago

Thank you, @snowsky! It's fixed now in v1.3.2.