spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

ValidationError: Resource PreparingInstanceSignal does not exist for stack #107

Closed. turian closed this issue 2 years ago.

turian commented 2 years ago

I haven't used spotty on AWS for a while. Now, I am getting the following error.

How do I resolve this?

  - preparing the instance...
Error:
------
Stack "spotty-instance-asdf-i1-joseph" was not created.
Please, see the logs for the details:
  /var/folders/k1/rrn6shl5157gtzyl5bbkf5gw0000gn/T/tmpvsmm1v_t/cfn-init-cmd.log

The log says:

2021-09-30 04:59:40,516 P2122 [INFO] ************************************************************
2021-09-30 04:59:40,516 P2122 [INFO] ConfigSet init
2021-09-30 04:59:40,518 P2122 [INFO] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2021-09-30 04:59:40,518 P2122 [INFO] Config prepare_instance
2021-09-30 04:59:40,521 P2122 [INFO] ============================================================
2021-09-30 04:59:40,521 P2122 [INFO] Command prepare_instance
2021-09-30 04:59:40,807 P2122 [INFO] -----------------------Command Output-----------------------
2021-09-30 04:59:40,808 P2122 [INFO]    + cfn-signal -e 0 --stack spotty-instance-asdf-i1-joseph --region eu-central-1 --resource PreparingInstanceSignal
2021-09-30 04:59:40,808 P2122 [INFO]    ValidationError: Resource PreparingInstanceSignal does not exist for stack spotty-instance-asdf-i1-joseph
2021-09-30 04:59:40,808 P2122 [INFO] ------------------------------------------------------------
2021-09-30 04:59:40,808 P2122 [ERROR] Exited with error code 1

apls777 commented 2 years ago

@turian I haven't seen this issue before. Have you tried starting the instance again? If it fails with this error every time, can you please send me your config? Also, can you please check the stack errors in CloudFormation?

turian commented 2 years ago

Here is my config:

project:
  name: asdf
  syncFilters:
    - exclude:
        - .idea/*
        - .git/*
        - '*/__pycache__/*'
        - _workdir/*
        - .mypy_cache/*
        - lightning_logs/*
        - logs/*

containers:
  - projectDir: /workspace/project
    image: turian/heareval
    ports:
      # TensorBoard
      - containerPort: 6006
        hostPort: 6006
      # Jupyter
      - containerPort: 8888
        hostPort: 8888
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    runtimeParameters: ['--shm-size', '20G']

instances:
  - name: i1-joseph
    provider: aws
    parameters:
      region: eu-central-1
      instanceType: p3.2xlarge
      spotInstance: True
      ports: [6006, 8888]
      volumes:
        - name: workspace
          parameters:
            size: 500
            deletionPolicy: retain
            mountDir: /workspace

In the AWS Management Console, CloudFormation shows the stack status as "CREATE_COMPLETE".

turian commented 2 years ago

I tried again, and now I am getting this error:

2021-10-01 12:16:39,189 P2135 [INFO]    failed to register layer: Error processing tar file(exit status 1): write /usr/local/lib/python3.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so: no space left on device

I tried again with these volumes, but it didn't work. I still get the disk space error:

      volumes:
        - name: workspace
          parameters:
            size: 500
            deletionPolicy: retain
            mountDir: /workspace
        - name: docker
          parameters:
            mountDir: /docker
            size: 200

apls777 commented 2 years ago

@turian Try increasing the root volume size: add rootVolumeSize: 100 to the instance parameters.
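
For reference, here is a sketch of how that suggestion might look when applied to the instance section of the config posted earlier in this thread. The placement under `parameters` is an assumption based on the phrase "instance parameters", and the unit of the value is assumed to be GB; neither is confirmed in the thread. The "no space left on device" error appeared while Docker was extracting an image layer, so the root volume is presumably where the image was being written.

```yaml
instances:
  - name: i1-joseph
    provider: aws
    parameters:
      region: eu-central-1
      instanceType: p3.2xlarge
      spotInstance: True
      # Suggested fix from this thread; assumed to sit alongside the other
      # instance parameters, with the size assumed to be in GB.
      rootVolumeSize: 100
      ports: [6006, 8888]
      volumes:
        - name: workspace
          parameters:
            size: 500
            deletionPolicy: retain
            mountDir: /workspace
```

The rest of the config (project and containers sections) would stay unchanged.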