spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

Error when trying the default config #66

anmoljagetia closed this issue 4 years ago

anmoljagetia commented 4 years ago
Creating IAM role for the instance...
Preparing CloudFormation template...
  - volume "lsma-hw1-i1-workspace" will be created
  - volume "lsma-hw1-i1-docker" will be created
  - availability zone: auto
  - maximum Spot Instance price: on-demand
  - AMI: "Deep Learning AMI (Ubuntu 16.04) Version 26.0" (ami-025ed45832b817a35)
  - Docker data will be stored on the "docker" volume

Volumes:
+-----------+---------------+------------+-----------------+
| Name      | Container Dir | Type       | Deletion Policy |
+===========+===============+============+=================+
| workspace | /workspace    | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+
| docker    | -             | EBS volume | Retain Volume   |
+-----------+---------------+------------+-----------------+

Waiting for the stack to be created...
  - launching the instance...
  - waiting for the Docker container to be ready...
Error:
------
Stack "spotty-instance-lsma-hw1-i1" was not created.
Please, see CloudFormation logs for the details.

What is the recommended image for PyTorch? I can't find any recommendations. I also tried the default TensorFlow image, but I get the same error. I suspect the cause is the following error, which I found in the CloudFormation logs by running this on the instance:

ubuntu@ip-172-31-84-193:~$  sudo tail /var/log/cfn-init-cmd.log
2020-02-09 01:03:12,195 P1844 [INFO]    + MOUNT_DIRS=("/mnt/lsma-hw1-i1-workspace" "/docker")
2020-02-09 01:03:12,195 P1844 [INFO]    + for i in '${!MOUNT_DIRS[*]}'
2020-02-09 01:03:12,195 P1844 [INFO]    + DEVICE=/dev/xvdf
2020-02-09 01:03:12,195 P1844 [INFO]    + MOUNT_DIR=/mnt/lsma-hw1-i1-workspace
2020-02-09 01:03:12,195 P1844 [INFO]    + blkid -o value -s TYPE /dev/xvdf
2020-02-09 01:03:12,195 P1844 [INFO]    + mkfs -t ext4 /dev/xvdf
2020-02-09 01:03:12,195 P1844 [INFO]    mke2fs 1.42.13 (17-May-2015)
2020-02-09 01:03:12,195 P1844 [INFO]    The file /dev/xvdf does not exist and no size was specified.
2020-02-09 01:03:12,195 P1844 [INFO] ------------------------------------------------------------
2020-02-09 01:03:12,195 P1844 [ERROR] Exited with error code 1

Do you know why this could be happening?

My spotty config is the following:

project:
  name: test-hw1
  syncFilters:
    - exclude:
      - .git/*
      - .idea/*
      - '*/__pycache__/*'

container:
  projectDir: /workspace/project
  image: tensorflow/tensorflow:latest-gpu-py3-jupyter
  ports: [6006, 8888]
  volumeMounts:
    - name: workspace
      mountPath: /workspace

instances:
  - name: i1
    provider: aws
    parameters:
      region: us-east-1
      instanceType: g4dn.xlarge
      dockerDataRoot: /docker
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain

scripts:
  jupyter: |
    jupyter notebook --allow-root --ip 0.0.0.0 --notebook-dir=/workspace/project

anmoljagetia commented 4 years ago

Also, the same config works with a p2.xlarge instance but not with g4dn.xlarge. Do you have a list of which instance types are supported and which are not?

apls777 commented 4 years ago

There is no recommended image for PyTorch; just use the latest one or whatever suits you.

All G4 instances are Nitro-based instances and, unfortunately, they're not supported right now (see this issue). Spotty has a hard-coded blacklist of such instance types and is supposed to show you an error, but G4 instances are new and I haven't added them yet. You can see the full list here: Nitro-based Instances.
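
A blacklist check of this kind could look roughly like the sketch below. This is not Spotty's actual code: the function names are hypothetical, and the family set is a small, hand-picked sample rather than the full AWS list of Nitro-based instances. The idea is to fail before any CloudFormation stack is created, instead of failing later at `mkfs`.

```python
# Illustrative sketch only, not Spotty's implementation.
# Assumption: this set is a partial sample of Nitro-based families;
# the authoritative list is in the AWS documentation.
NITRO_FAMILIES_SAMPLE = frozenset(
    {"c5", "c5d", "c5n", "m5", "m5d", "r5", "r5d", "t3", "g4dn", "p3dn"}
)


def instance_family(instance_type: str) -> str:
    """Return the family part of an EC2 instance type ('g4dn' for 'g4dn.xlarge')."""
    family, sep, size = instance_type.strip().lower().partition(".")
    if not family or not sep or not size:
        raise ValueError(f"not a valid instance type: {instance_type!r}")
    return family


def check_instance_type(instance_type: str) -> None:
    """Raise ValueError if the instance type belongs to a blacklisted family."""
    family = instance_family(instance_type)
    if family in NITRO_FAMILIES_SAMPLE:
        raise ValueError(
            f"{instance_type} is a Nitro-based instance type ({family} family) "
            "and is not supported"
        )


if __name__ == "__main__":
    check_instance_type("p2.xlarge")  # passes silently
    try:
        check_instance_type("g4dn.xlarge")
    except ValueError as error:
        print(error)
```

Matching on the family prefix rather than on full type names means a new size in a known family is covered automatically, but a brand-new family (as G4 was here) still has to be added by hand.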

apls777 commented 4 years ago

Now Spotty supports Nitro-based instances.
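
For context on the original failure: on Nitro-based instances, EBS volumes are exposed as NVMe block devices, so the `/dev/xvdf` path requested in the template may not exist, which is what the `mkfs` error in the log shows. The sketch below is one way a startup script could locate the volume; it is not how Spotty does it. The function names are hypothetical, and the `/dev/disk/by-id` link format is an assumption about the udev rules on common Linux AMIs.

```python
import os
from typing import Callable


def nvme_by_id_path(volume_id: str) -> str:
    """Build the by-id symlink path for an EBS volume attached as NVMe.

    Assumption: udev creates a link named after the volume ID with the
    hyphen removed, as is common on Linux AMIs.
    """
    if not volume_id.startswith("vol-"):
        raise ValueError(f"not an EBS volume ID: {volume_id!r}")
    return "/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_" + volume_id.replace("-", "")


def resolve_device(
    requested: str,
    volume_id: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Return the first existing device path for the volume.

    Tries the requested name first (for example /dev/xvdf on Xen-based
    instances), then falls back to the NVMe by-id link.
    """
    for candidate in (requested, nvme_by_id_path(volume_id)):
        if exists(candidate):
            return candidate
    raise FileNotFoundError(
        f"no block device found for {volume_id} (tried {requested} and the NVMe by-id link)"
    )
```

The `exists` parameter is only there so the lookup can be exercised without real devices; on an instance, the default `os.path.exists` would be used.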