spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

Problem with config file containing multiple AWS instances for the same project #75

Closed: qraleq closed this issue 3 years ago

qraleq commented 3 years ago

Hi, I'm trying to create a Spotty config file that contains multiple AWS instance definitions for the same project. Essentially, I want to be able to run larger instances when needed, using the exact same config file.

I defined 3 AWS instances, i1, i2, and i3, which differ only in the instance type: p2.xlarge, p3.2xlarge, and p3.8xlarge respectively. When I run spotty start i1, the instance i1 is created successfully and all the commands complete. When I run spotty start i2, I get this error:

Error: Stack "spotty-instance-tf1-od-api-i2" was not created. Please, see CloudFormation logs for the details.

When I connect to the i2 instance using spotty ssh i2 and list the working directory, I get different output than when I do the same on i1. On i2 the directory structure appears to be nested: the S3 bucket content is synced into /workspace/detection/research instead of /workspace/detection.

Can you please help me figure out what is wrong and how to resolve this issue?

project:
  name: tf1-od-api
  syncFilters:
    - exclude:
        ...

container:
  projectDir: /workspace/detection
  workingDir: /workspace/detection/research
  file: research/object_detection/dockerfiles/tf1/Dockerfile.spotty
  ports: [6006, 8888]
  volumeMounts:
    - name: workspace
      mountPath: /workspace
  commands: |
    protoc object_detection/protos/*.proto --python_out=.
    cp object_detection/metrics/cocoeval.py /usr/local/lib/python3.6/dist-packages/pycocotools/
    cp /workspace/detection/trains.conf /root
    export PYTHONPATH=$PYTHONPATH:`pwd`/research:`pwd`/research/slim:`pwd`/research/slim/nets:`pwd`/research/object_detection

instances:
  - name: i1
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p2.xlarge
      dockerDataRoot: /docker
      amiId: ami-045f15870dccff376
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain
      commands: |
        $(sudo rm /usr/local/cuda)
        $(sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda)

  - name: i2
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p3.2xlarge
      dockerDataRoot: /docker
      amiId: ami-045f15870dccff376
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain
      commands: |
        $(sudo rm /usr/local/cuda)
        $(sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda)

  - name: i3
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p3.8xlarge
      dockerDataRoot: /docker
      amiId: ami-045f15870dccff376
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain
      commands: |
        $(sudo rm /usr/local/cuda)
        $(sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda)

scripts: ...
qraleq commented 3 years ago

Also, when I start the same instance using this config (with only i2 left in it), everything works fine:

project:
  name: tf1-od-api
  syncFilters:
  ...

container:
  projectDir: /workspace/detection
  workingDir: /workspace/detection/research
  file: research/object_detection/dockerfiles/tf1/Dockerfile.spotty
  ports: [6006, 8888]
  volumeMounts:
    - name: workspace
      mountPath: /workspace
  commands: |
    protoc object_detection/protos/*.proto --python_out=.
    cp object_detection/metrics/cocoeval.py /usr/local/lib/python3.6/dist-packages/pycocotools/
    cp /workspace/detection/trains.conf /root
    export PYTHONPATH=$PYTHONPATH:`pwd`/research:`pwd`/research/slim:`pwd`/research/slim/nets:`pwd`/research/object_detection

instances:
  - name: i2
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p3.2xlarge
      dockerDataRoot: /docker
      amiId: ami-045f15870dccff376
      volumes:
        - name: workspace
          parameters:
            size: 50
            deletionPolicy: retain
        - name: docker
          parameters:
            size: 10
            mountDir: /docker
            deletionPolicy: retain
      commands: |
        $(sudo rm /usr/local/cuda)
        $(sudo ln -s /usr/local/cuda-10.0 /usr/local/cuda)

scripts:
    ...
apls777 commented 3 years ago

Hi Ivan,

That's weird. The project should not be synced to /workspace/detection/research; I'm not sure what went wrong. Are you sure you're not copying those files during the Docker build (using ADD or COPY commands in the Dockerfile)?

I'm actually about to release a new version of Spotty; I just haven't had time lately to finish the tests for GCP (PR: https://github.com/spotty-cloud/spotty/pull/71). It would be great if you could try it with your project and check whether the behavior is the same.

In the new version, the format of the config file has changed slightly: https://spotty-cloud.github.io/website/docs/user-guide/configuration-file.html. In particular:

- there is now a list of containers instead of a single container,
- ports have a different format and should be specified in two places: for the instance and for the container,
- by default, Spotty starts an on-demand instance, so use spotInstance: true if you want a spot instance.

You can find the full list of changes in the PR description.
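For example, your i2 config would look roughly like this in the new format. This is only a minimal sketch to illustrate the changes above; the exact field names and ports syntax may differ, so please treat the linked configuration docs as the source of truth:

project:
  name: tf1-od-api
  ...

containers:
  - projectDir: /workspace/detection
    workingDir: /workspace/detection/research
    file: research/object_detection/dockerfiles/tf1/Dockerfile.spotty
    # container-side ports (exact syntax may differ, see the docs)
    ports: [6006, 8888]
    volumeMounts:
      - name: workspace
        mountPath: /workspace
    ...

instances:
  - name: i2
    provider: aws
    parameters:
      region: us-west-2
      instanceType: p3.2xlarge
      # instances are on-demand by default now
      spotInstance: true
      # ports now also have to be specified for the instance
      ports: [6006, 8888]
      ...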

qraleq commented 3 years ago

@apls777 Sorry for the slow response. I finally managed to test your proposal, and it worked! It seems that the problem I reported has been fixed in the new Spotty version. Thank you very much!