spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

[feature request] Support several Docker configurations in one spotty.yaml (to better support auxiliary on-demand instances) #44

Closed vadimkantorov closed 4 years ago

vadimkantorov commented 5 years ago

Downloading a dataset onto an EBS volume can take many hours, and I don't want to use a GPU machine for that.

Unfortunately, I currently cannot instruct Spotty either to skip running a Docker image entirely or to specify a Docker image per instance (the main Docker image fails to start on a t2.micro machine).

Currently I'm using a separate spotty_preprocess.yaml to achieve this goal.

apls777 commented 5 years ago

Hi Vadim,

Thank you for this feature request, I actually had a similar problem before. Sometimes you just want to run a CPU instance to do some work that doesn't require GPU, for example, analyze your results with Jupyter notebooks, and then you may need a different image that was compiled for CPU.

I was thinking of extending the configuration file so that you could specify several containers and then assign them to instances. Instead of the container parameter, you would use a containers parameter and define several named containers:

containers:
  default:
    image: tensorflow/tensorflow:1.14.0-gpu-py3-jupyter
    # ...

  cpu:
    image: tensorflow/tensorflow:1.14.0-py3-jupyter
    # ...

And then in the instance parameters, you can redefine the default container:

instances:
  - name: i1
    provider: aws
    parameters:
      container: cpu
      # ...

As for your case: by design, you're actually not supposed to do any work from the host OS :). All the work should be done through your custom environment - a Docker container. The -H flag of the spotty ssh command is rather for debugging purposes. With a container, you could also define a custom script in the scripts parameter to download your dataset. Then you will never forget where it's stored and what command to use to download it again.
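For example (the script name, bucket, and paths below are illustrative), such a download script could live in spotty.yaml and be executed with spotty run:

```yaml
scripts:
  # runs inside the container, e.g. via "spotty run download"
  download: |
    aws s3 sync s3://my-bucket/my-dataset /workspace/data
```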

I believe (but need to double-check) that even a t2.micro instance is able to run some lightweight container that contains only Bash and the AWS CLI (or whatever you're using). But, of course, in theory it's possible to have an option in the config file to launch an instance without a container at all. Even in that scenario, though, I don't see why you would need another AMI without Docker installed.
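A minimal sketch of such a lightweight container config (the image and the commands parameter usage here are just an illustration; any small image with Bash would do):

```yaml
container:
  projectDir: /workspace/project
  image: ubuntu:18.04
  # install the AWS CLI inside the plain Ubuntu image on startup
  commands: |
    apt-get update && apt-get install -y awscli
```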

Did you try using some lightweight container with a t2.micro instance? Or are you using your main container with a more powerful CPU instance? What do you think about the logic with named containers I described above?

P.S. I'm on vacation for the next 2 weeks, so I will only be able to give this proper thought and work on the feature when I come back :).

Best regards, Oleg

vadimkantorov commented 5 years ago

@apls777 Thanks for the detailed response! I did try using a smaller instance. It even managed to boot up with the ubuntu:18.04 container and the Deep Learning AMI. Unfortunately, it was unbearably slow (too little memory, I guess).

Even with GPU instances, Docker sometimes takes 5-10 minutes to start up (the DockerReadyWait event).

For this use case of launching a light CPU instance, an option to launch a regular On-Demand instance would probably be useful (downloading a huge dataset onto EBS can take many hours).
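Something like the following in the instance parameters would cover it (the onDemandInstance flag and instance name here are hypothetical, just to sketch the idea):

```yaml
instances:
  - name: downloader
    provider: aws
    parameters:
      instanceType: t2.micro
      # hypothetical flag: run a regular On-Demand instance instead of a Spot one
      onDemandInstance: true
```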

A few other points I noticed: for some reason you don't support Nitro-based instances. I commented out the check and everything still worked OK. Also, the hardcoded list of instance types has become out of date; some instance types are missing.

Also, sometimes Docker just wouldn't start (with no useful error messages in CloudFormation). As a remedy, I disabled the docker EBS volume, and things then run normally (for a CPU instance that's probably OK).

apls777 commented 5 years ago

It even managed to boot up with ubuntu:18.04 container and Deep Learning AMI. Unfortunately it was unbearable slow (too little memory, I guess).

Was it too slow to work with, or did it just take a long time to start up?

Even with GPU instances, Docker sometimes takes 5-10 minutes to start-up (DockerReadyWait event).

I think it's a "bug" that I found recently as well. If you have a lot of files on one of your EBS volumes, the instance may take a long time to start. When Spotty mounts volumes, it changes the ownership of all files from root to ubuntu, which is a completely useless operation. Please try removing line 212 from the instance CF template: https://github.com/apls777/spotty/blob/185968fa26bae14da9127bbd53c05eee6068ec7b/spotty/providers/aws/deployment/cf_templates/data/instance.yaml#L212

The instance should then start up much faster.
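For context, the removed line is, in essence, a recursive re-own of the volume's mount point, so its cost grows with the number of files on the volume. A self-contained illustration (the real template runs as root over the actual EBS mount directory):

```shell
# Stand-in for the EBS mount point; here we re-own to the current user
# so the example runs without root privileges.
MOUNT_DIR=$(mktemp -d)
touch "$MOUNT_DIR/sample.dat"
# The expensive part: chown walks every single file on the volume.
chown -R "$(id -u):$(id -g)" "$MOUNT_DIR"
```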

for some reason you do not support Nitro-based instances. I commented out the check and everything still worked OK.

Nitro-based instances have different device names for attached EBS volumes: see here vs here. At the moment, a Nitro-based instance would fail to start if you attached any volume, so I decided to disable this functionality entirely for now. It's on the TODO list, though.

Also the list of instance types that you hardcoded became out-of-date, some instance types are missing.

Thanks for pointing that out. I'll try to load this list dynamically using the AWS API; otherwise, it's difficult to keep it up to date.

Also sometimes, Docker just wouldn't start (no useful error messages on CloudFormation). As a remedy I disabled the docker EBS volume, and things run normally then (for the CPU instance it's probably OK).

Can it be the issue with changing ownership that I described above?

vadimkantorov commented 5 years ago

The t2.micro machine took many minutes to start Docker, and then ssh was super unresponsive. Bigger machines also took a long time to start Docker, but then there were no problems with ssh.

About Docker not starting: it's hard to say. The CloudFormation log just spits out some "unique error ID", which is ungoogleable without an additional error message.

I'll try commenting out the chown line. Thanks!

apls777 commented 5 years ago

Also, do you use a custom Dockerfile or an already-built image? If you're using a Dockerfile, Spotty builds the image every time it starts an instance. In that case, you may want to consider caching it using a dedicated EBS volume: see here.
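From memory of the docs (please double-check the parameter names against them), the cache setup is roughly a dedicated volume plus a dockerDataRoot path pointing inside its mount directory:

```yaml
instances:
  - name: i1
    provider: aws
    parameters:
      # tell Docker to keep its data (including built images) on the volume below
      dockerDataRoot: /docker
      volumes:
        - name: docker
          parameters:
            size: 20              # GB; images can easily outgrow a small volume
            mountDir: /docker
            deletionPolicy: retain
```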

vadimkantorov commented 5 years ago

A prebuilt image: just ubuntu:18.04. Somehow, caching it on a separate volume seems to have caused the docker build to fail (maybe an insufficient volume size of 10 GB?) on some instance types. But I haven't thoroughly confirmed this hypothesis.

apls777 commented 5 years ago

Okay, thanks. I'll check what I can do about micro instances once I come back from vacation.

Somehow caching it using a separate volume seems to have caused the failing docker build

If the instance launches but the Docker container doesn't start for some reason, try connecting to the host OS using the -H flag and check the CloudFormation logs there: the /var/log/cfn-init-cmd.log and /var/log/cfn-init.log files.

vadimkantorov commented 5 years ago

Thanks! The errors I referred to were from the CloudFormation web console, maybe the local logs have more information.

vadimkantorov commented 5 years ago

It would be super useful if the CloudFormation logs were automatically downloaded and offered to the user (no hassle with manual ssh'ing as in https://github.com/apls777/spotty/issues/48)

vadimkantorov commented 5 years ago

Btw, removing the chown was crucial: my dataset drive contains a terabyte of small audio files, and chown'ing them takes forever.

My fork is at https://github.com/vadimkantorov/spotty

vadimkantorov commented 4 years ago

Hi @apls777! Any news about this one?

apls777 commented 4 years ago

Hi @vadimkantorov, unfortunately I don't have time to work on this feature at the moment, but I may have some in April.

apls777 commented 4 years ago

Added support for multiple container configurations.