spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License
493 stars, 43 forks

Error while creating IAM role for instance #31

Closed: ZohaibAhmed closed this issue 5 years ago

ZohaibAhmed commented 5 years ago

Hi,

Coming back to this project after a while. I'm following the tutorial here: https://towardsdatascience.com/how-to-train-deep-learning-models-on-aws-spot-instances-using-spotty-8d9e0543d365

I get an error when it tries to create the AMI.

Tacotron-2 at master ✖ spotty create-ami
Waiting for the AMI to be created...
  - creating IAM role for the instance...
Error:
------
Stack "spotty-nvidia-docker-ami-qgogoanj" was not created.
See CloudFormation and CloudWatch logs for details.

apls777 commented 5 years ago

Hi @ZohaibAhmed,

Could you please go to the CloudFormation service in your AWS account, find the AMI stack (it is named "spotty-nvidia-docker-ami-xxxxxxxx"), open it, and check the error message in the "Events" tab?

ZohaibAhmed commented 5 years ago

@apls777 It shows that there are no stacks (failed, deleted, or otherwise)? Seems like spotty didn't get around to creating the stack at all?

apls777 commented 5 years ago

@ZohaibAhmed make sure you are looking for the stack in the right region. It should be there.

ZohaibAhmed commented 5 years ago

@apls777 Got it, was looking at the default region, not the one in the spotty config. Here's the error:

`CREATE_FAILED AWS::Lambda::Function SetLogsRetentionFunction The runtime parameter of nodejs4.3 is no longer supported for creating or updating AWS Lambda functions. We recommend you use the new runtime (nodejs8.10) while creating or updating functions. (Service: AWSLambdaInternal; Status Code: 400; Error Code: InvalidParameterValueException; Request ID: bb4a9a9b-3ab9-11e9-b03f-151ba91a4517)`
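
As a side note, the row above is what the stack's "Events" tab shows for a failed resource. A minimal sketch of pulling such rows out of a list of event records could look like the following. It assumes each record is a dict with the keys `ResourceStatus`, `ResourceType`, `LogicalResourceId` and `ResourceStatusReason` (the field names CloudFormation uses for stack events). The helper `failed_events` and the sample records are hypothetical and not part of spotty.

```python
from typing import Iterable, List, Mapping


def failed_events(events: Iterable[Mapping[str, str]]) -> List[str]:
    """Return one summary line per event whose status ends in _FAILED."""
    lines = []
    for event in events:
        status = event.get("ResourceStatus", "")
        if status.endswith("_FAILED"):
            # Same column order as the row quoted in this thread:
            # status, resource type, logical ID, reason.
            parts = [
                status,
                event.get("ResourceType", ""),
                event.get("LogicalResourceId", ""),
                event.get("ResourceStatusReason", ""),
            ]
            lines.append(" ".join(p for p in parts if p))
    return lines


# Hypothetical sample records; the failed one is abridged from the error above.
sample = [
    {
        "ResourceStatus": "CREATE_IN_PROGRESS",
        "ResourceType": "AWS::Lambda::Function",
        "LogicalResourceId": "SetLogsRetentionFunction",
    },
    {
        "ResourceStatus": "CREATE_FAILED",
        "ResourceType": "AWS::Lambda::Function",
        "LogicalResourceId": "SetLogsRetentionFunction",
        "ResourceStatusReason": "The runtime parameter of nodejs4.3 is no longer supported",
    },
]

for line in failed_events(sample):
    print(line)
```

With boto3 installed, the records could come from `boto3.client("cloudformation", region_name=...).describe_stack_events(StackName=...)["StackEvents"]`, where the region has to be the one from the spotty config rather than the account default.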

apls777 commented 5 years ago

@ZohaibAhmed it seems you are using an old version of the tool. Run `pip install -U spotty` to update it.

ZohaibAhmed commented 5 years ago

Thanks. It also seems I needed to request an increase for the particular instance type I was allocating.

apls777 commented 5 years ago

@ZohaibAhmed What do you mean by "request an increase"?

ZohaibAhmed commented 5 years ago

I had a limit on the particular type of instance I wanted to use (by default it was 0). I had to open a case with Amazon to raise it (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html).

ZohaibAhmed commented 5 years ago

@apls777 I don't want to open another issue on this repo for this, but all of a sudden I get this error when I run spotty create-ami:

ract_config.py:46: YAMLLoadWarning:
  *** Calling yaml.load() without Loader=... is deprecated.
  *** The default Loader is unsafe.
  *** Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
Waiting for the AMI to be created...
  - creating IAM role for the instance...
  - launching the instance...
  - installing NVIDIA Docker...
Error:
------
Stack "spotty-nvidia-docker-ami-hc4o3367" was not created.
See CloudFormation and CloudWatch logs for details.

Any ideas?

EDIT: More logs I dug up:

[  212.678177] cloud-init[2664]: Error occurred during build: Command run_init failed
[  212.690100] cloud-init[2664]: + INIT_EXIT_CODE=1
[  212.690362] cloud-init[2664]: + /usr/local/bin/cfn-signal -e 1 --stack spotty-nvidia-docker-ami-sdrkb2nt --region us-west-2 --resource InstanceReadyWaitCondition
[  212.928861] cloud-init[2664]: + [[ 1 -ne 0 ]]
[  212.929111] cloud-init[2664]: + exit 1
[  212.933173] cloud-init[2664]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'modules:final' at Fri, 01 Mar 2019 05:23:51 +0000. Up 19.41 seconds.
[  212.933355] cloud-init[2664]: 2019-03-01 05:27:04,492 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
[  212.946474] cloud-init[2664]: 2019-03-01 05:27:04,505 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[  212.957059] cloud-init[2664]: 2019-03-01 05:27:04,516 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
ci-info: no authorized ssh keys fingerprints found for user ubuntu.

apls777 commented 5 years ago

@ZohaibAhmed Apparently, a new version of Docker CE was released, but NVIDIA Docker depends on the previous version. There is a new issue for this. I'll fix the problem today.
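
For anyone reading a similar console log, the cloud-init lines quoted earlier in this thread carry two useful pieces of information: the exit code passed to `cfn-signal -e`, and the lines that report a failure. A minimal sketch of extracting both is shown below. The helper `summarize` is hypothetical, and the prefix pattern is an assumption based only on the lines quoted in this thread, so other log formats may need a different pattern.

```python
import re
from typing import List, Optional, Tuple

# Console prefix such as "[  212.678177] cloud-init[2664]: " (assumed format).
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s+cloud-init\[\d+\]:\s+")
# Exit code handed to cfn-signal through its -e option.
EXIT_CODE = re.compile(r"cfn-signal -e (\d+)")


def summarize(log: str) -> Tuple[Optional[int], List[str]]:
    """Return (exit code signalled to CloudFormation, lines reporting a problem)."""
    exit_code = None
    problems = []
    for raw in log.splitlines():
        line = PREFIX.sub("", raw.strip())
        match = EXIT_CODE.search(line)
        if match:
            exit_code = int(match.group(1))
        if "[WARNING]" in line or line.startswith("Error occurred"):
            problems.append(line)
    return exit_code, problems


# Three lines copied from the log quoted in this thread.
log = """\
[  212.678177] cloud-init[2664]: Error occurred during build: Command run_init failed
[  212.690362] cloud-init[2664]: + /usr/local/bin/cfn-signal -e 1 --stack spotty-nvidia-docker-ami-sdrkb2nt --region us-west-2 --resource InstanceReadyWaitCondition
[  212.929111] cloud-init[2664]: + exit 1
"""

code, problems = summarize(log)
print(code)
for line in problems:
    print(line)
```

A non-zero code here only says that the init script failed. The underlying cause (in this case the Docker CE dependency) still has to be found further up in the log.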