spotty-cloud / spotty

Training deep learning models on AWS and GCP instances
https://spotty.cloud
MIT License

spotty create-ami not working on 1.1.9 #34

Closed jagin closed 5 years ago

jagin commented 5 years ago

Hi Oleg,

I was trying to start a new project from scratch with version 1.1.9 and was not able to create an AMI. At first it was hard to spot what caused the problem, because once the stack had been removed I could not find the logs from the instance that creates the AMI (are they removed with the stack?). I managed to catch them before the stack was removed, and here is the cause:

2019-03-12 21:05:58,884 P8710 [INFO] + apt-get install -y nvidia-docker2=2.0.3+docker18.09.2-1
2019-03-12 21:05:58,884 P8710 [INFO] Reading package lists...
2019-03-12 21:05:58,884 P8710 [INFO] Building dependency tree...
2019-03-12 21:05:58,885 P8710 [INFO] Reading state information...
2019-03-12 21:05:58,885 P8710 [INFO] Some packages could not be installed. This may mean that you have
2019-03-12 21:05:58,885 P8710 [INFO] requested an impossible situation or if you are using the unstable
2019-03-12 21:05:58,885 P8710 [INFO] distribution that some required packages have not yet been created
2019-03-12 21:05:58,885 P8710 [INFO] or been moved out of Incoming.
2019-03-12 21:05:58,885 P8710 [INFO] The following information may help to resolve the situation:
2019-03-12 21:05:58,885 P8710 [INFO]
2019-03-12 21:05:58,885 P8710 [INFO] The following packages have unmet dependencies:
2019-03-12 21:05:58,885 P8710 [INFO] nvidia-docker2 : Depends: nvidia-container-runtime (= 2.0.0+docker18.09.2-1) but 2.0.0+docker18.09.3-1 is to be installed
2019-03-12 21:05:58,885 P8710 [INFO] E: Unable to correct problems, you have held broken packages.
2019-03-12 21:05:58,885 P8710 [INFO] ------------------------------------------------------------
2019-03-12 21:05:58,885 P8710 [ERROR] Exited with error code 100

I can see some changes related to nvidia-docker2 in the spotty/data/create_ami.yaml file, which perhaps cause the problem.

I went back to version 1.1.8 and the AMI was created without any problem.

I think the same problem may affect dev-1.2: I started from it and also wasn't able to create an AMI, but couldn't find the cause, so I went back to the 'stable' :) version 1.1.9 and then down to 1.1.8.

apls777 commented 5 years ago

Hi @jagin,

This was the reason why I released v1.1.9: a new version of Docker CE was released, but the latest version of NVIDIA Docker required the previous Docker CE. So I started using specific versions of docker-ce and nvidia-docker. It worked before, but it seems I missed something. I'll fix it this evening. Thank you!
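
For readers hitting the same error: the log shows that nvidia-docker2=2.0.3+docker18.09.2-1 requires nvidia-container-runtime at exactly 2.0.0+docker18.09.2-1, while apt selected the newer 2.0.0+docker18.09.3-1 build. One possible workaround, sketched below, is to pin the runtime package to the matching version in the same install command. This is only an illustration and not necessarily the change that went into the fix; the variable names are hypothetical, and further packages (docker-ce itself, for instance) may need pinning too. The script only prints the command rather than running it.

```shell
#!/bin/sh
set -eu

# Docker CE build suffix, taken from the apt error in the log above.
DOCKER_SUFFIX="docker18.09.2-1"

# Package versions reported in the log; both must share the same suffix.
NVIDIA_DOCKER_PKG="nvidia-docker2=2.0.3+${DOCKER_SUFFIX}"
NVIDIA_RUNTIME_PKG="nvidia-container-runtime=2.0.0+${DOCKER_SUFFIX}"

# Pin the dependency explicitly so apt does not pick the newer runtime build.
INSTALL_CMD="apt-get install -y ${NVIDIA_RUNTIME_PKG} ${NVIDIA_DOCKER_PKG}"

# Dry run: print the command instead of executing it.
echo "${INSTALL_CMD}"
```

To see which builds a repository currently offers before choosing a suffix, `apt-cache madison nvidia-container-runtime` lists the available versions.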

apls777 commented 5 years ago

@jagin It's fixed now. Just released version 1.1.10 and updated the "dev-1.2" branch as well.