Having a similar issue when creating a cluster. My nodes are unusable.
I checked the startup error logs in /mnt/batch/tasks/startup/error.json and it says:
{"Code":"nvidiaConfigFail","Message":"NVIDIA GPU configuration failed unexpectedly","Category":"InternalError","ExitCode":1,"Details":[{"Key":"Reason","Value":"Failed to install nvidia-docker"}]}
Similarly, from /mnt/batch/tasks/startup/stderr.txt
(...)
2018/04/03 00:40:50 install nvidia-docker
dpkg: dependency problems prevent configuration of nvidia-docker:
nvidia-docker2 (2.0.2+docker17.12.0-1) breaks nvidia-docker and is installed.
dpkg: error processing package nvidia-docker (--install):
dependency problems - leaving unconfigured
Errors were encountered while processing:
nvidia-docker
Sounds like the same issue. Still looking for suggestions.
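In the meantime, this is the manual workaround I'm considering on an affected node. It is only a rough sketch assuming root SSH access to the node; I'm not sure hand-patching Batch AI nodes like this is supported, and I don't know whether it flips the node back from unusable:

```python
#!/usr/bin/env python3
"""Sketch: clear the nvidia-docker / nvidia-docker2 conflict on one node.
Run as root on the node itself; package names come from the dpkg error above."""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Drop the half-configured old nvidia-docker 1.0 package that the setup
# script tried to install; nvidia-docker2 (already on the image per the log)
# supersedes it.
run(["apt-get", "purge", "-y", "nvidia-docker"])
# Make sure nvidia-docker2 itself is installed and configured.
run(["apt-get", "install", "-y", "nvidia-docker2"])
# Reload the Docker daemon so it picks up the nvidia runtime configuration.
run(["pkill", "-SIGHUP", "dockerd"])
```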
Hi, the issue was related to the new version of the DSVM. We are rolling out a fix for this.
The fix is out. Sorry for the inconvenience.
Can anyone confirm this is actually resolved? If so, what do I have to do to fix my cluster?
@CameronVetter Please recreate your Batch AI cluster to get the fix for this.
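If you created the cluster from Python, recreating it looks roughly like the following. This is only a sketch assuming the 2018-era azure-mgmt-batchai 1.x SDK (the pre-workspace API); the resource group, cluster name, credentials, and settings below are placeholders, and field names may differ in your SDK version:

```python
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.batchai import BatchAIManagementClient
import azure.mgmt.batchai.models as models

# Placeholder credentials and subscription; substitute your own values.
credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<secret>", tenant="<tenant-id>")
client = BatchAIManagementClient(credentials, "<subscription-id>")

# Delete the broken cluster and wait for the long-running operation to finish.
client.clusters.delete("<resource-group>", "<cluster-name>").result()

# Recreate it with the same settings you used originally; the new nodes
# will be provisioned with the fixed setup.
parameters = models.ClusterCreateParameters(
    location="eastus",
    vm_size="STANDARD_NC6",
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=2)),
    user_account_settings=models.UserAccountSettings(
        admin_user_name="<admin-user>", admin_user_password="<password>"))
client.clusters.create("<resource-group>", "<cluster-name>", parameters).result()
```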
I am trying to run the Python CNTK distributed GPU recipe, and the job is failing with the following error:
Job state: queued ExitCode: None
Cluster state: AllocationState.steady
Target: 2; Allocated: 2; Idle: 0; Unusable: 2; Running: 0; Preparing: 0; Leaving: 0
Cluster error: nvidiaConfigFail: NVIDIA GPU configuration failed unexpectedly
Details: Reason: Failed to install nvidia-docker
Any suggestions?
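For reference, this is roughly how I dump the per-node setup error over SSH, just reading the /mnt/batch/tasks/startup/error.json file mentioned earlier in this thread:

```python
import json

# Read the startup error report Batch AI writes on the node; the field
# names (Code, Message, Details) match the JSON quoted at the top of this issue.
with open("/mnt/batch/tasks/startup/error.json") as f:
    error = json.load(f)

print(error.get("Code"), "-", error.get("Message"))
for detail in error.get("Details", []):
    print("  {}: {}".format(detail.get("Key"), detail.get("Value")))
```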