Having a similar issue when creating a cluster. My nodes are unusable.
I checked the startup error logs in /mnt/batch/tasks/startup/error.json and it says:
{"Code":"nvidiaConfigFail","Message":"NVIDIA GPU configuration failed unexpectedly","Category":"InternalError","ExitCode":1,"Details":[{"Key":"Reason","Value":"Failed to install nvidia-docker"}]}
Similarly, from /mnt/batch/tasks/startup/stderr.txt
(...)
2018/04/03 00:40:50 install nvidia-docker
dpkg: dependency problems prevent configuration of nvidia-docker:
nvidia-docker2 (2.0.2+docker17.12.0-1) breaks nvidia-docker and is installed.
dpkg: error processing package nvidia-docker (--install):
dependency problems - leaving unconfigured
Errors were encountered while processing:
nvidia-docker
Sounds like the same issue. Still looking for suggestions.
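In the meantime, this is the manual workaround I'm considering on an affected node. It is only a rough sketch assuming root SSH access to the node; I'm not sure hand-patching Batch AI nodes like this is supported, and I don't know whether it flips the node back from unusable:

```python
#!/usr/bin/env python3
"""Sketch: clear the nvidia-docker / nvidia-docker2 conflict on one node.
Run as root on the node itself; package names come from the dpkg error above."""
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Drop the half-configured old nvidia-docker 1.0 package that the setup
# script tried to install; nvidia-docker2 (already on the image per the log)
# supersedes it.
run(["apt-get", "purge", "-y", "nvidia-docker"])
# Make sure nvidia-docker2 itself is installed and configured.
run(["apt-get", "install", "-y", "nvidia-docker2"])
# Reload the Docker daemon so it picks up the nvidia runtime configuration.
run(["pkill", "-SIGHUP", "dockerd"])
```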
Hi, the issue was related to the new version of the DSVM. We are rolling out a fix for this.
The fix is out. Sorry for the inconvenience.
Can anyone confirm this is actually resolved? If so, what do I have to do to fix my cluster?
@CameronVetter Please recreate your Batch AI cluster to get the fix for this.
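If you created the cluster from Python, recreating it looks roughly like the following. This is only a sketch assuming the 2018-era azure-mgmt-batchai 1.x SDK (the pre-workspace API); the resource group, cluster name, credentials, and settings below are placeholders, and field names may differ in your SDK version:

```python
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.batchai import BatchAIManagementClient
import azure.mgmt.batchai.models as models

# Placeholder credentials and subscription; substitute your own values.
credentials = ServicePrincipalCredentials(
    client_id="<app-id>", secret="<secret>", tenant="<tenant-id>")
client = BatchAIManagementClient(credentials, "<subscription-id>")

# Delete the broken cluster and wait for the long-running operation to finish.
client.clusters.delete("<resource-group>", "<cluster-name>").result()

# Recreate it with the same settings you used originally; the new nodes
# will be provisioned with the fixed setup.
parameters = models.ClusterCreateParameters(
    location="eastus",
    vm_size="STANDARD_NC6",
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=2)),
    user_account_settings=models.UserAccountSettings(
        admin_user_name="<admin-user>", admin_user_password="<password>"))
client.clusters.create("<resource-group>", "<cluster-name>", parameters).result()
```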
I am trying to run the Python CNTK distributed GPU recipe, and the job is failing with the following error:
Job state: queued ExitCode: None
Cluster state: AllocationState.steady
Target: 2; Allocated: 2; Idle: 0; Unusable: 2; Running: 0; Preparing: 0; Leaving: 0
Cluster error: nvidiaConfigFail: NVIDIA GPU configuration failed unexpectedly
Details: Reason: Failed to install nvidia-docker
Any suggestions?
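For reference, this is roughly how I dump the per-node setup error over SSH, just reading the /mnt/batch/tasks/startup/error.json file mentioned earlier in this thread:

```python
import json

# Read the startup error report Batch AI writes on the node; the field
# names (Code, Message, Details) match the JSON quoted at the top of this issue.
with open("/mnt/batch/tasks/startup/error.json") as f:
    error = json.load(f)

print(error.get("Code"), "-", error.get("Message"))
for detail in error.get("Details", []):
    print("  {}: {}".format(detail.get("Key"), detail.get("Value")))
```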