microsoftarchive / BatchAI

Repo for publishing code Samples and CLI samples for BatchAI service
MIT License

Batch AI Job fails with BFSMountError Info:Failed to install cifs-utils package #56

Status: Closed (spdjudd closed this issue 6 years ago)

spdjudd commented 6 years ago

Background:

I set up an experiment based on the recipes here, but using the linux-data-science-vm-ubuntu image on my cluster to run a scikit-learn job. This had been working fine for the last week or two, up until yesterday (1st Aug).

Problem: Now when my job starts on a node it fails before running any of my code with the following:

Job state: failed ExitCode: 1
FailureDetails: 
ErrorCode:BFSMountError
ErrorMessage:unable to mount blob fuse file system
Details:
Info:Failed to install cifs-utils package

I've tried recreating everything in a different subscription, with the same result. Similar jobs that mount the same blob and file shares, but run on a GPU cluster with the default VM image and a TensorFlow Docker container, still work fine. I also tried an earlier version of the linux-data-science-vm-ubuntu image, to no avail.

Any ideas?

spdjudd commented 6 years ago

Here's the stderr.txt from the node. It shows a permission-denied error when trying to install cifs-utils during job environment preparation:

2018/08/02 12:11:44 Ping Docker returned: 0xc4200764e0
2018/08/02 12:11:44 Removed all existing containers
2018/08/02 12:11:44 Unmounting previous job level file systems
2018/08/02 12:11:44 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/batchaiworkspace/treebagger_experiment/treebagger_08_02_2018_121135/config
2018/08/02 12:11:44 Version: 3.0.00573.0001 Branch: merge Commit: 24e1bd42
2018/08/02 12:11:44 Running required HostTool version, skipping auto-update
2018/08/02 12:11:44 Executing 'Copy hosttool executable' on 10.0.0.4
2018/08/02 12:11:44 Copy hosttool executable succeeded on 10.0.0.4. Output:
>>>
>>>
2018/08/02 12:11:44 Executing 'job environment preparation' on 10.0.0.4
2018/08/02 12:12:09 job environment preparation failed on 10.0.0.4. Output:
>>>   2018/08/02 12:11:44 Version: 3.0.00573.0001 Branch: merge Commit: 24e1bd42
>>>   2018/08/02 12:11:44 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/batchaiworkspace/treebagger_experiment/treebagger_08_02_2018_121135/wd
>>>   2018/08/02 12:11:44 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/batchaiworkspace/treebagger_experiment/treebagger_08_02_2018_121135/config
>>>   2018/08/02 12:11:44 Mounting job level file systems
>>>   2018/08/02 12:11:44 Creating directory /mnt/batch/tasks/shared/LS_root/jobs/batchaiworkspace/treebagger_experiment/treebagger_08_02_2018_121135/mounts
>>>   2018/08/02 12:11:44 No NFS configured
>>>   2018/08/02 12:11:44 Executing dpkg --configure -a; apt-get install -y -q --no-install-recommends cifs-utils
>>>   dpkg: error: requested operation requires superuser privilege
>>>   E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
>>>   E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
>>>   2018/08/02 12:11:44 retrying ...
>>>   dpkg: error: requested operation requires superuser privilege
>>>   E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
>>>   E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
>>>   2018/08/02 12:11:45 retrying ...
>>>   dpkg: error: requested operation requires superuser privilege
>>>   E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
>>>   E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
>>>   2018/08/02 12:11:47 retrying ...
>>>   dpkg: error: requested operation requires superuser privilege
>>>   E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
>>>   E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
>>>   2018/08/02 12:11:51 retrying ...
>>>   dpkg: error: requested operation requires superuser privilege
>>>   E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
>>>   E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
>>>   2018/08/02 12:11:59 retrying ...
>>>   dpkg: error: requested operation requires superuser privilege
>>>   E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
>>>   E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
>>>   2018/08/02 12:12:09 Giving up execution of dpkg --configure -a; apt-get install -y -q --no-install-recommends cifs-utils. Last error: exit status 100
>>>   2018/08/02 12:12:09 Reporting an error: InternalError - unable to mount blob fuse file system:{
>>>   Info: Failed to install cifs-utils package
>>>   }
>>>   2018/08/02 12:12:09 Failed to mount job level filesystems: InternalError - unable to mount blob fuse file system:{
>>>   Info: Failed to install cifs-utils package
>>>   }
>>>
2018/08/02 12:12:09 Failed to start the coordination task: InternalError - failed to prepare an environment for the job execution:{
Info: job environment preparation failed on 10.0.0.4.
}
2018/08/02 12:12:09 Executing 'sync mounted file systems' on 10.0.0.4
2018/08/02 12:12:09 sync mounted file systems succeeded on 10.0.0.4. Output:
>>>   2018/08/02 12:12:09 Version: 3.0.00573.0001 Branch: merge Commit: 24e1bd42
>>>
2018/08/02 12:12:09 Executing 'unmount mounted file systems' on 10.0.0.4
2018/08/02 12:12:09 unmount mounted file systems succeeded on 10.0.0.4. Output:
>>>   2018/08/02 12:12:09 Version: 3.0.00573.0001 Branch: merge Commit: 24e1bd42
>>>   2018/08/02 12:12:09 Unmounting /mnt/batch/tasks/shared/LS_root/jobs/batchaiworkspace/treebagger_experiment/treebagger_08_02_2018_121135 with 1m0s timeout
>>>
2018/08/02 12:12:09 Executing 'jobRelease task' on 10.0.0.4
2018/08/02 12:12:09 jobRelease task succeeded on 10.0.0.4. Output:
>>>   2018/08/02 12:12:09 Version: 3.0.00573.0001 Branch: merge Commit: 24e1bd42
>>>   2018/08/02 12:12:09 removing container treebagger_08_02_2018_121135 exited with 1
>>>
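For anyone triaging a similar stderr.txt, here is a minimal Python sketch that pulls out the command that was given up on, its exit status, the number of retries, and whether the output points at missing root privileges. The function name `summarize_stderr`, the dictionary keys, and the shortened sample excerpt are illustrative assumptions, not part of any Batch AI tooling; the patterns are only matched against the log wording shown above and may not hold for other HostTool versions.

```python
import re

# Pattern based on the "Giving up execution of ..." line in the log above.
# Assumption: other HostTool versions may word this line differently.
GIVE_UP = re.compile(
    r"Giving up execution of (?P<cmd>.+?)\. Last error: exit status (?P<code>\d+)"
)


def summarize_stderr(text):
    """Summarize a failed, retried command found in node stderr text."""
    summary = {"command": None, "exit_code": None, "retries": 0, "needs_root": False}
    for line in text.splitlines():
        # Each "retrying ..." line marks one additional attempt.
        if line.rstrip().endswith("retrying ..."):
            summary["retries"] += 1
        # dpkg and apt-get both report missing privileges in their own wording.
        if "are you root?" in line or "requires superuser privilege" in line:
            summary["needs_root"] = True
        match = GIVE_UP.search(line)
        if match:
            summary["command"] = match.group("cmd")
            summary["exit_code"] = int(match.group("code"))
    return summary


# Shortened excerpt of the log above (hypothetical sample with two retries only).
SAMPLE = """\
>>>   dpkg: error: requested operation requires superuser privilege
>>>   E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
>>>   2018/08/02 12:11:44 retrying ...
>>>   dpkg: error: requested operation requires superuser privilege
>>>   2018/08/02 12:11:45 retrying ...
>>>   2018/08/02 12:12:09 Giving up execution of dpkg --configure -a; apt-get install -y -q --no-install-recommends cifs-utils. Last error: exit status 100
"""

result = summarize_stderr(SAMPLE)
print(result)
```

If `needs_root` comes back true, as it would for this log, the package install ran without superuser privileges, which suggests a problem in the node's job preparation step and not in the job's own code.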
llidev commented 6 years ago

@spdjudd Thanks for reporting this. I assume you are using the DSVM in the North Europe region, aren't you? We are testing a new feature in that region, and this appears to be a bug.

We just pushed a fix for this. Could you please resubmit your job to verify it?

spdjudd commented 6 years ago

@lliimsft Yes, I'm using North Europe. I've just tested and can confirm it's working again now, so thanks very much for the quick response!