mozilla / snakepit

Machine learning job scheduler
Mozilla Public License 2.0

Could not get lock /var/lib/dpkg/lock-frontend #140

Closed JRMeyer closed 5 years ago

JRMeyer commented 5 years ago

@tilmankamp this error kills my jobs seemingly at random, and it happens so frequently it really slows down my workflow.

The job doesn't die instantaneously, but only after a minute or more. This means that I have to go back and re-run the job and re-set all hyper-parameters I'm testing... which is a pain.

josh@carbon:~/git/DeepSpeech$ pit log 4098
[2019-02-28 23:41:24] [prepare] Preparation started...
[2019-02-28 23:41:24] [prepare] + set -o pipefail
[2019-02-28 23:41:24] [prepare] + mkdir /data/pits/4098/tmp
[2019-02-28 23:41:24] [prepare] + '[' -n '' ']'
[2019-02-28 23:41:24] [prepare] + mkdir /data/pits/4098/keep
[2019-02-28 23:41:24] [prepare] + job_src_dir=/data/pits/4098/src
[2019-02-28 23:41:24] [prepare] + '[' -f /data/pits/4098/origin ']'
[2019-02-28 23:41:24] [prepare] + origin=https://github.com/mozilla/DeepSpeech.git
[2019-02-28 23:41:24] [prepare] + '[' -f /data/pits/4098/hash ']'
[2019-02-28 23:41:24] [prepare] + hash=059408428195ac2b1a4d6ada0bbc4e9c09c2c7aa
[2019-02-28 23:41:24] [prepare] + archive=/data/pits/4098/archive.tar.gz
[2019-02-28 23:41:24] [prepare] + '[' -n https://github.com/mozilla/DeepSpeech.git ']'
[2019-02-28 23:41:24] [prepare] + mkdir -p /data/cache
[2019-02-28 23:41:24] [prepare] ++ echo -n https://github.com/mozilla/DeepSpeech.git
[2019-02-28 23:41:24] [prepare] ++ md5sum
[2019-02-28 23:41:24] [prepare] ++ cut -f1 '-d '
[2019-02-28 23:41:24] [prepare] + cache_entry=05a1655c8cd6700297312ea41c6dbb40
[2019-02-28 23:41:24] [prepare] + cache_repo=/data/cache/05a1655c8cd6700297312ea41c6dbb40
[2019-02-28 23:41:24] [prepare] + '[' -d /data/cache/05a1655c8cd6700297312ea41c6dbb40 ']'
[2019-02-28 23:41:24] [prepare] + git -C /data/cache/05a1655c8cd6700297312ea41c6dbb40 fetch --all
[2019-02-28 23:41:25] [prepare] + touch /data/cache/05a1655c8cd6700297312ea41c6dbb40
[2019-02-28 23:41:25] [prepare] + cp -r /data/cache/05a1655c8cd6700297312ea41c6dbb40 /data/pits/4098/src
[2019-02-28 23:41:38] [prepare] + cd /data/pits/4098/src
[2019-02-28 23:41:38] [prepare] + git reset --hard 059408428195ac2b1a4d6ada0bbc4e9c09c2c7aa
[2019-02-28 23:41:39] [prepare] HEAD is now at 0594084 wheel
[2019-02-28 23:41:39] [prepare] + git lfs pull
Git LFS: (2 of 2 files) 149.34 KB / 149.34 KB                                  
[2019-02-28 23:41:40] [prepare] + cd /data/pits/4098/src
[2019-02-28 23:41:40] [prepare] + patch_file=/data/pits/4098/git.patch
[2019-02-28 23:41:40] [prepare] + '[' -f /data/pits/4098/git.patch ']'
[2019-02-28 23:41:40] [prepare] + cat /data/pits/4098/git.patch
[2019-02-28 23:41:40] [prepare] + patch -p0
[2019-02-28 23:41:40] [prepare] patching file .compute
[2019-02-28 23:41:40] [prepare] patching file evaluate.py
[2019-02-28 23:41:40] [prepare] + echo 'Preparation done.'
[2019-02-28 23:41:40] [prepare] Preparation done.
[2019-02-28 22:41:53] [daemon] Pit daemon started
[2019-02-28 22:42:05] [worker 0] Worker 0 started
[2019-02-28 22:42:05] [worker 0] Preparing script execution...
[2019-02-28 22:42:05] [worker 0] Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
[2019-02-28 22:42:05] [worker 0] Get:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
[2019-02-28 22:42:05] [worker 0] Get:3 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
[2019-02-28 22:42:06] [worker 0] Ign:4 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
[2019-02-28 22:42:06] [worker 0] Get:5 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
[2019-02-28 22:42:06] [worker 0] Get:6 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [564 B]
[2019-02-28 22:42:06] [worker 0] Get:7 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [819 B]
[2019-02-28 22:42:06] [worker 0] Get:8 http://archive.ubuntu.com/ubuntu bionic-updates/main Sources [249 kB]
[2019-02-28 22:42:06] [worker 0] Get:9 http://archive.ubuntu.com/ubuntu bionic-updates/universe Sources [133 kB]
[2019-02-28 22:42:06] [worker 0] Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [531 kB]
[2019-02-28 22:42:06] [worker 0] Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/main Translation-en [198 kB]
[2019-02-28 22:42:06] [worker 0] Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [737 kB]
[2019-02-28 22:42:06] [worker 0] Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/universe Translation-en [189 kB]
[2019-02-28 22:42:06] [worker 0] Get:14 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [54.8 kB]
[2019-02-28 22:42:06] [worker 0] Get:15 http://security.ubuntu.com/ubuntu bionic-security/main Sources [76.5 kB]
[2019-02-28 22:42:06] [worker 0] Get:16 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [270 kB]
[2019-02-28 22:42:07] [worker 0] Get:17 http://security.ubuntu.com/ubuntu bionic-security/main Translation-en [101 kB]
[2019-02-28 22:42:07] [worker 0] Get:18 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [126 kB]
[2019-02-28 22:42:07] [worker 0] Get:19 http://security.ubuntu.com/ubuntu bionic-security/universe Translation-en [71.4 kB]
[2019-02-28 22:42:10] [worker 0] Fetched 2992 kB in 2s (1670 kB/s)
[2019-02-28 22:42:11] [worker 0] Reading package lists...
[2019-02-28 22:42:11] [worker 0] Starting script...
[2019-02-28 22:42:11] [worker 0] + _LANG=cy
[2019-02-28 22:42:11] [worker 0] + CV=/data/ro/shared/data/mozilla/CommonVoice/v2.0-alpha2.0/cy
[2019-02-28 22:42:11] [worker 0] + apt-get install -y python3-venv
[2019-02-28 22:42:11] [worker 0] E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
[2019-02-28 22:42:11] [worker 0] E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
[2019-02-28 22:42:11] [worker 0] Worker 0 ended with exit code 100
[2019-02-28 22:42:12] [daemon] Worker 0 requested stop. Stopping pit...
[2019-02-28 22:42:13] [daemon] Worker 0 requested stop. Stopping pit...
[2019-02-28 22:42:14] [daemon] Worker 0 requested stop. Stopping pit...
[2019-02-28 22:42:15] [daemon] Worker 0 requested stop. Stopping pit...
[2019-02-28 23:42:17] [clean] Cleaning started...
[2019-02-28 23:42:17] [clean] + rm -rf /data/pits/4098/tmp
[2019-02-28 23:42:17] [clean] + rm -rf /data/pits/4098/src
[2019-02-28 23:42:20] [clean] + echo 'Cleaning done.'
[2019-02-28 23:42:20] [clean] Cleaning done.

tilmankamp commented 5 years ago

I think I'll implement a wrapper like the one from this answer: https://askubuntu.com/a/375031 ...
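For readers hitting the same error, here is a minimal sketch of what a retrying wrapper could look like. It is not taken from the linked answer or from snakepit's code; the function name `retry` and its arguments are made up for illustration. It relies only on the fact, visible in the log above, that `apt-get` exits non-zero (code 100) when it cannot get the lock.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of snakepit): run a command up to
# max_attempts times, sleeping `delay` seconds between failed attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt status=1
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        if "$@"; then
            return 0
        else
            status=$?   # exit code of the failed command
        fi
        echo "attempt ${attempt}/${max_attempts} failed with exit code ${status}: $*" >&2
        # Only sleep if another attempt will follow.
        if (( attempt < max_attempts )); then
            sleep "$delay"
        fi
    done
    # Give up and pass on the exit code of the last attempt.
    return "$status"
}

# Example call in a job script (commented out so the file can be sourced safely):
# retry 30 2 apt-get install -y python3-venv
```

Note that this retries on any failure, not only on the lock error, so a genuinely broken install would just fail a little later instead of immediately.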

JRMeyer commented 5 years ago

just got this error in job 4166

[2019-03-04 06:17:49] [worker 0] + apt-get update -y
[2019-03-04 06:17:49] [worker 0] Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
[2019-03-04 06:17:49] [worker 0] Hit:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
[2019-03-04 06:17:49] [worker 0] Hit:3 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
[2019-03-04 06:17:49] [worker 0] Ign:4 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
[2019-03-04 06:17:49] [worker 0] Get:5 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [564 B]
[2019-03-04 06:17:49] [worker 0] Get:6 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [819 B]
[2019-03-04 06:17:49] [worker 0] Hit:7 http://security.ubuntu.com/ubuntu bionic-security InRelease
[2019-03-04 06:17:50] [worker 0] Err:6 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg
[2019-03-04 06:17:50] [worker 0]   The following signatures were invalid: BADSIG F60F4B3D7FA2AF80 cudatools <cudatools@nvidia.com>
[2019-03-04 06:17:50] [worker 0] Fetched 1383 B in 1s (1043 B/s)
[2019-03-04 06:17:51] [worker 0] Reading package lists...
[2019-03-04 06:17:51] [worker 0] W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release: The following signatures were invalid: BADSIG F60F4B3D7FA2AF80 cudatools <cudatools@nvidia.com>
[2019-03-04 06:17:51] [worker 0] W: Failed to fetch http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/Release.gpg  The following signatures were invalid: BADSIG F60F4B3D7FA2AF80 cudatools <cudatools@nvidia.com>
[2019-03-04 06:17:51] [worker 0] W: Some index files failed to download. They have been ignored, or old ones used instead.
[2019-03-04 06:17:51] [worker 0] + apt-get install -y python3-venv
[2019-03-04 06:17:51] [worker 0] E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
[2019-03-04 06:17:51] [worker 0] E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
[2019-03-04 06:17:51] [worker 0] Worker 0 ended with exit code 100

tilmankamp commented 5 years ago

This is hard to debug, as it is very intermittent. I have to figure out a way to reproduce it reliably.

tilmankamp commented 5 years ago

Now stopping apt processes before script execution. Let's see if this happens again. https://github.com/mozilla/snakepit/commit/b361786f9285285461b31231ce51cfe226ef7792

JRMeyer commented 5 years ago

@tilmankamp - it happened again (job 4314)

[2019-03-19 00:52:03] [worker 0] Starting script...
[2019-03-19 00:52:04] [worker 0] + apt-get install -y python3-venv
[2019-03-19 00:52:04] [worker 0] E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
[2019-03-19 00:52:04] [worker 0] E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
[2019-03-19 00:52:04] [worker 0] Worker 0 ended with exit code 100
[2019-03-19 00:52:05] [daemon] Worker 0 requested stop. Stopping pit...

JRMeyer commented 5 years ago

happened in job 4333 and then 4336 --- these were both single GPU jobs

UPDATE: I had an apt-get install call before I set HTTP_PROXY:

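# check HTTP_PROXY: source /etc/profile only if no http_proxy variable is set
# (grep -q prints nothing, so test its exit status rather than its output)
if ! env | grep -iq "^http_proxy="; then
    source /etc/profile
fi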

I moved the apt-get install after this if condition, and now it seems to work. I think I ran a git merge and the apt-get install snuck in. So far the job is running.

tilmankamp commented 5 years ago

It's very unlikely that the lock issue is the result of a missing proxy config. I think the actual problem persists.

What do you mean by "I think I ran a git merge and the apt-get install snuck in"?

tilmankamp commented 5 years ago

Added a 10-second delay before script execution. Let's see how this plays out.
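A fixed delay is the simplest option. As an alternative, a job script could poll until nothing holds the lock files, roughly as sketched below. This is not what snakepit does; the function name `wait_for_dpkg_lock` is hypothetical, and the sketch assumes `fuser` (from the psmisc package) is installed and that the script runs as root so it can see other users' processes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of snakepit): wait until no process has the
# dpkg/apt lock files open, or give up after a timeout.
# Usage: wait_for_dpkg_lock [timeout_seconds] [interval_seconds]
wait_for_dpkg_lock() {
    local timeout=${1:-60}   # seconds to wait before giving up
    local interval=${2:-1}   # seconds between checks
    local waited=0
    # Default lock paths on Ubuntu 18.04; adjust for other setups.
    local locks=(/var/lib/dpkg/lock-frontend /var/lib/dpkg/lock /var/lib/apt/lists/lock)
    # fuser exits 0 when at least one process has one of the files open.
    while fuser "${locks[@]}" >/dev/null 2>&1; do
        if (( waited >= timeout )); then
            echo "dpkg lock still held after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    return 0
}

# Example call in a job script (commented out so the file can be sourced safely):
# wait_for_dpkg_lock 120 && apt-get install -y python3-venv
```

The check is inherently racy: another process can still grab the lock between the last `fuser` call and the `apt-get` call, so this narrows the window rather than closing it.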

tilmankamp commented 5 years ago

The delay seems to have fixed it.

khalid5454 commented 5 years ago

Check this link; it should work for you:

https://itsfoss.com/could-not-get-lock-error/

khalid5454 commented 5 years ago

I tried it and it solved the same error for me.