I think I'll implement a wrapper like the one from this answer: https://askubuntu.com/a/375031 ...
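Something along these lines, assuming the linked answer's approach of waiting for the dpkg/apt lock files to be released before running apt-get (the script name and the exact set of lock paths are my guesses, not the answer's verbatim code):

#!/bin/bash
# apt-wait.sh -- hypothetical wrapper: block until no process holds the
# apt/dpkg lock files, then forward all arguments to apt-get.
set -e

# Lock files that apt/dpkg keep open while another install is in progress.
locks="/var/lib/dpkg/lock-frontend /var/lib/dpkg/lock /var/lib/apt/lists/lock"

for lock in $locks; do
    # fuser exits 0 as long as some process still has the lock file open.
    while fuser "$lock" >/dev/null 2>&1; do
        echo "Waiting for $lock to be released..."
        sleep 1
    done
done

exec apt-get "$@"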
Just got this error in job 4166:
[2019-03-04 06:17:49] [worker 0] + apt-get update -y
[2019-03-04 06:17:49] [worker 0] Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
[2019-03-04 06:17:49] [worker 0] Hit:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
[2019-03-04 06:17:49] [worker 0] Hit:3 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
[2019-03-04 06:17:49] [worker 0] Ign:4 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease
[2019-03-04 06:17:49] [worker 0] Get:5 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release [564 B]
[2019-03-04 06:17:49] [worker 0] Get:6 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release.gpg [819 B]
[2019-03-04 06:17:49] [worker 0] Hit:7 http://security.ubuntu.com/ubuntu bionic-security InRelease
[2019-03-04 06:17:50] [worker 0] Err:6 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release.gpg
[2019-03-04 06:17:50] [worker 0] The following signatures were invalid: BADSIG F60F4B3D7FA2AF80 cudatools <cudatools@nvidia.com>
[2019-03-04 06:17:50] [worker 0] Fetched 1383 B in 1s (1043 B/s)
[2019-03-04 06:17:51] [worker 0] Reading package lists...
[2019-03-04 06:17:51] [worker 0] W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release: The following signatures were invalid: BADSIG F60F4B3D7FA2AF80 cudatools <cudatools@nvidia.com>
[2019-03-04 06:17:51] [worker 0] W: Failed to fetch http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/Release.gpg The following signatures were invalid: BADSIG F60F4B3D7FA2AF80 cudatools <cudatools@nvidia.com>
[2019-03-04 06:17:51] [worker 0] W: Some index files failed to download. They have been ignored, or old ones used instead.
[2019-03-04 06:17:51] [worker 0] + apt-get install -y python3-venv
[2019-03-04 06:17:51] [worker 0] E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
[2019-03-04 06:17:51] [worker 0] E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
[2019-03-04 06:17:51] [worker 0] Worker 0 ended with exit code 100
This is hard to debug, as it is very intermittent. I have to figure out a way to reproduce it reliably.
Now stopping apt processes before script execution. Let's see if this happens again. https://github.com/mozilla/snakepit/commit/b361786f9285285461b31231ce51cfe226ef7792
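For reference, a sketch of what stopping the apt processes could look like on an Ubuntu 18.04 worker; this is my assumption about the approach, not the contents of the linked commit:

# Stop the background apt activity that usually holds the dpkg lock.
systemctl stop apt-daily.timer apt-daily-upgrade.timer 2>/dev/null || true
systemctl stop unattended-upgrades.service 2>/dev/null || true

# Then wait until nothing holds the frontend lock any more.
while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
    sleep 1
done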
@tilmankamp - it happened again (job 4314)
[2019-03-19 00:52:03] [worker 0] Starting script...
[2019-03-19 00:52:04] [worker 0] + apt-get install -y python3-venv
[2019-03-19 00:52:04] [worker 0] E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
[2019-03-19 00:52:04] [worker 0] E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
[2019-03-19 00:52:04] [worker 0] Worker 0 ended with exit code 100
[2019-03-19 00:52:05] [daemon] Worker 0 requested stop. Stopping pit...
Happened in job 4333, and then in 4336.
These were both single-GPU jobs.
UPDATE: I had an apt-get install call before I set HTTP_PROXY:

# check HTTP_PROXY
if ! env | grep -iq "^http_proxy="; then
    source /etc/profile
fi

I moved the apt-get install after this if condition, and now it seems to work. I think I ran a git merge and the apt-get install snuck in. So far the job is running.
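To make the fix concrete, the resulting order in the job script is roughly this (a sketch only; the python3-venv install is taken from the logs above, and the rest of the script is omitted):

# proxy settings first, so apt traffic goes through the proxy
if ! env | grep -iq "^http_proxy="; then
    source /etc/profile
fi

# only now touch apt
apt-get update -y
apt-get install -y python3-venv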
It's very unlikely that the lock issue is a result of a missing proxy config. I think the actual problem persists.
What do you mean by "I think I ran a git merge and the apt-get install snuck in"?
Added a 10-second delay before script execution. Let's see how this plays out.
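A minimal sketch of what that delay amounts to in the worker's startup (the variable name is my placeholder, not snakepit's actual code):

sleep 10            # give the host's own apt/dpkg activity time to finish
bash "$JOB_SCRIPT"  # then start the user's job script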
The delay seems to have fixed it.
Check this link, it will work for you. I used it to solve the same error.
@tilmankamp this error kills my jobs seemingly at random, and it happens so frequently it really slows down my workflow.
The job doesn't die instantly, but only after a minute or more. This means I have to go back, re-run the job, and re-set all the hyper-parameters I'm testing... which is a pain.