build-and-provide does not update cowbuilder environment

mark0n commented 8 years ago

I'm running into issues with build-and-provide not updating the cowbuilder environment. This seems to be caused by a lock file that must have been left behind on an earlier run:

<jenkins5:~ >ls /var/run/lock/jessie-amd64.update -l
-rw-r--r-- 1 jenkins-slave jenkins-slave 0 Jun  1 10:20 /var/run/lock/jessie-amd64.update

I'm not sure why these lock files are left behind, but this has happened multiple times on one of our build slaves. The cause probably doesn't matter much (it could simply be the result of a crashed executor). In any case, build-and-provide detects the issue correctly:

15:45:30 + echo '*** Update run already taking place, skipping ***'
15:45:30 *** Update run already taking place, skipping ***

Unfortunately, it seems to handle this case the wrong way: it just prints the warning, skips the cowbuilder update, and proceeds. As a consequence, I end up with a cowbuilder environment that has not seen an apt-get update run for weeks, so builds run against old versions of my libraries.

@mika: I'm still a bit confused by some details of the locking code (the 9> part in particular), but overall it makes sense to me, except for the following line: https://github.com/mika/jenkins-debian-glue/commit/9c37df2e10b0649672ea09fd07760f248dcce46d#diff-8d65555d96716579f851f18d113e9d79R331 Can you please elaborate a bit on your intentions? I believe we should at least sleep until the lock file goes away. We definitely shouldn't proceed with the build unless we are 100% sure that our cowbuilder environment has been updated successfully (and completely!). Even if another cowbuilder --update run is underway, we should wait until it's done and then start our own update run. This would make sure we never miss the latest version of our build dependencies.

mark0n commented 8 years ago

Ok, I spent some time tracking down what triggered this problem in the first place. As it turns out, a user aborted a job while it was updating the cowbuilder environment. Other events that could leave the lock file behind are an executor crash or a loss of power.

linuxmaniac commented 8 years ago

Maybe check the date of the lock file to verify that it is a stale one, and then remove it?

mark0n commented 8 years ago

How about waiting until either a) the lock file is removed by some other process or b) the lock file is more than 10 minutes old?

For (a) we can just continue, for (b) we need to remove the lock file first.
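A minimal sketch of that idea, just to make the two cases concrete. The function name wait_for_lockfile, the poll interval, and the use of the file's modification time as its "age" are my assumptions; none of this exists in build-and-provide. It also relies on GNU stat (stat -c %Y).

```shell
#!/bin/bash
# Hypothetical helper, not part of build-and-provide.
# Returns once the lock file is gone; removes it first if it looks stale.
wait_for_lockfile() {
  local lockfile="$1"
  local max_age="${2:-600}"  # seconds; the 10 minutes proposed above
  local poll="${3:-5}"       # seconds between checks (assumed value)
  local mtime age

  while [ -e "$lockfile" ]; do
    # GNU stat: %Y is the modification time in seconds since the epoch.
    # If the file vanishes between the test and stat, treat it as case (a).
    mtime=$(stat -c %Y "$lockfile" 2>/dev/null) || break
    age=$(( $(date +%s) - mtime ))
    if [ "$age" -gt "$max_age" ]; then
      # Case (b): the file is older than the limit, assume nobody owns it.
      echo "*** Removing stale lock file ${lockfile} (${age}s old) ***"
      rm -f "$lockfile"
      break
    fi
    sleep "$poll"
  done
  return 0
}
```

The age check is only a heuristic: an update run that legitimately takes longer than the limit would have its lock file removed underneath it.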

mark0n commented 8 years ago

Ok, I think I understand the problem a bit better:

The lock file is created by

(
# code that uses the lock
) 9>"${update_lockfile}"

The first line inside this block is acquiring the lock on the open file with file descriptor 9 from the kernel. The lock is automatically released as soon as the file is closed. This should happen even in case the process dies pretty badly. But the file might be left behind! It seems like the real problem is the test here: It doesn't test for the lock but for the file. I guess something like

# remove the whole "Update run already taking place, skipping" block
(
flock --wait 600 9 || exit 1
) 9>"${update_lockfile}"

would be more appropriate.
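To illustrate the difference between testing for the file and testing for the lock, here is a small self-contained sketch. It uses a temporary file in place of ${update_lockfile}, a background subshell in place of a real update run, and a hypothetical helper try_lock; the 3 and 10 second values are arbitrary. It assumes flock from util-linux.

```shell
#!/bin/bash
# Sketch only: a temporary file stands in for ${update_lockfile}.
lockfile=$(mktemp)

# Hypothetical helper: succeeds only if the lock is free right now.
try_lock() {
  ( flock --nonblock 9 ) 9>"$1"
}

# A background "update run" that holds the lock for 3 seconds.
( flock 9; sleep 3 ) 9>"$lockfile" &

sleep 1
# While the holder is alive, a non-blocking attempt fails.
if try_lock "$lockfile"; then while_held=free; else while_held=busy; fi

# Blocking variant as in the snippet above: wait (up to 10 s here) for the holder.
if ( flock --wait 10 9 || exit 1 ) 9>"$lockfile"; then
  after_wait=acquired
else
  after_wait=timeout
fi

# The holder has exited and the file is still there, yet the lock is free.
if [ -e "$lockfile" ] && try_lock "$lockfile"; then leftover=free; else leftover=busy; fi

echo "while held: ${while_held}, after wait: ${after_wait}, leftover file: ${leftover}"
wait
rm -f "$lockfile"
```

On a Linux system this should report busy, acquired, and free in that order: a leftover file by itself does not block anyone, only a live process holding the lock does. One caveat: child processes inherit file descriptor 9, so the lock stays held until every process that inherited it has exited.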

We are currently testing a fix. I'll open a PR as soon as I feel confident about it.

mika / jenkins-debian-glue

build-and-provide does not update cowbuilder environment #158