Closed gopherbot closed 2 months ago
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
TestChtimesOmit seems flaky on linux-arm.
cc @kolyshkin @ianlancetaylor
It's not just linux-arm, it's flaky on my amd64 laptop as well.
> go test -test.run=TestChtimesOmit -test.count=100 os
--- FAIL: TestChtimesOmit (0.00s)
os_test.go:1395: omit atime: true, mtime: false
os_test.go:1395: omit atime: false, mtime: true
os_test.go:1395: omit atime: true, mtime: true
os_test.go:1395: omit atime: false, mtime: false
os_test.go:1475: atime mismatch, got: "2024-08-06 16:19:26.556048593 -0700 PDT", want: "2024-08-06 16:19:25.556048593 -0700 PDT"
--- FAIL: TestChtimesOmit (0.00s)
os_test.go:1395: omit atime: true, mtime: false
os_test.go:1395: omit atime: false, mtime: true
os_test.go:1395: omit atime: true, mtime: true
os_test.go:1395: omit atime: false, mtime: false
os_test.go:1475: atime mismatch, got: "2024-08-06 16:19:26.572048741 -0700 PDT", want: "2024-08-06 16:19:25.572048741 -0700 PDT"
FAIL
FAIL os 0.043s
FAIL
In all cases the desired atime appears to be one second earlier than the measured atime. Perhaps calling utimensat
will update the atime, but only on a one second offset.
Here is a failure shown by strace:
newfstatat(AT_FDCWD, "/tmp/_Go_TestChtimesOmit3451782255", {st_dev=makedev(0xfd, 0x1), st_ino=113514259, st_mode=S_IFREG|0600, st_nlink=1, st_uid=17684, st_gid=89939, st_blksize=4096, st_blocks=8, st_size=13,
st_atime=1722990239 /* 2024-08-06T17:23:59.721116372-0700 */, st_atime_nsec=721116372,
st_mtime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_mtime_nsec=725116236,
st_ctime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_ctime_nsec=725116236}, 0) = 0
utimensat(AT_FDCWD, "/tmp/_Go_TestChtimesOmit3451782255", [UTIME_OMIT,
{tv_sec=1722990238, tv_nsec=725116236} /* 2024-08-06T17:23:58.725116236-0700 */], 0) = 0
newfstatat(AT_FDCWD, "/tmp/_Go_TestChtimesOmit3451782255", {st_dev=makedev(0xfd, 0x1), st_ino=113514259, st_mode=S_IFREG|0600, st_nlink=1, st_uid=17684, st_gid=89939, st_blksize=4096, st_blocks=8, st_size=13,
st_atime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_atime_nsec=725116236,
st_mtime=1722990238 /* 2024-08-06T17:23:58.725116236-0700 */, st_mtime_nsec=725116236,
st_ctime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_ctime_nsec=725116236}, 0) = 0
...
write(1, "--- FAIL: TestChtimesOmit (0.00s"..., 375--- FAIL: TestChtimesOmit (0.00s)
os_test.go:1395: omit atime: true, mtime: false
os_test.go:1475: atime mismatch, got: "2024-08-06 17:23:59.725116236 -0700 PDT", want: "2024-08-06 17:23:59.721116372 -0700 PDT"
os_test.go:1395: omit atime: false, mtime: true
os_test.go:1395: omit atime: true, mtime: true
os_test.go:1395: omit atime: false, mtime: false
) = 375
In this case it's actually not a one second difference. What we see here is that the st_atime_nsec
field changes to the nsec value passed for mtime in the call to utimensat
. The st_atime
seconds field is left unchanged.
Another possibility is that the atime is being set to the ctime, or the original mtime.
Looking at the output of mount
, I see that the disk I am using is mounted with the relatime
option. It appears that relatime means when the file is updated, the atime is updated also if it precedes the mtime or ctime. That is indeed the case in the first state call. So it appears that in this case the utimensat
is updating the atime as well as the mtime, so the test fails.
Looking at some of the passing cases from the same strace run, in those cases the atime is after the mtime in the first stat, so it doesn't get updated.
And relatime appears to be unpredictable. We are deliberately setting atime one second earlier, which means that it precedes ctime. It seems that in some cases relatime kicks in and updates atime to the ctime, causing the test to fail. At least, that is my best guess.
Change https://go.dev/cl/603404 mentions this issue: os: avoid atime confusion in TestChtimesOmit on Linux relatime
I was unable to reproduce it locally (kernel 6.9.7-200.fc40.x86_64).
By default (if $TMPDIR
is not set) these tests are run using files under /tmp
(which is not mounted with relatime
on my system). Out of curiosity, I also tried running this on ext4
(which is mounted with relatime
), as well as remounting /tmp
with relatime
option. No luck in reproducing (ran with -test.count=1000000
for each setup).
I also tried a (much older) kernel 3.10.0-1160.118.1.el7.x86_64 inside a VM, with all the options listed above.
Given into account that tests are run on /tmp (unless TMPDIR is set) and I think it is not usually mounted with relatime
, I think the issue might not be related to relatime.
I will take a fresh look tomorrow.
Also, looking into the failure above (https://github.com/golang/go/issues/68663#issuecomment-2272324260)
os_test.go:1395: omit atime: false, mtime: false
This test case (when both omit* values are false) is doing roughly the same as TestChtimes
(which also decrements times reported by stat by 1 second, sets the new times and checks they went backwards, instead of checking for a specific timestamp).
Looking at the failure:
os_test.go:1475: atime mismatch, got: "2024-08-06 16:19:26.572048741 -0700 PDT", want: "2024-08-06 16:19:25.572048741 -0700 PDT"
I would assume that TestChtimes will fail similarly, but it seems it never fails in LUCI (and there are no special cases for linux or mount options in it).
I would assume that TestChtimes will fail similarly, but it seems it never fails in LUCI
I made the above statement foolishly without looking. At least my prediction that it would fail too is correct -- that test just started failing, too (#68687), at around the same time as this one, and also on arm!
Note that TestChtimes
(unlike the failed test in this issue) hasn't changed for years (except for some cosmetic changes in go.dev/cl/594075), so it's not possible to attribute the new failures to recent changes in test code.
Currently my best guess is there was some change in the test environment (gotip-linux-arm), done on or before 2024-07-30, which resulted in this failure.
Probably just a coincidence, but failures started when release-branch.go1.23
was created off of commit 966688a43a (this is the first commit with the failure). Maybe a new ARM test env was created or reconfigured around that time.
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
If someone who maintains LUCI runners can take a look at what was changed in environment used by gotip-linux-arm
, I'd be grateful.
Still can't reproduce this locally, and am stuck as of what to do about it. I read up on gomote but it seems it requires access I don't have (filed #68779).
Can't think of anything else but opening a "debug" CL and trying to reproduce it that way.
Change https://go.dev/cl/604155 mentions this issue: [DO NOT MERGE] test chtimes failures on gotip-linux-arm
Obtained some info about the testing environment in which it fails:
rw,relatime,discard,errors=remount-ro
(usual options), using SSD.uname -a
) is:
Linux 87493462f40f 6.1.0-17-cloud-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) armv8l GNU/Linux
Also, from 1000 iterations of TestChtimesOmit, we got 115 failures, with all possible values for omitAtime and omitMtime, but about 90% of cases are with omitAtime == false
:
$ grep -B1 mismatch runs | grep omit | sort | uniq -c
66 os_test.go:1424: omit atime: false, mtime: false
37 os_test.go:1424: omit atime: false, mtime: true
8 os_test.go:1424: omit atime: true, mtime: false
4 os_test.go:1424: omit atime: true, mtime: true
A few more raw datapoints.
go1.23-linux-arm
has a configuration very similar to gotip-linux-arm
and fails in a similar way, too (1000 runs, 84 failures, of which 78 are for omitAtime == false
). Uname output is also similar:Linux 87493462f40f 6.1.0-17-cloud-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) armv8l GNU/Linux
gotip-linux-arm64
is not failing. It uses Debian 11 (Bullseye) inside a container (i.e. it's older), and the uname -a
output differs:
Linux golang-tryw-t2a-linux-arm64-bullseye-us-central1-a-0-u2rq 6.1.100+ #1 SMP Fri Aug 2 16:04:36 UTC 2024 aarch64 GNU/Linux
gotip-linux-386
is not failing, and is also using Debian 11 inside container, but the kernel is older:
Linux golang-tryw-n1-linux-x86-bullseye-us-central1-a-0-8gja 5.15.155+ #1 SMP Tue Jul 16 10:15:54 UTC 2024 x86_64 GNU/Linux
gotip-linux-amd64
is not failing, the container distro and the host kernel are similar to gotip-linux-386
case.
All that can be obtained from https://go.dev/cl/604155
Managed to run arm64 debian 12 under qemu-system-aarch64. Ran docker inside. Ran a 32-bit arm container:
docker run -it --platform linux/arm golang:1.23rc2-bookworm
Once inside the container:
setarch armv8l -- bash
go test -build os
./os.test -test.run Chtimes -test.count 1000
As far as I can see it looks very much like the CI environment on which we see the failures, and I was not able to reproduce the issue.
Trying the exact same kernel now (since it's emulation, everything is running very slow).
Trying with the same kernel as in failing test env, and with the same/similar go version
Building go:
root@ae8a9b248736:/go# uname -a
Linux ae8a9b248736 6.1.0-17-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) aarch64 GNU/Linux
root@ae8a9b248736:/go# setarch armv8l -- bash
root@ae8a9b248736:~/git/go/src# uname -a
Linux ae8a9b248736 6.1.0-17-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) armv8l GNU/Linux
root@ae8a9b248736:~/git/go/src# time ./make.bash
Building Go cmd/dist using /usr/local/go. (go1.23rc2 linux/arm)
Building Go toolchain1 using /usr/local/go.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/arm.
---
Installed Go for linux/arm in /root/git/go
Installed commands in /root/git/go/bin
*** You need to add /root/git/go/bin to your PATH.
real 23m3.478s
user 70m9.128s
sys 9m10.303s
root@ae8a9b248736:~/git/go/src# go version
go version devel go1.24-9177e12ccc Wed Aug 7 20:28:11 2024 +0000 linux/arm
Running the "sysinfo" (same as in https://go.dev/cl/604155):
root@ae8a9b248736:~/git/go/src# ./os.test -test.run Sysinfo
--- FAIL: TestSysinfo (0.07s)
os_test.go:1365: OS release: (err: <nil>)
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
os_test.go:1368: uname -a: (err: <nil>)
Linux ae8a9b248736 6.1.0-17-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) armv8l GNU/Linux
os_test.go:1371: mount: (err: <nil>)
overlay on / type overlay (rw,relatime,lowerdir=/home/var_lib_docker/overlay2/l/MD4BV3SQOH77C3Y2ZLW2BT47XL:/home/var_lib_docker/overlay2/l/JYQ4Y6DG33GNLUEKTO7D7HJJQT:/home/var_lib_docker/overlay2/l/GHZEDFQZO4WUXJRTCQOM3SRJVV:/home/var_lib_docker/overlay2/l/FUJVVWVP2J5TU5C2T2QTQNJFSL:/home/var_lib_docker/overlay2/l/RYG6POSETEER6MBTJZQXSMU7FG:/home/var_lib_docker/overlay2/l/JT6HFJ4PTZ6GHMLD4D7TEUTOWZ:/home/var_lib_docker/overlay2/l/7ZO5WOCLALAWBW7UBOGBSMCV4T:/home/var_lib_docker/overlay2/l/RQBJZ3BV5O4TYX4YDSILET4TMY,upperdir=/home/var_lib_docker/overlay2/2919a2336a377c15a11086d1461819de52a61445b060800461362957c3f7ea33/diff,workdir=/home/var_lib_docker/overlay2/2919a2336a377c15a11086d1461819de52a61445b060800461362957c3f7ea33/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k,inode64)
/dev/vdb on /etc/resolv.conf type ext4 (rw,relatime)
/dev/vdb on /etc/hostname type ext4 (rw,relatime)
/dev/vdb on /etc/hosts type ext4 (rw,relatime)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime,inode64)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
tmpfs on /sys/firmware type tmpfs (ro,relatime,inode64)
os_test.go:1373: TMPDIR: "/root/git/go/src/tmp"
os_test.go:1376: newFile: "/root/git/go/src/tmp/_Go_TestSysinfo251614211"
os_test.go:1380: statfs: {Type:2035054128 Bsize:4096 Blocks:2554693 Bfree:1451217 Bavail:1316049 Files:655360 Ffree:550890 Fsid:{X__val:[-718345489 682481083]} Namelen:255 Frsize:4096 Flags:4128 Spare:[0 0 0 0] Pad_cgo_0:[0 0 0 0]} (err: <nil>)
FAIL
Running the test:
root@ae8a9b248736:~/git/go/src# time go test -c os
real 0m14.104s
user 0m21.839s
sys 0m4.966s
root@ae8a9b248736:~/git/go/src# file os.test
os.test: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, BuildID[sha1]=c59e564785d83f86b24c4861f044e12b6d03ff33, for GNU/Linux 3.2.0, with debug_info, not stripped
root@ae8a9b248736:~/git/go/src# ldd os.test
linux-vdso.so.1 (0xf7d4e000)
libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xf7c2f000)
/lib/ld-linux-armhf.so.3 (0xf7d4f000)
root@ae8a9b248736:~/git/go/src# ./os.test -test.run Chtimes -test.count 3 -test.parallel 3
PASS
(the above is a patched version which does 1000 iterations internally, so it should fail)
So, can't reproduce this in the best approximation of environment I can produce locally. Guess I will continue experimenting with https://go.dev/cl/604155.
PS Please excuse my verbosity,
If someone who maintains LUCI runners can take a look at what was changed in environment used by
gotip-linux-arm
, I'd be grateful.
I commented on this in https://github.com/golang/go/issues/68687#issuecomment-2276468795. I suspect that change in the environment is a likely culprit. Sorry for the trouble.
If it's not viable to make the test work without flakes within the constraint of the new environment, at least not without compromising the signal it provides from other builders, it might be best to skip the test on the LUCI linux/arm builders. Similar security software may also run in the background of the LUCI windows/arm64 builder, but I don't expect related changes to any other other LUCI builders that we run.
We need to fix the test regardless. As noted, it currently fails sporadically on my laptop. Shipping a test that fails on some people's machines just encourages bug reports.
I created a linux-arm gomote and verified that on that gomote /tmp is mounted with relatime. So I continue to suspect that that is the root cause. I haven't had time to look at these issues for a couple of days.
We need to fix the test regardless. As noted, it currently fails sporadically on my laptop. Shipping a test that fails on some people's machines just encourages bug reports.
Can you please share some more details about your setup? I tried to reproduce this to no avail.
With the new information from @dmitshur my best guess is someone else is reading the file that the test creates, thus changing its atime. Can it be the case on your laptop?
As I pointed out in https://github.com/golang/go/issues/68687#issuecomment-2276540799, this can probably mitigated by having an empty file.
I created a linux-arm gomote and verified that on that gomote /tmp is mounted with relatime. So I continue to suspect that that is the root cause.
I don't think so.
As I don't have gomote access, I added some code to figure out these details (see TestSysinfo
in https://go-review.googlesource.com/c/go/+/604155?checksPatchset=1). What I found is shown above in https://github.com/golang/go/issues/68663#issuecomment-2274843105.
In particular, the tests are run in environment with no /tmp mounted and TMPDIR set to a directory under /home/swarming/ (meaning the test files are not even created under /tmp
).
Change https://go.dev/cl/604315 mentions this issue: os: fix Chtimes test flakes
Change https://go.dev/cl/604196 mentions this issue: os: fix Chtimes test flakes
Found new dashboard test flakes for:
#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
Change https://go.dev/cl/605375 mentions this issue: [release-branch.go1.22] os: fix Chtimes test flakes
Issue created automatically to collect these failures.
Example (log):
— watchflakes