golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.61k stars 17.61k forks source link

os: TestChtimesOmit failures #68663

Closed gopherbot closed 2 months ago

gopherbot commented 2 months ago
#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"

Issue created automatically to collect these failures.

Example (log):

=== RUN   TestChtimesOmit
=== PAUSE TestChtimesOmit
=== CONT  TestChtimesOmit
    os_test.go:1395: omit atime: true, mtime: false
    os_test.go:1395: omit atime: false, mtime: true
    os_test.go:1475: atime mismatch, got: "2024-07-30 18:42:03.450932494 +0000 UTC", want: "2024-07-30 18:42:02.450932494 +0000 UTC"
    os_test.go:1395: omit atime: true, mtime: true
    os_test.go:1395: omit atime: false, mtime: false
--- FAIL: TestChtimesOmit (0.00s)

watchflakes

gopherbot commented 2 months ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
2024-07-30 14:19 gotip-linux-arm go@966688a4 os.TestChtimesOmit (log) === RUN TestChtimesOmit === PAUSE TestChtimesOmit === CONT TestChtimesOmit os_test.go:1395: omit atime: true, mtime: false os_test.go:1395: omit atime: false, mtime: true os_test.go:1475: atime mismatch, got: "2024-07-30 18:42:03.450932494 +0000 UTC", want: "2024-07-30 18:42:02.450932494 +0000 UTC" os_test.go:1395: omit atime: true, mtime: true os_test.go:1395: omit atime: false, mtime: false --- FAIL: TestChtimesOmit (0.00s)

watchflakes

gopherbot commented 2 months ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
2024-08-01 00:14 gotip-linux-arm go@ecadbe9f os.TestChtimesOmit (log) === RUN TestChtimesOmit === PAUSE TestChtimesOmit === CONT TestChtimesOmit os_test.go:1395: omit atime: true, mtime: false os_test.go:1395: omit atime: false, mtime: true os_test.go:1475: atime mismatch, got: "2024-08-01 01:20:57.048078216 +0000 UTC", want: "2024-08-01 01:20:56.048078216 +0000 UTC" os_test.go:1395: omit atime: true, mtime: true os_test.go:1395: omit atime: false, mtime: false --- FAIL: TestChtimesOmit (0.01s)

watchflakes

gopherbot commented 2 months ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
2024-08-01 21:32 gotip-linux-arm go@67c3f012 os.TestChtimesOmit (log) === RUN TestChtimesOmit === PAUSE TestChtimesOmit === CONT TestChtimesOmit os_test.go:1395: omit atime: true, mtime: false os_test.go:1395: omit atime: false, mtime: true os_test.go:1475: atime mismatch, got: "2024-08-01 22:09:07.163843323 +0000 UTC", want: "2024-08-01 22:09:06.159843352 +0000 UTC" os_test.go:1395: omit atime: true, mtime: true os_test.go:1395: omit atime: false, mtime: false --- FAIL: TestChtimesOmit (0.00s)

watchflakes

cherrymui commented 2 months ago

TestChtimesOmit seems flaky on linux-arm.

cc @kolyshkin @ianlancetaylor

ianlancetaylor commented 2 months ago

It's not just linux-arm, it's flaky on my amd64 laptop as well.

> go test -test.run=TestChtimesOmit -test.count=100 os
--- FAIL: TestChtimesOmit (0.00s)
    os_test.go:1395: omit atime: true, mtime: false
    os_test.go:1395: omit atime: false, mtime: true
    os_test.go:1395: omit atime: true, mtime: true
    os_test.go:1395: omit atime: false, mtime: false
    os_test.go:1475: atime mismatch, got: "2024-08-06 16:19:26.556048593 -0700 PDT", want: "2024-08-06 16:19:25.556048593 -0700 PDT"
--- FAIL: TestChtimesOmit (0.00s)
    os_test.go:1395: omit atime: true, mtime: false
    os_test.go:1395: omit atime: false, mtime: true
    os_test.go:1395: omit atime: true, mtime: true
    os_test.go:1395: omit atime: false, mtime: false
    os_test.go:1475: atime mismatch, got: "2024-08-06 16:19:26.572048741 -0700 PDT", want: "2024-08-06 16:19:25.572048741 -0700 PDT"
FAIL
FAIL    os  0.043s
FAIL

In all cases the desired atime appears to be one second earlier than the measured atime. Perhaps calling utimensat will update the atime, but only on a one second offset.

ianlancetaylor commented 2 months ago

Here is a failure shown by strace:

newfstatat(AT_FDCWD, "/tmp/_Go_TestChtimesOmit3451782255", {st_dev=makedev(0xfd, 0x1), st_ino=113514259, st_mode=S_IFREG|0600, st_nlink=1, st_uid=17684, st_gid=89939, st_blksize=4096, st_blocks=8, st_size=13, 
st_atime=1722990239 /* 2024-08-06T17:23:59.721116372-0700 */, st_atime_nsec=721116372, 
st_mtime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_mtime_nsec=725116236, 
st_ctime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_ctime_nsec=725116236}, 0) = 0

utimensat(AT_FDCWD, "/tmp/_Go_TestChtimesOmit3451782255", [UTIME_OMIT, 
{tv_sec=1722990238, tv_nsec=725116236} /* 2024-08-06T17:23:58.725116236-0700 */], 0) = 0

newfstatat(AT_FDCWD, "/tmp/_Go_TestChtimesOmit3451782255", {st_dev=makedev(0xfd, 0x1), st_ino=113514259, st_mode=S_IFREG|0600, st_nlink=1, st_uid=17684, st_gid=89939, st_blksize=4096, st_blocks=8, st_size=13, 
st_atime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_atime_nsec=725116236, 
st_mtime=1722990238 /* 2024-08-06T17:23:58.725116236-0700 */, st_mtime_nsec=725116236, 
st_ctime=1722990239 /* 2024-08-06T17:23:59.725116236-0700 */, st_ctime_nsec=725116236}, 0) = 0

...

write(1, "--- FAIL: TestChtimesOmit (0.00s"..., 375--- FAIL: TestChtimesOmit (0.00s)
    os_test.go:1395: omit atime: true, mtime: false
    os_test.go:1475: atime mismatch, got: "2024-08-06 17:23:59.725116236 -0700 PDT", want: "2024-08-06 17:23:59.721116372 -0700 PDT"
    os_test.go:1395: omit atime: false, mtime: true
    os_test.go:1395: omit atime: true, mtime: true
    os_test.go:1395: omit atime: false, mtime: false
) = 375

In this case it's actually not a one second difference. What we see here is that the st_atime_nsec field changes to the nsec value passed for mtime in the call to utimensat. The st_atime seconds field is left unchanged.

Another possibility is that the atime is being set to the ctime, or the original mtime.

Looking at the output of mount, I see that the disk I am using is mounted with the relatime option. It appears that relatime means when the file is updated, the atime is updated also if it precedes the mtime or ctime. That is indeed the case in the first state call. So it appears that in this case the utimensat is updating the atime as well as the mtime, so the test fails.

Looking at some of the passing cases from the same strace run, in those cases the atime is after the mtime in the first stat, so it doesn't get updated.

And relatime appears to be unpredictable. We are deliberately setting atime one second earlier, which means that it precedes ctime. It seems that in some cases relatime kicks in and updates atime to the ctime, causing the test to fail. At least, that is my best guess.

gopherbot commented 2 months ago

Change https://go.dev/cl/603404 mentions this issue: os: avoid atime confusion in TestChtimesOmit on Linux relatime

kolyshkin commented 2 months ago

I was unable to reproduce it locally (kernel 6.9.7-200.fc40.x86_64).

By default (if $TMPDIR is not set) these tests are run using files under /tmp (which is not mounted with relatime on my system). Out of curiosity, I also tried running this on ext4 (which is mounted with relatime), as well as remounting /tmp with relatime option. No luck in reproducing (ran with -test.count=1000000 for each setup).

I also tried a (much older) kernel 3.10.0-1160.118.1.el7.x86_64 inside a VM, with all the options listed above.

Given into account that tests are run on /tmp (unless TMPDIR is set) and I think it is not usually mounted with relatime, I think the issue might not be related to relatime.

I will take a fresh look tomorrow.

kolyshkin commented 2 months ago

Also, looking into the failure above (https://github.com/golang/go/issues/68663#issuecomment-2272324260)

os_test.go:1395: omit atime: false, mtime: false

This test case (when both omit* values are false) is doing roughly the same as TestChtimes (which also decrements times reported by stat by 1 second, sets the new times and checks they went backwards, instead of checking for a specific timestamp).

Looking at the failure:

os_test.go:1475: atime mismatch, got: "2024-08-06 16:19:26.572048741 -0700 PDT", want: "2024-08-06 16:19:25.572048741 -0700 PDT"

I would assume that TestChtimes will fail similarly, but it seems it never fails in LUCI (and there are no special cases for linux or mount options in it).

kolyshkin commented 2 months ago

I would assume that TestChtimes will fail similarly, but it seems it never fails in LUCI

I made the above statement foolishly without looking. At least my prediction that it would fail too is correct -- that test just started failing, too (#68687), at around the same time as this one, and also on arm!

Note that TestChtimes (unlike the failed test in this issue) hasn't changed for years (except for some cosmetic changes in go.dev/cl/594075), so it's not possible to attribute the new failures to recent changes in test code.

Currently my best guess is there was some change in the test environment (gotip-linux-arm), done on or before 2024-07-30, which resulted in this failure.

kolyshkin commented 2 months ago

Probably just a coincidence, but failures started when release-branch.go1.23 was created off of commit 966688a43a (this is the first commit with the failure). Maybe a new ARM test env was created or reconfigured around that time.

gopherbot commented 2 months ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
2024-08-07 17:22 gotip-linux-arm go@5a81570b os.TestChtimesOmit (log) === RUN TestChtimesOmit === PAUSE TestChtimesOmit === CONT TestChtimesOmit os_test.go:1395: omit atime: true, mtime: false os_test.go:1395: omit atime: false, mtime: true os_test.go:1395: omit atime: true, mtime: true os_test.go:1395: omit atime: false, mtime: false os_test.go:1475: atime mismatch, got: "2024-08-07 22:54:52.682004315 +0000 UTC", want: "2024-08-07 22:54:51.682004315 +0000 UTC" --- FAIL: TestChtimesOmit (0.00s)
2024-08-07 19:37 gotip-linux-arm go@808edd50 os.TestChtimesOmit (log) === RUN TestChtimesOmit === PAUSE TestChtimesOmit === CONT TestChtimesOmit os_test.go:1395: omit atime: true, mtime: false os_test.go:1395: omit atime: false, mtime: true os_test.go:1395: omit atime: true, mtime: true os_test.go:1395: omit atime: false, mtime: false os_test.go:1475: atime mismatch, got: "2024-08-07 22:24:58.724473488 +0000 UTC", want: "2024-08-07 22:24:57.724473488 +0000 UTC" --- FAIL: TestChtimesOmit (0.00s)

watchflakes

kolyshkin commented 2 months ago

If someone who maintains LUCI runners can take a look at what was changed in environment used by gotip-linux-arm, I'd be grateful.

Still can't reproduce this locally, and am stuck as of what to do about it. I read up on gomote but it seems it requires access I don't have (filed #68779).

Can't think of anything else but opening a "debug" CL and trying to reproduce it that way.

gopherbot commented 2 months ago

Change https://go.dev/cl/604155 mentions this issue: [DO NOT MERGE] test chtimes failures on gotip-linux-arm

kolyshkin commented 2 months ago

Obtained some info about the testing environment in which it fails:

Also, from 1000 iterations of TestChtimesOmit, we got 115 failures, with all possible values for omitAtime and omitMtime, but about 90% of cases are with omitAtime == false:

$ grep -B1 mismatch runs | grep omit | sort | uniq -c
     66     os_test.go:1424: omit atime: false, mtime: false
     37     os_test.go:1424: omit atime: false, mtime: true
      8     os_test.go:1424: omit atime: true, mtime: false
      4     os_test.go:1424: omit atime: true, mtime: true
kolyshkin commented 2 months ago

A few more raw datapoints.

Linux 87493462f40f 6.1.0-17-cloud-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) armv8l GNU/Linux

All that can be obtained from https://go.dev/cl/604155

kolyshkin commented 2 months ago

Managed to run arm64 debian 12 under qemu-system-aarch64. Ran docker inside. Ran a 32-bit arm container:

docker run -it --platform linux/arm golang:1.23rc2-bookworm

Once inside the container:

setarch armv8l -- bash
go test -build os
./os.test -test.run Chtimes -test.count 1000

As far as I can see it looks very much like the CI environment on which we see the failures, and I was not able to reproduce the issue.

Trying the exact same kernel now (since it's emulation, everything is running very slow).

kolyshkin commented 2 months ago

Trying with the same kernel as in failing test env, and with the same/similar go version

Building go:

root@ae8a9b248736:/go# uname -a
Linux ae8a9b248736 6.1.0-17-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) aarch64 GNU/Linux

root@ae8a9b248736:/go# setarch armv8l -- bash

root@ae8a9b248736:~/git/go/src# uname -a
Linux ae8a9b248736 6.1.0-17-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) armv8l GNU/Linux

root@ae8a9b248736:~/git/go/src# time ./make.bash 
Building Go cmd/dist using /usr/local/go. (go1.23rc2 linux/arm)
Building Go toolchain1 using /usr/local/go.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/arm.
---
Installed Go for linux/arm in /root/git/go
Installed commands in /root/git/go/bin
*** You need to add /root/git/go/bin to your PATH.

real    23m3.478s
user    70m9.128s
sys 9m10.303s

root@ae8a9b248736:~/git/go/src# go version
go version devel go1.24-9177e12ccc Wed Aug 7 20:28:11 2024 +0000 linux/arm

Running the "sysinfo" (same as in https://go.dev/cl/604155):

root@ae8a9b248736:~/git/go/src# ./os.test -test.run Sysinfo 
--- FAIL: TestSysinfo (0.07s)
    os_test.go:1365: OS release: (err: <nil>)
        PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
        NAME="Debian GNU/Linux"
        VERSION_ID="12"
        VERSION="12 (bookworm)"
        VERSION_CODENAME=bookworm
        ID=debian
        HOME_URL="https://www.debian.org/"
        SUPPORT_URL="https://www.debian.org/support"
        BUG_REPORT_URL="https://bugs.debian.org/"

    os_test.go:1368: uname -a: (err: <nil>)
        Linux ae8a9b248736 6.1.0-17-arm64 #1 SMP Debian 6.1.69-1 (2023-12-30) armv8l GNU/Linux

    os_test.go:1371: mount: (err: <nil>)
        overlay on / type overlay (rw,relatime,lowerdir=/home/var_lib_docker/overlay2/l/MD4BV3SQOH77C3Y2ZLW2BT47XL:/home/var_lib_docker/overlay2/l/JYQ4Y6DG33GNLUEKTO7D7HJJQT:/home/var_lib_docker/overlay2/l/GHZEDFQZO4WUXJRTCQOM3SRJVV:/home/var_lib_docker/overlay2/l/FUJVVWVP2J5TU5C2T2QTQNJFSL:/home/var_lib_docker/overlay2/l/RYG6POSETEER6MBTJZQXSMU7FG:/home/var_lib_docker/overlay2/l/JT6HFJ4PTZ6GHMLD4D7TEUTOWZ:/home/var_lib_docker/overlay2/l/7ZO5WOCLALAWBW7UBOGBSMCV4T:/home/var_lib_docker/overlay2/l/RQBJZ3BV5O4TYX4YDSILET4TMY,upperdir=/home/var_lib_docker/overlay2/2919a2336a377c15a11086d1461819de52a61445b060800461362957c3f7ea33/diff,workdir=/home/var_lib_docker/overlay2/2919a2336a377c15a11086d1461819de52a61445b060800461362957c3f7ea33/work)
        proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
        tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
        devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
        sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
        cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
        mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
        shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k,inode64)
        /dev/vdb on /etc/resolv.conf type ext4 (rw,relatime)
        /dev/vdb on /etc/hostname type ext4 (rw,relatime)
        /dev/vdb on /etc/hosts type ext4 (rw,relatime)
        devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
        proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
        proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
        proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
        proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
        proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
        tmpfs on /proc/acpi type tmpfs (ro,relatime,inode64)
        tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
        tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
        tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
        tmpfs on /sys/firmware type tmpfs (ro,relatime,inode64)

    os_test.go:1373: TMPDIR: "/root/git/go/src/tmp"

    os_test.go:1376: newFile: "/root/git/go/src/tmp/_Go_TestSysinfo251614211"

    os_test.go:1380: statfs: {Type:2035054128 Bsize:4096 Blocks:2554693 Bfree:1451217 Bavail:1316049 Files:655360 Ffree:550890 Fsid:{X__val:[-718345489 682481083]} Namelen:255 Frsize:4096 Flags:4128 Spare:[0 0 0 0] Pad_cgo_0:[0 0 0 0]} (err: <nil>)

FAIL

Running the test:

root@ae8a9b248736:~/git/go/src# time go test -c os 

real    0m14.104s
user    0m21.839s
sys 0m4.966s

root@ae8a9b248736:~/git/go/src# file os.test 
os.test: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, BuildID[sha1]=c59e564785d83f86b24c4861f044e12b6d03ff33, for GNU/Linux 3.2.0, with debug_info, not stripped

root@ae8a9b248736:~/git/go/src# ldd os.test 
    linux-vdso.so.1 (0xf7d4e000)
    libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xf7c2f000)
    /lib/ld-linux-armhf.so.3 (0xf7d4f000)

root@ae8a9b248736:~/git/go/src# ./os.test -test.run Chtimes -test.count 3 -test.parallel 3
PASS

(the above is a patched version which does 1000 iterations internally, so it should fail)

So, can't reproduce this in the best approximation of environment I can produce locally. Guess I will continue experimenting with https://go.dev/cl/604155.

PS Please excuse my verbosity,

dmitshur commented 2 months ago

If someone who maintains LUCI runners can take a look at what was changed in environment used by gotip-linux-arm, I'd be grateful.

I commented on this in https://github.com/golang/go/issues/68687#issuecomment-2276468795. I suspect that change in the environment is a likely culprit. Sorry for the trouble.

If it's not viable to make the test work without flakes within the constraint of the new environment, at least not without compromising the signal it provides from other builders, it might be best to skip the test on the LUCI linux/arm builders. Similar security software may also run in the background of the LUCI windows/arm64 builder, but I don't expect related changes to any other other LUCI builders that we run.

ianlancetaylor commented 2 months ago

We need to fix the test regardless. As noted, it currently fails sporadically on my laptop. Shipping a test that fails on some people's machines just encourages bug reports.

I created a linux-arm gomote and verified that on that gomote /tmp is mounted with relatime. So I continue to suspect that that is the root cause. I haven't had time to look at these issues for a couple of days.

kolyshkin commented 2 months ago

We need to fix the test regardless. As noted, it currently fails sporadically on my laptop. Shipping a test that fails on some people's machines just encourages bug reports.

Can you please share some more details about your setup? I tried to reproduce this to no avail.

With the new information from @dmitshur my best guess is someone else is reading the file that the test creates, thus changing its atime. Can it be the case on your laptop?

As I pointed out in https://github.com/golang/go/issues/68687#issuecomment-2276540799, this can probably mitigated by having an empty file.

kolyshkin commented 2 months ago

I created a linux-arm gomote and verified that on that gomote /tmp is mounted with relatime. So I continue to suspect that that is the root cause.

I don't think so.

As I don't have gomote access, I added some code to figure out these details (see TestSysinfo in https://go-review.googlesource.com/c/go/+/604155?checksPatchset=1). What I found is shown above in https://github.com/golang/go/issues/68663#issuecomment-2274843105.

In particular, the tests are run in environment with no /tmp mounted and TMPDIR set to a directory under /home/swarming/ (meaning the test files are not even created under /tmp).

gopherbot commented 2 months ago

Change https://go.dev/cl/604315 mentions this issue: os: fix Chtimes test flakes

gopherbot commented 2 months ago

Change https://go.dev/cl/604196 mentions this issue: os: fix Chtimes test flakes

gopherbot commented 2 months ago

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "os" && test == "TestChtimesOmit"
2024-08-08 20:06 gotip-linux-arm go@fe392d0d os.TestChtimesOmit (log) === RUN TestChtimesOmit === PAUSE TestChtimesOmit === CONT TestChtimesOmit os_test.go:1395: omit atime: true, mtime: false os_test.go:1395: omit atime: false, mtime: true os_test.go:1395: omit atime: true, mtime: true os_test.go:1395: omit atime: false, mtime: false os_test.go:1475: atime mismatch, got: "2024-08-08 21:10:55.280875318 +0000 UTC", want: "2024-08-08 21:10:54.280875318 +0000 UTC" --- FAIL: TestChtimesOmit (0.01s)

watchflakes

kolyshkin commented 2 months ago

2024-08-08 20:06 gotip-linux-arm go@fe392d0d os.TestChtimesOmit (log)

This commit (fe392d0d) is before the CL 604196 is merged.

gopherbot commented 2 months ago

Change https://go.dev/cl/605375 mentions this issue: [release-branch.go1.22] os: fix Chtimes test flakes