golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.37k stars 17.71k forks source link

x/build/cmd/relui: advisory TryBots can handle or avoid timeouts better #60443

Open dmitshur opened 1 year ago

dmitshur commented 1 year ago

Advisory TryBots are causing much less friction compared to previously now that #57725 has been resolved. Filing this for some followup improvements that we can apply to how they handle timeouts.

Consider the following logs from linux-amd64-longtest-race and darwin-amd64-12_0 advisory trybot runs that timed out after all 3 tries:

Timeout after one slow package test
``` 2023/04/26 15:32:06 ======== Trybot Attempt 1 of 3 ======== 2023/04/26 15:32:06 Creating buildlet linux-amd64-longtest-race. 2023/04/26 15:35:26 Buildlet ready. 2023/04/26 15:35:26 Pushing source to buildlet. 2023/04/26 15:36:09 Installing go1.4. 2023/04/26 15:36:20 Testing 2023/04/26 15:36:46 Building Go cmd/dist using /workdir/go1.4. (go1.17.13 linux/amd64) Building Go toolchain1 using /workdir/go1.4. 2023/04/26 15:36:56 Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1. 2023/04/26 15:37:06 Building Go toolchain2 using go_bootstrap and Go toolchain1. 2023/04/26 15:37:26 Building Go toolchain3 using go_bootstrap and Go toolchain2. 2023/04/26 15:37:56 Building packages and commands for linux/amd64. 2023/04/26 15:38:46 ##### Test execution environment. # GOARCH: amd64 # CPU: Intel(R) Xeon(R) CPU @ 2.20GHz # GOOS: linux # OS Version: Linux 5.15.89+ #1 SMP Fri Apr 14 09:13:06 UTC 2023 x86_64 ##### Testing packages. 2023/04/26 15:38:56 ok archive/tar 1.717s 2023/04/26 15:42:26 ok archive/zip 218.491s ok bufio 0.497s ok bytes 18.052s ok compress/bzip2 0.551s 2023/04/26 15:44:16 ok compress/flate 319.931s ok compress/gzip 24.677s ok compress/lzw 1.146s ok compress/zlib 17.343s ok container/heap 0.052s ok container/list 0.068s ok container/ring 0.068s ok context 0.077s ok crypto 0.057s ok crypto/aes 0.108s ok crypto/cipher 2.234s ok crypto/des 0.094s ok crypto/dsa 15.913s ok crypto/ecdh 1.304s ok crypto/ecdsa 7.967s ok crypto/ed25519 2.225s ok crypto/elliptic 3.493s ok crypto/hmac 0.102s ok crypto/internal/alias 0.076s ok crypto/internal/bigmod 4.823s ok crypto/internal/boring 0.105s ok crypto/internal/boring/bcache 0.916s ok crypto/internal/edwards25519 76.131s ok crypto/internal/edwards25519/field 28.774s ok crypto/internal/nistec 53.465s ok crypto/internal/nistec/fiat 0.049s [no tests to run] ok crypto/md5 0.099s ok crypto/rand 7.026s ok crypto/rc4 1.032s ok crypto/rsa 15.508s ok crypto/sha1 0.151s ok crypto/sha256 0.110s ok crypto/sha512 0.124s ok crypto/subtle 44.809s ok crypto/tls 9.110s ok crypto/x509 4.257s ok database/sql 1.804s ok database/sql/driver 0.086s ok debug/buildinfo 11.564s ok debug/dwarf 0.185s ok debug/elf 3.727s ok debug/gosym 1.404s ok debug/macho 0.162s ok debug/pe 0.183s ok debug/plan9obj 0.050s ok embed 0.092s [no tests to run] ok embed/internal/embedtest 0.074s ok encoding/ascii85 0.088s ok encoding/asn1 0.087s ok encoding/base32 0.252s ok encoding/base64 0.097s ok encoding/binary 0.064s ok encoding/csv 0.118s ok encoding/gob 35.954s ok encoding/hex 0.061s ok encoding/json 20.127s ok encoding/pem 4.550s ok encoding/xml 81.810s ok errors 0.131s ok expvar 0.123s ok flag 0.428s ok fmt 1.247s ok go/ast 0.054s ok go/build 9.312s ok go/build/constraint 0.072s ok go/constant 0.083s ok go/doc 0.522s ok go/doc/comment 3.843s ok go/format 0.089s ok go/importer 0.947s ok go/internal/gccgoimporter 0.114s ok go/internal/gcimporter 84.599s ok go/internal/srcimporter 31.265s ok go/parser 93.094s ok go/printer 3.377s ok go/scanner 0.080s ok go/token 0.317s ok go/types 254.439s ok hash 0.043s ok hash/adler32 0.159s ok hash/crc32 0.164s ok hash/crc64 0.085s ok hash/fnv 0.064s ok hash/maphash 0.053s ok html 0.073s ok html/template 1.341s ok image 3.077s ok image/color 0.112s ok image/draw 1.355s ok image/gif 197.721s ok image/jpeg 47.669s ok image/png 11.062s 2023/04/26 15:54:16 testing failed: Command go/src/race.bash [] failed: rpc error: code = Canceled desc = context canceled 2023/04/26 15:54:16 ======== Trybot Attempt 2 of 3 ======== 2023/04/26 15:54:16 Creating buildlet linux-amd64-longtest-race. 2023/04/26 15:54:16 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled 2023/04/26 15:54:17 ======== Trybot Attempt 3 of 3 ======== 2023/04/26 15:54:17 Creating buildlet linux-amd64-longtest-race. 2023/04/26 15:54:17 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled 2023/04/26 15:54:17 Advisory TryBot timed out or was canceled {"Name": "linux-amd64-longtest-race", "Passed": false} ```
Timeout after one slow buildlet creation
``` 2023/04/26 15:32:06 ======== Trybot Attempt 1 of 3 ======== 2023/04/26 15:32:06 Creating buildlet darwin-amd64-12_0. 2023/04/26 15:42:06 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled 2023/04/26 15:42:06 ======== Trybot Attempt 2 of 3 ======== 2023/04/26 15:42:06 Creating buildlet darwin-amd64-12_0. 2023/04/26 15:42:06 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled 2023/04/26 15:42:07 ======== Trybot Attempt 3 of 3 ======== 2023/04/26 15:42:07 Creating buildlet darwin-amd64-12_0. 2023/04/26 15:42:07 Trybot Attempt failed: rpc error: code = Canceled desc = context canceled 2023/04/26 15:42:07 Advisory TryBot timed out or was canceled {"Name": "darwin-amd64-12_0", "Passed": false} ```

There are at least two areas for improvement visible.

First, it timed out because the single "index/suffixarray" tests happen to sometimes take longer than the default watchdog delay (currently 10 minutes), and so the watchdog mechanism stopped it. This can be avoided by injecting our own watchdog reset every 5 minutes as we do in some other tasks that run for a long time. (Increasing the watchdog delay is also a viable workaround.)

Second, the trybot attempts 2 and 3 don't end up actually trying because they start out with a canceled context. So there's effectively only one attempt. Issue #60444 may be relevant to this.

Filing for tracking purposes; it's not a high priority. CC @golang/release.

dmitshur commented 1 year ago

CC @prattmic.

gopherbot commented 1 year ago

Change https://go.dev/cl/503516 mentions this issue: internal/releasetargets: revert "set LongTestBuilder for darwin-amd64"

gopherbot commented 1 year ago

Change https://go.dev/cl/504523 mentions this issue: internal/{task,workflow}: scale watchdog delay by GO_TEST_TIMEOUT_SCALE

dmitshur commented 1 year ago

First, it timed out because the single "index/suffixarray" tests happen to sometimes take longer [...] (Increasing the watchdog delay is also a viable workaround.)

It's interesting to check if this case still happens with any regularity by now. It might not be since we started scaling the watchdog delay on longtest builders (CL 504523).

gopherbot commented 1 year ago

Change https://go.dev/cl/541380 mentions this issue: internal/relui: simplify advisory tests and don't accept timeouts