cmd/go: linker repeatedly invoked for cached test results

bcmills commented 2 years ago

Adding the following extra assertions to cmd/go/testdata/script_build_cache_output.txt causes the test to fail at the ! stderr 'link(\.exe"?)? -' assertion:

--- c/src/cmd/go/testdata/script/build_cache_output.txt
+++ w/src/cmd/go/testdata/script/build_cache_output.txt
@@ -52,6 +52,25 @@ stderr '\d+ symbols' # from linker
 ! stderr 'p\.test( |\.exe"?)'
 stdout 'TEST' # from test

+# The gcflags above cause the compiler and linker to print diagnostic output,
+# but if we rebuild without those flags we should end up with a resulting test
+# binary with exactly the same contents, and thus the same ID.
+# The result of running that binary is already cached.
+go test -v -x p
+stdout 'TEST' # from test
+stdout '^ok\s+p\s+\(cached\)$' # from cmd/go
+stderr 'compile( |\.exe"?)'
+stderr 'link(\.exe"?)? -'
+stderr 'p\.test( |\.exe"?)'
+
+# If we run the test one more time, we shouldn't need to compile or link anything,
+# because we know that the ID will be the same.
+go test -v -x p
+stdout 'TEST' # from test
+stdout '^ok\s+p\s+\(cached\)$' # from cmd/go
+! stderr 'compile( |\.exe"?)'
+! stderr 'link(\.exe"?)? -'
+! stderr 'p\.test( |\.exe"?)'

 -- lib.go --
 package p

Note that both ok p (cached) assertions are passing, so this means that under some circumstances we are printing a test's output as “cached” but re-invoking the linker for the test anyway. There is some special-case code in cmd/go/internal/test (added in CL 75631) that attempts to prevent exactly this situation; it isn't yet clear to me why it isn't working.

(CC @matloob)

jproberts commented 2 years ago

@bcmills brought this issue to my attention in CL 376134 and I had a chance to investigate a bit. I still have questions but I think I've made enough progress to share some initial findings. I'm still new to this code so if anything I say seems incorrect, please call it out. Apologies if this is too verbose. I wanted to show my work in case I get anything wrong.

Investigation

I think the motivation for this ticket is the following comment in (*runCache).tryCacheWithID (cmd/go/internal/test/test.go)

// The test cache result fetch is a two-level lookup. // // First, we use the content hash of the test binary // and its command-line arguments to find the // list of environment variables and files consulted // the last time the test was run with those arguments. // (To avoid unnecessary links, we store this entry // under two hashes: id1 uses the linker inputs as a // proxy for the test binary, and id2 uses the actual // test binary. If the linker inputs are unchanged, // this way we avoid the link step, even though we // do not cache link outputs.) // ...

With that expectation, we have the following test cases from cmd/go/testdata/script_build_cache_output.txt. The script is abbreviated by excluding the actual assertions. It includes comments to number the commands for reference later.

# Running a test should run the compiler, linker, and the test the first time.
go test -v -x -gcflags=-m -ldflags=-v p # Command 1
# ... but not the second, even though it still prints the compiler, linker, and test output.
go test -v -x -gcflags=-m -ldflags=-v p # Command 2

### NOTE(jproberts): After this point are tests added by bcmills for Issue 49267 ###

# The gcflags above cause the compiler and linker to print diagnostic output,
# but if we rebuild without those flags we should end up with a resulting test
# binary with exactly the same contents, and thus the same ID.
# The result of running that binary is already cached.
go test -v -x p # Command 3
# If we run the test one more time, we shouldn't need to compile or link anything,
# because we know that the ID will be the same.
go test -v -x p # Command 4

I believe this scenario is failing because test results are only saved to the cache under id1 and id2 after the successful run of a test binary. This can be verified by viewing the logs of the test run and searching for .saveOutput( in the code.

Command 1 is acting on a fresh cache. It misses both levels of the cache lookup, resulting in a compile, link and run of the resulting test binary. The test binary run succeeds, so the test results are saved to the cache under id1 (linker inputs) and id2 (test binary).

Command 2 hits the cache on the linker inputs, since the inputs are the same as Command 1. A hit on linker inputs means we skip invoking the linker and immediately write the cached test output instead.

Command 3 missing the cache on the linker inputs (id1) because they don't include -gcflags or -ldflags. However, it hits on id2 since those flags only affect logging and don't change the actual test binary. A hit on id2 results in the ok p (cached) output and skips the run of the test binary. Since the test binary is not run, the test results are not saved under the new test inputs (id1). I believe this is the root problem of the test scenario provided.

Command 4 misses the cache on linker inputs because Command 3 never saved results into its id1. With that miss, it must invoke the linker, which fails the test. It hits on id2, resulting in ok p (cached) being written to the output again.

Next Steps

I think I've found the cause of this specific test case failure, but I recognize that this ticket may intend to cover broader cases which I haven't considered.

For this single case, I experimented with calling (*runCache).saveOutput within (*runCache).builderRunTest when tryCacheWithID succeeds, The tests pass, since this causes a hit on id2 to save results back to id1. However, my code needs some cleanup before I can file a CL, and I'm not sure this is the best fix point.

It may be worth considering whether tryCacheWithID should save its results when it succeeds. This would cause more frequent calls to saveOutput, which doesn't appear to exit early if it's saving the same data under the same key twice. That could take some work to add, but it may provide a broader and more robust solution going forward.

I hope this information helps. If solving this single case is valuable, I'm happy to take a pass at a fix. I'll wait for feedback from @bcmills or the Go team before doing any more work though. I'd like to help move this issue forward to progress #23565 but I don't mean to overtake an issue that already has an assignee.

TL;DR Test results are only saved to the cache when after a successful test binary run. The test case added in this issue hits the cache on id2 (test binary), so it never runs the test binary and therefore never saves the cached test results under id1 (linker inputs). Since a hit on id2 never saves results to the cache under its linker inputs, subsequent runs of the same command will still invoke the linker. I'm not sure whether this is the only issue with extra linking to retrieve cached results, but I believe it's the cause of the case provided.

prattmic commented 6 months ago

This may or may not be relevant, but while doing some other work in cmd/go/internal/work I noticed that the linkShared action doesn't seem to actually put the output in the build cache: https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=1750;drc=507d1b22f4b58ac68841582d0c2c0ab6b20e5a98. That said, I didn't look too deeply, so I may have missed something.

It does attempt a cache lookup, but if never puts anything in the cache presumably that is always a miss.

golang / go

cmd/go: linker repeatedly invoked for cached test results #49267

Investigation

Next Steps