gradle / gradle

Adaptable, fast automation for all
https://gradle.org
Apache License 2.0
16.58k stars 4.64k forks source link

Add a configuration-cache compatible way to gracefully handle exceptions from providers.exec #23914

Open mcclure opened 1 year ago

mcclure commented 1 year ago

Current Behavior

In the normal course of development we found we had written a build.gradle which succeeds on non-incremental builds but fails on incremental builds. The culprit was a providers.exec inside of a trycatch which, on failure, was failing the entire build instead of properly being caught by the catch (but only on incremental builds).

~We were able to fix this fully by adding executionResult.rethrowFailure() to the providers.exec, so we are not currently blocked, but (1) this is very unusual, because under other circumstances this does not seem to be required~ (EDIT: As described below this is simply wrong) (2) a build-tool failure that occurs inconsistently is highly problematic and should be fixed (3) the documentation does not in any clear way suggest this call is needed.

Expected Behavior

Builds should be "idempotent". If a build succeeds, then building a second time should also succeed.

More specifically: providers.exec inside of a trycatch should either always work [catch the failure] or always fail.

Even more specifically: If executionResult.rethrowFailure() is required for providers.exec to throw a failure, then it should always be required, not required in special edge cases and usually skippable.

Context

On open source Android Mastodon client Tusky we use providers.exec in our build.gradle to invoke git and get a git hash to include as part of the version (an operation common to many projects). However sometimes the project is built on systems without command line git (say, Windows) and we want to support these users. After a code refactor (we replaced exec with providers.exec and changed the order in which things were called), I started seeing a sporadic but persistent "could not find git" failure on one such system, even though we were doing what we believed we needed to do to handle the git-not-found case (catch the exception).

I lost a whole day to debugging this, and a major inhibitor was that the failure appeared to be "random". It occurred on some builds but not others and the pattern was not immediately apparent. I did multiple rounds of meaningless fixes where I made a speculative change, did a rebuild, the rebuild succeeded, and I falsely concluded my change was what had fixed the build (in fact, making any change to build.gradle at all would have resulted in a single successful build). Then several minutes later in the middle of something else I'd do another build and the failure would be back. It was very confusing.

Steps to Reproduce

Check out this branch https://github.com/mcclure/Tusky/tree/providers-exec-gradle-bug (git df087bd) of Tusky. (To demonstrate the problem better, the gradle has been modded to call a nonexistent program named "gggiiittt" instead of git, and the trycatch has been modified to catch "Throwable" instead of the usual "Exception"). Open in Android Studio and build variant blueDebug or greenDebug.

On the first build this will succeed but on the second build and subsequent builds it will fail. If you make any modification to build.gradle it will succeed on one additional build but then return to failing. If you do a clean build it will succeed once and then return to failing.

In our testing, the failures occurred with these exact symptoms on two separate Windows systems maintained by different people. We do not know whether the failure characteristics are the same when run on non-Windows OSes.

You do not need Android Studio to reproduce the problem. You can also reproduce it with pure Gradle by checking out the repo and running gradlew assembleGreenDebug twice. Screenshots of this are attached below.

Additional Code

You can see the "patched" version of this branch, with executionResult.rethrowFailure(), at https://github.com/nikclayton/Tusky/tree/patch-2 (git 913c7a7). This build always "succeeds" (IE the failure is properly caught by the try…catch and git hash "unknown" is assigned).

While filing this bug, I attempted to create a minimal example so you would not have to build all of Tusky to reproduce. I was not able to do this. Here is a sample app which takes the minimal activity sample code from "New Project" in Android Studio and adds all lines related to the git sha from Tusky build.gradle, including the product flavors ("green" flavor makes additional use of the git sha), again invoking gggiiittt instead of git: https://github.com/mcclure/android-bug/tree/gradle-incremental (git 23cc1df). Although the relevant build code is apparently the same as in Tusky, this does not reproduce the failure (IE in the example the failure to call gggiiittt is always properly caught, and handled by assigning hash "unknown"). "Real" Tusky must be doing something additional I have not identified which triggers the bug.

Notes

This bug was initially filed on the Android Studio issue tracker https://issuetracker.google.com/issues/268961150 but they closed the bug telling me to refile it on Gradle.

Figuring out what was happening with this bug was made much more difficult than it otherwise would be by #23902 (gradle Github issue tracker) and https://issuetracker.google.com/issues/268596607 (Android Studio issue tracker), both of which I found with this same faulty build.gradle script.

Nik Clayton who found the rethrowFailure fix explained (in informal conversation) that he actually would have expected the failure to be uncatchable, because providers.exec is a "task" and he would expected gradle to build a tree of tasks and by default if any task fails the entire tree fails. But if this is the case, the question remains of why the non-rethrowFailure code "succeeds" for non-incremental builds.

Someone else from the Tusky project hypothesized maybe Gradle is caching individual nodes in the task tree, and on the incremental build it is ignoring the build.gradle step and jumping right to the providers.exec step meaning the try…catch is bypassed. But if this is the case, this does not explain why the minimal example always "succeeds".

It is not clear to me whether it is "expected" for executionResult.rethrowFailure() to be required. But if it is, this should be surfaced better in the documentation. For example here is the documentation for rethrowFailure: https://docs.gradle.org/current/javadoc/org/gradle/process/ExecResult.html#rethrowFailure-- nothing here indicates how vitally important this innocuous function is, nor is it documented here or anywhere I can find what happens if you don't call it. Likewise, the general documentation for gradle tasks https://docs.gradle.org/current/userguide/tutorial_using_tasks.html does not anywhere I find on that page give instructions for special handling of failed tasks or warn that trycatch may be inadequate for some gradle API calls. In other words, this all seems like a clear functional bug to me, but if it is not a functional bug then there is a documentation bug; it minimum, the rethrowFailure docs should explain its intended use better. (EDIT: As described below executionResult.rethrowFailure was a red herring, but the documentation should still explain any pitfalls around exec better)

Your Environment

Build: AI-221.6008.13.2211.9514443, 202301210135, 

AI-221.6008.13.2211.9514443, JRE 11.0.15+0-b2043.56-9505619x64 JetBrains s.r.o., OS Windows 10(amd64) v10.0 , screens 1920.0x1080.0, 1024.0x600.0

AS: Electric Eel | 2022.1.1 Patch 1
Kotlin plugin: 221-1.8.0-release-for-android-studio-AS5591.52
Android Gradle Plugin: 7.4.1
Gradle: 7.6
Gradle JDK: JetBrains Runtime version 11.0.15
NDK: from local.properties: (not specified), latest from SDK: 21.0.6113669
CMake: from local.properties: (not specified), latest from SDK: 3.6.0-rc2, from PATH: (not found)

I was not able to collect a --scan. Running gradlew assembleGreenDebug --scan caused it to just print the same build failure, along with the message "Run with --scan to get full insights".

Screenshots

Success (first build)

image

Failure (second build, same window scrolled down)

image

mlopatkin commented 1 year ago

Thank you for the detailed report. The "weird" behavior you see is how the configuration caching works in Gradle. Long story short, when you use providers.exec, Gradle remembers the output of the external process; on the next run, Gradle re-runs just the process and compares the output with the stored one. If the outputs match, Gradle is able to skip a big chuck of work, making the build faster. However, in your case, it doesn't work properly, because the process cannot be started at all, and Gradle doesn't handle this use case well. When you run the "cold" build, try ... catch swallows the exception, but on the next run, Gradle re-runs the process internally to determine if the cache is still valid, and your try ... catch is not available there. I agree that Gradle should do a better job in this case.

I suspect that the "fix" of adding executionResult.rethrowFailure() breaks the happy path, because you get an exception from Groovy: groovy.lang.MissingPropertyException: Could not get unknown property 'executionResult' for object of type org.gradle.api.internal.provider.sources.process.ProviderCompatibleExecSpec.. There is no executionResult in that scope. It "fixes" your problem by not starting the process at all - the exception is thrown while preparing the process parameters.

As a workaround that you can apply immediately, I suggest using the custom ValueSource with ExecOperations, where you have more control:

import org.gradle.api.provider.ValueSourceParameters

abstract class GitShaValueSource implements ValueSource<String, ValueSourceParameters.None> {
  @Inject abstract ExecOperations getExecOperations()

  @Override String obtain() {
    try {
      def output = new ByteArrayOutputStream()

      execOperations.exec {
        it.commandLine 'git', 'rev-parse', 'HEAD'
        it.standardOutput = output
      }
      return output.toString().trim()
    } catch (GradleException ex) {
    }
    return "unknown"
  }
}

// For constructing gitSha only
final def gitSha = providers.of(GitShaValueSource) {}.get()

Gradle will call the obtain method when determining if the configuration cache should be discarded.

mcclure commented 1 year ago

@mlopatkin, thank you very much for the help. Your comment "I suspect that the "fix" of adding executionResult.rethrowFailure() breaks the happy path" was prescient; our nightly build last night, on a server where git works perfectly well, came out with git hash "unknown". *_* I have reformed your code as a PR in our project.

As far as Gradle goes, it seems to me that the basic thing we are trying to do here (get a single string from a command line tool, with failure tolerance) is a pretty simple and common thing for a build script to do. In fact, very many build scripts will be doing the same specific thing we are doing (extracting hash from git). So in my opinion the simplest way to do this thing being 27 lines long is a little alarming. Your code makes sense now that I read it, but I wouldn't have been able to write it on my own and in fact I think most Android programmers could not have reproduced this code without assistance. I don't know exactly what my preferred interface would be, but ideally it would be simpler and more discoverable than this.

Do you mind if I ask— what is the difference between exec, providers.exec, and execOperations.exec? If there is a piece of documentation that would explain I'd be happy to read it but I am not finding such documentation in an obvious place. Even just knowing what import these symbols came from would help.

mlopatkin commented 1 year ago

I think we have two issues to address here:

Do you mind if I ask— what is the difference between exec, providers.exec, and execOperations.exec? If there is a piece of documentation that would explain I'd be happy to read it but I am not finding such documentation in an obvious place. Even just knowing what import these symbols came from would help.

There are three doc chapters covering the example I provided:

  1. Build lifecycle gives an overview of configuration and execution phases
  2. Configuration cache explains how Gradle caches the build configuration and why you have to write build scripts in a particular way to benefit from it. In particular, this doc covers the use of ValueSource.
  3. Service injection provides some info on how @Inject magic works.

As for the execution APIs:

  1. exec corresponds to Project.exec and is going to be deprecated, because it isn't compatible with the configuration cache.
  2. ExecOperations.exec is a modern replacement for the above, intended to be injected into tasks and ValueSources. It doesn't require Project, so it is compatible with the configuration cache.

These APIs cannot be used at configuration phase (well, except inside an implementation of ValueSource.obtain, where you can do pretty much whatever you want). because Gradle has not enough insight on how the result of the run is going to be used inside the build script, how it is going to affect the configuration. Without this, Gradle cannot reliably cache the configuration. The same applies to the Java and Groovy APIs that start processes (ProcessBuilder and others). That's why providers.exec was introduced. With it Gradle knows, "ok, I'll re-run the command, grab the output, compare it to the output from the previous run, and discard the configuration cache if the outputs don't match".

There is also other use: you can set Provider returned by providers.exec to a Property of some task, thus making the output of the program an input to the task, without making the output part of the configuration inputs. This way, Gradle doesn't need to throw away configuration cache when the output changes, it just re-runs the task. Of course, you must not get() the Provider at the configuration time for this to work.

mcclure commented 1 year ago

Thank you for the explanations.

A ValueSource that throws inside the try-catch block… Most likely, we're going to discard the cache if this happens.

Hm, this is problematic because in the case I'm interested in our exec fails permanently (IE, in certain configurations it is expected to fail, and "the exec failed" is useful information we wish to propagate upward). So if the configuration cache clears every time the exec throws (IE: every build), this means our build is slowed down because there is never a cache. Or is the idea that we are protected from this because we now call the execOperations.exec inside of an obtain() (and catch the exception before it escapes the obtain())?

mlopatkin commented 1 year ago

Or is the idea that we are protected from this because we now call the execOperations.exec inside of an obtain() (and catch the exception before it escapes the obtain())?

Sorry for being unclear, I meant try ... catch wrapping get() method call on the provider made of a ValueSource, like you had before. If the exception doesn't escape obtain(), you're safe; Gradle only cares about the return value in this case.

mcclure commented 1 year ago

Or is the idea that we are protected from this because we now call the execOperations.exec inside of an obtain() (and catch the exception before it escapes the obtain())?

Sorry for being unclear, I meant try ... catch wrapping get() method call on the provider made of a ValueSource, like you had before. If the exception doesn't escape obtain(), you're safe; Gradle only cares about the return value in this case.

Got it, thanks.

A thought, because I do not think I would have known to do any of this if you had not explained it to me, maybe you should consider detecting when this has happened and adding a warning like "Discarded configuration cache, wrap this in a ValueSource for improved performance". Though I don't know if that would be disruptive.