Configurable strategy for handling remote build cache errors

joshfriend commented 10 months ago

Expected Behavior

If a build cache error is thrown and the --continue parameter is not given, the build should stop immediately and report the build cache error as the cause of the failure.

Current Behavior (optional)

If a build cache error is thrown, the cache is disabled and the build continues without any cache hits from that source, which can make builds slower.

Context

Our build cache requires auth via our corporate SSO, which expires every 12h. There is also a bug in the auth system that can cause failures even when properly logged in, though this seems relatively infrequent. Any of these errors means the cache gets disabled for the build, which can then take a lot longer than usual without the user noticing if they aren't paying attention to the build messages.

Preferred solution would be to turn this into a hard build failure if --continue is not passed, but I understand that may be too big of a behavioral change. Any method of turning silent build cache errors into build failures would be acceptable.

lptr commented 9 months ago

I think being able to make remote cache failures a hard failure is a good addition. We’ve been going back and forth on this for a long time now. There are teams who have problematic network connections plagued with transient errors on one end of the spectrum, and folks who suffer from hidden cache disablement due to permanent problems on the other. It’s hard to find a good default for everyone, and perhaps the current behavior is not even a good compromise. But making the behavior configurable makes sense.

I wouldn’t tie this to --continue, though. That feature is about functional failures, i.e. tests and whatnot breaking. This on the other hand is about build infra failing.

The way I can imagine this to work is to pass a system property like org.gradle.caching.remote.errors=[FAIL|SKIP|RETRY|IGNORE] or something. The options meaning:

FAIL is the hard failure you are asking for here
SKIP would be the current behavior where the remote cache would be skipped for the remainder of the build (🤔 maybe we could also have an option to disable the remote cache for a set amount of time, even if it spans multiple builds? but I digress)
RETRY I'm not so sure about; we had that in the past, but it can easily backfire. Say, you have a laggy connection while working from a bus on the way to work, or on a plane, and instead of helping you, the remote cache ends up wasting even more of your time than if it just switched off after the first problem.
IGNORE could be an option to just skip the failed cache operation, but keep using the remote cache for later cache operations in the same build.

This is just a quick brainstorm of what options we could support; they don't necessarily all make sense.
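
To make the four options a bit more concrete, here is a rough Java sketch of how the proposed property value could map to a reaction when a remote cache operation fails. Only the property name org.gradle.caching.remote.errors and the four option names come from the suggestion above. The class, enum, and method names are made up for illustration and are not part of any Gradle API, and defaulting to SKIP when the property is unset is an assumption based on SKIP being described as the current behavior.

```java
// Hypothetical sketch: none of these types exist in Gradle.
class RemoteCacheErrors {

    /** The four strategies from the brainstorm above. */
    enum ErrorStrategy { FAIL, SKIP, RETRY, IGNORE }

    /** What the build would do after a remote cache operation fails. */
    enum Outcome { FAIL_BUILD, DISABLE_REMOTE_CACHE, RETRY_OPERATION, CONTINUE_WITH_CACHE }

    /** Parses the property value; assumes SKIP (the current behavior) when unset. */
    static ErrorStrategy parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return ErrorStrategy.SKIP;
        }
        // valueOf throws IllegalArgumentException for unknown option names.
        return ErrorStrategy.valueOf(value.trim().toUpperCase(java.util.Locale.ROOT));
    }

    /** Maps each strategy to the reaction described in the list above. */
    static Outcome onError(ErrorStrategy strategy) {
        switch (strategy) {
            case FAIL:
                return Outcome.FAIL_BUILD;           // hard build failure
            case SKIP:
                return Outcome.DISABLE_REMOTE_CACHE; // skip remote cache for the rest of the build
            case RETRY:
                return Outcome.RETRY_OPERATION;      // try the same operation again
            case IGNORE:
                return Outcome.CONTINUE_WITH_CACHE;  // drop this operation, keep using the cache
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        ErrorStrategy strategy = parse(System.getProperty("org.gradle.caching.remote.errors"));
        System.out.println(strategy + " -> " + onError(strategy));
    }
}
```

Under these assumptions, running the sketch with -Dorg.gradle.caching.remote.errors=FAIL would select the hard-failure reaction, and leaving the property unset would fall back to SKIP.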

And there's also the question about how much it is worth investing in supporting caching over bad quality or problematic network architecture, which is bound to degrade cache performance anyway, making it questionable if caching over such infrastructure is even worth it... 🤔

joshfriend commented 9 months ago

I like your suggestions, and we would probably use the FAIL option for local builds. We occasionally have failures in CI due to S3 rate limits, so the RETRY option could be useful, but it can also be implemented by the cache plugin without much difficulty.

renjfk commented 5 months ago

Having more options would indeed be useful for us, but maybe somewhere more appropriate, e.g. org.gradle.caching.configuration.BuildCacheConfiguration instead of a system property, for the sake of type safety. We definitely need an option to retry failed cache fetches. However, how about making the configuration more flexible so that we could have a policy like retry up to 3 times and then fail?
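
A "retry up to 3 times and then fail" policy could look roughly like the Java sketch below. RetryThenFail, CacheOperation, and RemoteCacheFailure are hypothetical names, not Gradle types, and the sketch reads "3 times" as three attempts in total, which is an assumption. It also retries immediately with no delay or backoff, which a real implementation would likely want to revisit given the laggy-connection concern raised earlier in the thread.

```java
// Hypothetical sketch: these types are illustrative and not part of Gradle.
class RetryThenFail {

    /** A remote cache operation (for example a fetch) that may fail. */
    interface CacheOperation<T> {
        T run() throws Exception;
    }

    /** Thrown once all attempts are used up; a real integration would fail the build here. */
    static class RemoteCacheFailure extends RuntimeException {
        RemoteCacheFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final int maxAttempts;

    /** maxAttempts is the total number of tries, including the first one (assumption). */
    RetryThenFail(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /** Runs the operation until it succeeds or the attempts are exhausted. */
    <T> T execute(CacheOperation<T> operation) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.run();
            } catch (Exception e) {
                last = e; // remember the most recent failure and try again
            }
        }
        throw new RemoteCacheFailure(
                "Remote cache operation failed after " + maxAttempts + " attempts", last);
    }
}
```

With something like new RetryThenFail(3).execute(...) wrapped around a fetch, two transient errors followed by a success would still yield a cache hit, while three failures in a row would surface as a hard error instead of silently disabling the cache.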
