Update Java SDK retry options for poll operations to match Core SDK.

chronos-tachyon commented 4 months ago

What was changed

Update default initial interval for non-poll retries to 100ms (from 50ms)
Update default maximum interval for non-poll retries to 5s (was 6s)
Update default backoff factor for non-poll retries to 1.7 (was 2.0)
Update default jitter factor for non-poll retries to 20% (was 10%)
Update default initial interval for poll retries to 200ms (was 50ms)
Update default maximum interval for poll retries to 10s (was 60s)
Update default backoff factor for poll retries to 2.0 (was 1.2)
Update default jitter factor for poll retries to 20% (was 10%)
Code cleanups in argument validation code
Default maximum interval is now computed from initial interval as documented (was hardcoded at 6s)

Why?

These changes bring the Java SDK's RPC retry options in line with those of the Core SDK, as the latter have been more carefully reasoned through... with one exception.

The Java poller's existing backoff factor of 1.2 was wildly inappropriate, causing the first 10 retries to hammer the server within less than 1.5 seconds. Even more confusingly, Java's maximum interval was far, far too long, as a 60s delay between the server becoming available and the worker noticing that fact is unreasonable and unnecessary. The poller now uses a 2.0 backoff and 10s cap, exactly in line with Core SDK.

Regarding Java's non-poll RPCs, OTOH, 2.0 is not as aggressive as Core SDK's choice of 1.5, and yet Core is actually in the wrong here: all 10 permitted retries can complete within the 10-second window expected for namespace migrations in Temporal Cloud. 1.7 seems to be the ideal number here, as it leads to a cumulative delay window of 11.8 to 17.8 seconds. I've used 1.7 here, and Core SDK should be updated to use it as well.

Quinn-With-Two-Ns commented 4 months ago

Will need to remember to put a warning in the SDK release notes about changing the defaults

dandavison commented 4 months ago

I appreciate the explanation of this PR in the commit message! After this PR are there any differences between Java tuning and core SDKs, or does this move it to be basically identical?

chronos-tachyon commented 4 months ago

I appreciate the explanation of this PR in the commit message! After this PR are there any differences between Java tuning and core SDKs, or does this move it to be basically identical?

There are some differences if you supply some values but leave others as defaults, and AFAICT Core SDK does not seem to have a concept of throttle/congestion for non-poll RPCs, but the behavior is much, much more similar.

temporalio / sdk-java

Update Java SDK retry options for poll operations to match Core SDK. #1989

What was changed

Why?