temporalio / sdk-java

Temporal Java SDK
https://temporal.io
Apache License 2.0
200 stars 134 forks source link

Update Java SDK retry options for poll operations to match Core SDK. #1989

Closed chronos-tachyon closed 4 months ago

chronos-tachyon commented 4 months ago

What was changed

Why?

These changes bring the Java SDK's RPC retry options in line with those of the Core SDK, as the latter have been more carefully reasoned through... with one exception.

The Java poller's existing backoff factor of 1.2 was wildly inappropriate, causing the first 10 retries to hammer the server within less than 1.5 seconds. Even more confusingly, Java's maximum interval was far, far too long, as a 60s delay between the server becoming available and the worker noticing that fact is unreasonable and unnecessary. The poller now uses a 2.0 backoff and 10s cap, exactly in line with Core SDK.

Regarding Java's non-poll RPCs, OTOH, 2.0 is not as aggressive as Core SDK's choice of 1.5, and yet Core is actually in the wrong here: all 10 permitted retries can complete within the 10-second window expected for namespace migrations in Temporal Cloud. 1.7 seems to be the ideal number here, as it leads to a cumulative delay window of 11.8 to 17.8 seconds. I've used 1.7 here, and Core SDK should be updated to use it as well.

Quinn-With-Two-Ns commented 4 months ago

Will need to remember to put a warning in the SDK release notes about changing the defaults

dandavison commented 4 months ago

I appreciate the explanation of this PR in the commit message! After this PR are there any differences between Java tuning and core SDKs, or does this move it to be basically identical?

chronos-tachyon commented 4 months ago

I appreciate the explanation of this PR in the commit message! After this PR are there any differences between Java tuning and core SDKs, or does this move it to be basically identical?

There are some differences if you supply some values but leave others as defaults, and AFAICT Core SDK does not seem to have a concept of throttle/congestion for non-poll RPCs, but the behavior is much, much more similar.