Closed oliverlockwood closed 1 year ago
Thanks, @oliverlockwood . We'll take a look at it.
Heh! @oliverlockwood, overtook a little 😄 I was in process of writing the same bug request in parallel with you. Found it about a week ago.
The problem is caused by the fact that serviceInstance from the retry context is not set to retrievedServiceInstance, and logic assumes that retrieved service instance is null (absent) and uses original request with originalUri. But originalUri contains FeignClient.url or "http://" + FeignClient.name + ":80", if FeignClient.url is not set. Part of RetryableFeignBlockingLoadBalancerClient
if (context instanceof LoadBalancedRetryContext) {
LoadBalancedRetryContext lbContext = (LoadBalancedRetryContext) context;
ServiceInstance serviceInstance = lbContext.getServiceInstance();
retrievedServiceInstance = serviceInstance; // Should be added!
if (serviceInstance == null) {
...
}
if (retrievedServiceInstance == null) {
...
I'll also leave my sample here, it might be useful: loadbalancer-retry-test.zip
@apikozh great minds :)
If you've got a proposed fix, why not raise it as a PR?
Possible fix in #841, but not tested it yet
For what it's worth, I manually built the fix in #841 locally, and ran my example application using the versions installed from this, and it does appear to me to fix the issue.
I end up with this error log after the retries:
21:11:31.218 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
...
at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
at feign.Client$Default.convertResponse(Client.java:110)
at feign.Client$Default.execute(Client.java:106)
at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
...
which looks as it should do, correctly exposing the underlying error.
Hi @oliverlockwood - thanks for providing the sample and verifying the PR. Makes sense.
Describe the bug
Background
The scenario is:
@FeignClient(name = "downstream-service")
) rather than by URLRetryableFeignBlockingLoadBalancerClient
)Correct behaviour
spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance
is set to 0max-retries-on-next-service-instance
, and then at the end logs an error like this:which, in my opinion, is perfectly reasonable.
Incorrect behaviour (i.e. the bug)
spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance
is set to 1 or highermax-retries-on-same-service-instance
, andmax-retries-on-next-service-instance
, and then at the end logs an error like this:which, in my opinion, is not acceptable. The actual underlying issue is lost, and in its place there's now a spurious, misleading
UnknownHostException
, because for some reason the implementation tries to hit the literal URLhttp://downstream-service/api/something
, abandoning the whole concept of service discovery.Workaround
Always set
spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance
to 0.Affected versions
I have explicitly proven that this bug reproduces on the following versions:
3.0.3
& Spring-Cloud2022.0.1
2.7.9
& Spring-Cloud2021.0.6
2.7.7
& Spring-Cloud2021.0.5
2.6.2
& Spring-Cloud2021.0.0
However, I suspect this bug has existed ever since the historical migration from Netflix Ribbon Loadbalancer to Spring-Cloud Loadbalancer.
Note: This is effectively a re-raise of #650, which was opened in December 2021, but closed by the maintainers because a detailed reproduction example was not provided at that time.
Sample
This is demonstrated in https://github.com/oliverlockwood/spring-cloud-openfeign-retry-sameserver-bug. Follow the README therein.