oliverlockwood commented 1 year ago

Describe the bug

Background

The scenario is:

using a Feign client that is configured by name (e.g. @FeignClient(name = "downstream-service")) rather than by URL
using Spring-Cloud Loadbalancer with retry enabled (i.e. we're using the RetryableFeignBlockingLoadBalancerClient)
the downstream service instance(s) are taking longer to respond than the configured Feign client timeout used by the client application.

Correct behaviour

Occurs when spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance is set to 0
The client application retries according to the configuration of max-retries-on-next-service-instance, and then at the end logs an error like this:

14:28:45.068 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
    ...
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
    at feign.Client$Default.convertResponse(Client.java:110)
    at feign.Client$Default.execute(Client.java:106)
    at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
    at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
    at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
    at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
    at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
    at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
    at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
    at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
    ...

which, in my opinion, is perfectly reasonable.

Incorrect behaviour (i.e. the bug)

Occurs when spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance is set to 1 or higher
The client application retries according to the configuration of max-retries-on-same-service-instance, and max-retries-on-next-service-instance, and then at the end logs an error like this:

4:30:46.549 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: downstream-service executing GET http://downstream-service/api/something] with root cause
java.net.UnknownHostException: downstream-service
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
    ...
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
    at feign.Client$Default.convertResponse(Client.java:110)
    at feign.Client$Default.execute(Client.java:106)
    at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
    at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
    at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
    at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
    at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
    at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
    at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
    at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
    ...

which, in my opinion, is not acceptable. The actual underlying issue is lost, and in its place there's now a spurious, misleading UnknownHostException, because for some reason the implementation tries to hit the literal URL http://downstream-service/api/something, abandoning the whole concept of service discovery.

Workaround

Always set spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance to 0.

Affected versions

I have explicitly proven that this bug reproduces on the following versions:

Spring-Boot 3.0.3 & Spring-Cloud 2022.0.1
Spring-Boot 2.7.9 & Spring-Cloud 2021.0.6
Spring-Boot 2.7.7 & Spring-Cloud 2021.0.5
Spring-Boot 2.6.2 & Spring-Cloud 2021.0.0

However, I suspect this bug has existed ever since the historical migration from Netflix Ribbon Loadbalancer to Spring-Cloud Loadbalancer.

Note: This is effectively a re-raise of #650, which was opened in December 2021, but closed by the maintainers because a detailed reproduction example was not provided at that time.

Sample

This is demonstrated in https://github.com/oliverlockwood/spring-cloud-openfeign-retry-sameserver-bug. Follow the README therein.

OlgaMaciaszek commented 1 year ago

Thanks, @oliverlockwood . We'll take a look at it.

apikozh commented 1 year ago

Heh! @oliverlockwood, overtook a little 😄 I was in process of writing the same bug request in parallel with you. Found it about a week ago.

The problem is caused by the fact that serviceInstance from the retry context is not set to retrievedServiceInstance, and logic assumes that retrieved service instance is null (absent) and uses original request with originalUri. But originalUri contains FeignClient.url or "http://" + FeignClient.name + ":80", if FeignClient.url is not set. Part of RetryableFeignBlockingLoadBalancerClient

            if (context instanceof LoadBalancedRetryContext) {
                LoadBalancedRetryContext lbContext = (LoadBalancedRetryContext) context;
                ServiceInstance serviceInstance = lbContext.getServiceInstance();
                retrievedServiceInstance = serviceInstance; // Should be added!
                if (serviceInstance == null) {
                    ...
                }

                if (retrievedServiceInstance == null) {
...

I'll also leave my sample here, it might be useful: loadbalancer-retry-test.zip

oliverlockwood commented 1 year ago

@apikozh great minds :)

If you've got a proposed fix, why not raise it as a PR?

apikozh commented 1 year ago

Possible fix in #841, but not tested it yet

oliverlockwood commented 1 year ago

For what it's worth, I manually built the fix in #841 locally, and ran my example application using the versions installed from this, and it does appear to me to fix the issue.

I end up with this error log after the retries:

21:11:31.218 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
    ...
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
    at feign.Client$Default.convertResponse(Client.java:110)
    at feign.Client$Default.execute(Client.java:106)
    at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
    at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
    at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
    at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
    at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
    at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
    at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
    at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
    ...

which looks as it should do, correctly exposing the underlying error.

OlgaMaciaszek commented 1 year ago

Hi @oliverlockwood - thanks for providing the sample and verifying the PR. Makes sense.

spring-cloud / spring-cloud-openfeign

`RetryableFeignBlockingLoadBalancerClient` loses the real root cause and presents a spurious `UnknownHostException` when `max-retries-on-same-service-instance` is > 0 #840

Describe the bug

Background

Correct behaviour

Incorrect behaviour (i.e. the bug)

Workaround

Affected versions

Sample