spring-cloud / spring-cloud-openfeign

Support for using OpenFeign in Spring Cloud apps
Apache License 2.0
1.22k stars 786 forks source link

`RetryableFeignBlockingLoadBalancerClient` loses the real root cause and presents a spurious `UnknownHostException` when `max-retries-on-same-service-instance` is > 0 #840

Closed oliverlockwood closed 1 year ago

oliverlockwood commented 1 year ago

Describe the bug

Background

The scenario is:

Correct behaviour

14:28:45.068 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
    ...
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
    at feign.Client$Default.convertResponse(Client.java:110)
    at feign.Client$Default.execute(Client.java:106)
    at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
    at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
    at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
    at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
    at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
    at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
    at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
    at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
    ...

which, in my opinion, is perfectly reasonable.

Incorrect behaviour (i.e. the bug)

4:30:46.549 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: downstream-service executing GET http://downstream-service/api/something] with root cause
java.net.UnknownHostException: downstream-service
    at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567)
    ...
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
    at feign.Client$Default.convertResponse(Client.java:110)
    at feign.Client$Default.execute(Client.java:106)
    at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
    at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
    at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
    at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
    at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
    at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
    at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
    at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
    ...

which, in my opinion, is not acceptable. The actual underlying issue is lost, and in its place there's now a spurious, misleading UnknownHostException, because for some reason the implementation tries to hit the literal URL http://downstream-service/api/something, abandoning the whole concept of service discovery.

Workaround

Always set spring.cloud.loadbalancer.retry.max-retries-on-same-service-instance to 0.

Affected versions

I have explicitly proven that this bug reproduces on the following versions:

However, I suspect this bug has existed ever since the historical migration from Netflix Ribbon Loadbalancer to Spring-Cloud Loadbalancer.

Note: This is effectively a re-raise of #650, which was opened in December 2021, but closed by the maintainers because a detailed reproduction example was not provided at that time.

Sample

This is demonstrated in https://github.com/oliverlockwood/spring-cloud-openfeign-retry-sameserver-bug. Follow the README therein.

OlgaMaciaszek commented 1 year ago

Thanks, @oliverlockwood . We'll take a look at it.

apikozh commented 1 year ago

Heh! @oliverlockwood, overtook a little 😄 I was in process of writing the same bug request in parallel with you. Found it about a week ago.

The problem is caused by the fact that serviceInstance from the retry context is not set to retrievedServiceInstance, and logic assumes that retrieved service instance is null (absent) and uses original request with originalUri. But originalUri contains FeignClient.url or "http://" + FeignClient.name + ":80", if FeignClient.url is not set. Part of RetryableFeignBlockingLoadBalancerClient

            if (context instanceof LoadBalancedRetryContext) {
                LoadBalancedRetryContext lbContext = (LoadBalancedRetryContext) context;
                ServiceInstance serviceInstance = lbContext.getServiceInstance();
                retrievedServiceInstance = serviceInstance; // Should be added!
                if (serviceInstance == null) {
                    ...
                }

                if (retrievedServiceInstance == null) {
...

I'll also leave my sample here, it might be useful: loadbalancer-retry-test.zip

oliverlockwood commented 1 year ago

@apikozh great minds :)

If you've got a proposed fix, why not raise it as a PR?

apikozh commented 1 year ago

Possible fix in #841, but not tested it yet

oliverlockwood commented 1 year ago

For what it's worth, I manually built the fix in #841 locally, and ran my example application using the versions installed from this, and it does appear to me to fix the issue.

I end up with this error log after the retries:

21:11:31.218 [http-nio-1234-exec-2] ERROR o.a.c.c.C.[.[.[.[dispatcherServlet] - Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: Read timed out executing GET http://downstream-service/api/something] with root cause
java.net.SocketTimeoutException: Read timed out
    at java.base/sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:283)
    ...
    at java.base/java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:529)
    at feign.Client$Default.convertResponse(Client.java:110)
    at feign.Client$Default.execute(Client.java:106)
    at org.springframework.cloud.openfeign.loadbalancer.LoadBalancerUtils.executeWithLoadBalancerLifecycleProcessing(LoadBalancerUtils.java:57)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.lambda$execute$2(RetryableFeignBlockingLoadBalancerClient.java:168)
    at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:329)
    at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:225)
    at org.springframework.cloud.openfeign.loadbalancer.RetryableFeignBlockingLoadBalancerClient.execute(RetryableFeignBlockingLoadBalancerClient.java:114)
    at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:121)
    at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:91)
    at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:100)
    at jdk.proxy2/jdk.proxy2.$Proxy107.getMeSomething(Unknown Source)
    at com.oliverlockwood.retrybug.ClientController.hitTheDownstreamService(ClientController.java:20)
    ...

which looks as it should do, correctly exposing the underlying error.

OlgaMaciaszek commented 1 year ago

Hi @oliverlockwood - thanks for providing the sample and verifying the PR. Makes sense.