grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

Thick client side load balancing without using load balancer #11151

Open archit-harness opened 2 months ago

archit-harness commented 2 months ago

What is the best practice for client-side load balancing? I read through this thread - https://github.com/grpc/grpc-java/issues/428 - and the various options provided there. I found one example - https://github.com/grpc/grpc-java/blob/master/examples/src/main/java/io/grpc/examples/customloadbalance/CustomLoadBalanceClient.java

But I don't think it dynamically updates the server endpoints on scale up/down.

I am trying to follow https://github.com/sercand/kuberesolver, which uses the Kubernetes API to watch for pod IPs and update the endpoints for round-robin load balancing, with the server exposed as a headless service.

I found other blogs with different approaches - https://medium.com/jamf-engineering/how-three-lines-of-configuration-solved-our-grpc-scaling-issues-in-kubernetes-ca1ff13f7f06

But if I close the gRPC connection I do not benefit from the long-lived connection provided by gRPC.

Let me know your thoughts - what is the preferred approach for client-side load balancing?

ejona86 commented 1 month ago

For k8s, assuming you want L7 load balancing, it is normal to use a headless service with the round_robin load balancer. That can be done by calling channelBuilder.defaultLoadBalancingPolicy("round_robin").

When using DNS to resolve addresses, yes, you will want to configure serverBuilder.maxConnectionAge() on your server to occasionally cycle connections so that the client re-resolves DNS addresses. I'd suggest using an age of some number of minutes.

(That approach would work well for L4 load balancing as well, but you'd use pick_first with shuffleAddressList enabled.)
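
A minimal client-side sketch of the round_robin setup (the service name and port are placeholders, and usePlaintext() assumes no TLS inside the cluster):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Sketch only: round_robin across the pod addresses behind a headless service.
// "my-headless-service" and port 50051 are placeholders.
ManagedChannel channel = ManagedChannelBuilder
    .forTarget("dns:///my-headless-service:50051")
    .defaultLoadBalancingPolicy("round_robin")
    .usePlaintext()
    .build();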

As a slightly more advanced alternative, using the k8s watch API can work well, and with that approach you don't need to use max connection age. We don't have a built-in implementation of the k8s watch, but there are examples floating around. The NameResolver API is experimental and we know we will change it in some ways before marking it stable, but such changes we work to make easy to absorb, as we know people are using the API.
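
For illustration, a rough, non-authoritative skeleton of such a resolver against the experimental API (WatchNameResolver and onAddressesUpdated are made-up names; the k8s watch itself is omitted):

import io.grpc.EquivalentAddressGroup;
import io.grpc.NameResolver;
import java.net.InetSocketAddress;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative skeleton only. A real implementation would start a k8s
// Endpoints watch in start() and feed every update to onAddressesUpdated().
final class WatchNameResolver extends NameResolver {
  private final String authority;
  private Listener2 listener;

  WatchNameResolver(String authority) {
    this.authority = authority;
  }

  @Override
  public String getServiceAuthority() {
    return authority;
  }

  @Override
  public void start(Listener2 listener) {
    this.listener = listener;
    // Start the k8s watch here and push the initial endpoint set.
  }

  // Hypothetical callback invoked by the watch whenever pod IPs change.
  void onAddressesUpdated(List<InetSocketAddress> addresses) {
    List<EquivalentAddressGroup> groups = addresses.stream()
        .map(EquivalentAddressGroup::new)
        .collect(Collectors.toList());
    listener.onResult(ResolutionResult.newBuilder().setAddresses(groups).build());
  }

  @Override
  public void shutdown() {
    // Stop the watch and release its resources.
  }
}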

But if I close the gRPC connection I do not benefit from the long-lived connection provided by gRPC.

This isn't really a problem in practice. The amortized cost of the connection is pretty low, as long as you don't get very aggressive on the max connection age.

archit-harness commented 1 month ago

Thanks for the recommendation; we are trying this out and will update here.

ejona86 commented 1 month ago

Seems like this is resolved. If you end up having trouble, comment, and it can be reopened.

archit-harness commented 1 month ago

@ejona86 one more thing I wanted to check with you: when using round robin and a headless service, should the client refer to the server as dns:///headless-service: or just headless-service: ?

ejona86 commented 1 month ago

@archit-harness, those are generally equivalent. gRPC detects headless-service: is incomplete and prefixes it with "dns:///" (the default for most systems). The "canonical" form is "dns:///headless-service" (with or without port). No scheme prefix is a short form.

(In the olden days we only supported host:port, but when we added name resolver support which used the scheme we tried to detect if it was old-form and convert it into new-form. But the old-form is also useful as a shorthand.)
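
Concretely, these two are generally interchangeable (the port is illustrative):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Generally equivalent: the scheme-less form is detected as old-style
// host:port and treated as "dns:///...". Port 50051 is illustrative.
ManagedChannel canonical = ManagedChannelBuilder.forTarget("dns:///headless-service:50051").usePlaintext().build();
ManagedChannel shorthand = ManagedChannelBuilder.forTarget("headless-service:50051").usePlaintext().build();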

archit-harness commented 1 month ago

@ejona86 as per this blog - https://itnext.io/grpc-name-resolution-load-balancing-everything-you-need-to-know-and-probably-a-bit-more-77fc0ae9cd6c - the default scheme used is passthrough. The definition: "Passthrough (default): Just returns the target provided by the ClientConn without any specific logic."

Since DNS is specifically mentioned separately there, I didn't quite follow - so I wanted to confirm: will both return the same results?

ejona86 commented 1 month ago

That is grpc-go-specific. Go (against my recommendations) didn't use the new form (there were some technical issues, but in my mind they had easy solutions). Although Go now does do what I mentioned if using grpc.NewClient() instead of grpc.Dial() (deprecated). That is a very recent development.

archit-harness commented 1 month ago

@ejona86 thanks for the clarification - so it will resolve the same even without the dns prefix. 👍

archit-harness commented 1 week ago

Hi @ejona86 we implemented the changes and it looks like load balancing is working fine. But as mentioned here - https://medium.com/jamf-engineering/how-three-lines-of-configuration-solved-our-grpc-scaling-issues-in-kubernetes-ca1ff13f7f06 -

we are facing gRPC UNAVAILABLE errors during rolling updates.

But I don't get one thing - we have preStop hooks on our pods, which ensure a pod stays live for at least 60 seconds, and per the blog, DNS caches refresh after 30 seconds. We are still seeing those errors. I'm not sure how minReadySeconds would help mitigate the issue; as I understand it, the issue happens when DNS returns the IP of an old pod that has died, which shouldn't happen in our case if the refresh period is 30 seconds.

Is there any case I am missing?

Also, the errors are of two types:

a) Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: pipeline-service-headless/: Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused

b) Exception occurred: StatusRuntimeException: UNAVAILABLE: io exception; Cause: NativeIoException: recvAddress(..) failed: Connection timed out

ejona86 commented 5 days ago

I had given a few options. Are you using pick-first or round-robin?

"Connection timed out". This means you aren't likely doing a graceful shutdown. Normal processing for PreStop hook would be to do something like:

// Stop accepting new RPCs and new connections
server.shutdown();
// Wait for already-created RPCs to complete. Returns as soon as all RPCs complete
server.awaitTermination(30, TimeUnit.SECONDS);
// Now kill any outstanding RPCs to allow the server to shut down
server.shutdownNow();
// And give it a little bit of time to process. This should return pretty quickly
server.awaitTermination(5, TimeUnit.SECONDS);

The client will start reconnecting as soon as server.shutdown() is called.

In that medium post when it uses .spec.minReadySeconds = 30, the new pod comes up and clients will already start receiving the new DNS results when the old pod begins its shut down. There's a race between resolving new DNS results and reconnecting; reconnecting might first use the old DNS addresses (and then try again once the updated addresses are known). That medium post may be avoiding that issue since MAX_CONNECTION_AGE=30s matches minReadySeconds, so when an old pod begins shutting down all clients already have the new pod's IP address.

archit-harness commented 4 days ago

Hi @ejona86 I am using round robin.

So, I will try gracefully shutting down the server, which I am not doing currently. That could be the reason for it.

Also, what do you think is the recommended approach - should we use minReadySeconds, or rely on proper graceful shutdown of RPCs?

Also, should we look at something else to handle any other edge cases?

jasonmcintosh commented 4 days ago

Wrote up some tests... right now, it SEEMS like if retry is set, UNAVAILABLE responses retry against a different host when using load balancing. However, if we don't have retry set (and I use an intentional UNAVAILABLE status code on the server), the requests fail. IF we set a default retry policy to retry on UNAVAILABLE, I'm seeing via tests that the load balancing correctly re-routes and retries on a different host. It's been hard to simulate a server disconnect in an integration test without some magic, but I've KINDA tested the situation. I've got a server that sets

        responseObserver.onError(Status.UNAVAILABLE.withDescription("We are a failing server").asRuntimeException());

When "told to fail" as part of the response processor to try to simulate the "UNAVAILABLE" state. It's not as good as an iptables rule to drop the request silently but... ;)

I'm wondering if there's something missing on the retry side where a bad server somehow stays in the provider list. Logs where we've seen this in "real life" before the retry (and stack trace):

io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
    at io.harness.pms.contracts.service.OutcomeProtoServiceGrpc$OutcomeProtoServiceBlockingStub.resolveOptional(OutcomeProtoServiceGrpc.java)
....
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: internal-service-headless/10.36.29.215:12011
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
    at io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:833)

This call is repeated 3 times using the net.jodah.failsafe Retry handler, and all three requests land on the same "bad node", i.e. the one being shut down. I DO know the host disappeared from access logs shortly after we saw these - but we THOUGHT that the DnsProvider, with gRPC's default cache TTL of 30 seconds, plus the default transparent retries for these kinds of situations, would handle these failures & reroute. It doesn't seem to have worked that way. I'm wondering if the health-checking load balancer would handle this differently/better? OR if there's a config missing or something else going on. FYI this looks like version 1.59 of the grpc-java libraries; I haven't seen much in the changelog or other notes.

ejona86 commented 4 days ago

Should we use minReadySeconds, or rely on proper graceful shutdown of RPCs?

Definitely have graceful shutdown. Anything else would be in addition to that. If you see "Connection refused" errors with round-robin, that means the client didn't receive the new pod IPs. We expect gRPC to see some connection-refused errors, but we don't expect those to cause RPCs to fail; gRPC will use other addresses instead. If your RPCs are failing with connection refused, that means none of the addresses are working, which likely means all the addresses are old. minReadySeconds of 30 seconds could indeed help in that situation. How much you need minReadySeconds is a function of how many server pods you have, with fewer pods benefiting more.

Also, should we look at something else to handle any other edge cases?

Use MAX_CONNECTION_AGE if you aren't already. That will help with scale-up events.
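
A server-side sketch of that setting (values and service class are illustrative; the 30s age mirrors the blog's MAX_CONNECTION_AGE):

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

// Sketch only: periodically cycle connections so clients re-resolve DNS.
// MyServiceImpl is a placeholder for your service implementation.
Server server = NettyServerBuilder.forPort(50051)
    .maxConnectionAge(30, TimeUnit.SECONDS)       // send GOAWAY after ~30s
    .maxConnectionAgeGrace(10, TimeUnit.SECONDS)  // let in-flight RPCs drain
    .addService(new MyServiceImpl())
    .build()
    .start();  // throws IOException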

jasonmcintosh commented 4 days ago

SO did some more testing. I'll try to share the test code where we start up multiple gRPC servers & run some tests. Turns out there are a few quirks:

jasonmcintosh commented 4 days ago
    "methodConfig" : [{
      "name" : [{}],
      "waitForReady" : true,
      "retryPolicy" : {
        "maxAttempts" : 5,
        "initialBackoff" : "0.1s",
        "maxBackoff" : "1s",
        "backoffMultiplier" : 2,
        "retryableStatusCodes" : ["UNAVAILABLE"]
      }
    }]

FYI tests ALSO showed that something like this service config seems to be required to handle the server unavailable state without failures.
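
In Java, a config like that can also be supplied programmatically; a sketch under the same assumptions (the channel target is a placeholder, and numeric values must be Doubles in the config map):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch only: the same service config as above, built as a map and
// installed as the channel default. "my-headless-service" is a placeholder.
Map<String, Object> retryPolicy = new HashMap<>();
retryPolicy.put("maxAttempts", 5.0);
retryPolicy.put("initialBackoff", "0.1s");
retryPolicy.put("maxBackoff", "1s");
retryPolicy.put("backoffMultiplier", 2.0);
retryPolicy.put("retryableStatusCodes", Arrays.asList("UNAVAILABLE"));

Map<String, Object> methodConfig = new HashMap<>();
methodConfig.put("name", Collections.singletonList(Collections.emptyMap()));
methodConfig.put("waitForReady", true);
methodConfig.put("retryPolicy", retryPolicy);

ManagedChannel channel = ManagedChannelBuilder
    .forTarget("dns:///my-headless-service:50051")
    .defaultLoadBalancingPolicy("round_robin")
    .defaultServiceConfig(Collections.singletonMap(
        "methodConfig", Collections.singletonList(methodConfig)))
    .enableRetry()
    .usePlaintext()
    .build();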

jasonmcintosh commented 4 days ago

Ok, the retries help if the server EXPLICITLY returns that code but is still connected. I was trying to use the following to test a "bad server" state (an OOM kind of thing or something else):

  private static class GrpcClientTestServer extends GrpcClientTestGrpc.GrpcClientTestImplBase {
    private final int serverNumber;
    boolean shouldFail = false;

    public GrpcClientTestServer(int serverNumber) {
      this.serverNumber = serverNumber;
    }

    @Override
    public void sayHello(GrpcClientRequest request, StreamObserver<GrpcClientResponse> responseObserver) {
      if (shouldFail) {
        responseObserver.onError(Status.UNAVAILABLE.withDescription("We are a failing server").asRuntimeException());
        // Return here: calling onNext()/onCompleted() after onError() is illegal.
        return;
      }
      responseObserver.onNext(GrpcClientResponse.newBuilder()
                                  .setMessage("hello '" + request.getName() + "' from the server " + serverNumber + "!")
                                  .build());
      responseObserver.onCompleted();
    }
  }

I'm guessing this isn't the "right" way to signal a "BAD" situation on the server side.

jasonmcintosh commented 4 days ago

https://github.com/jasonmcintosh/bug-reports/blob/grpcJavaTests/grpc-java/load-balancing-tests/ - I pushed up the full code of some of the integration-type tests that exercise load-balancing behavior, failure states, etc., to see how things are handled. I'm pretty sure I missed some basics on the failure handling here.

jasonmcintosh commented 4 days ago

Looking - I've not found a good way to hook in a listener/metric on resolver resolution to see if DNS is the situation. Internally I may replicate the DnsResolutionProvider to inject a custom listener on resolution, unless there's a cleaner way? Adding the concept of a "name resolver resolution listener" would help, so we could inject log output or add metrics on a name-resolution change. I'm pretty sure there's SOMETHING on the config or DNS side, but additional logs would help confirm this. The existing logs don't seem to log the addresses (which would confirm "we got multiple addresses"), nor when a refresh happens. I will look at a PR to add some additional FINEST-level logging on address resolution if that would help.
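
Something like this delegating listener is the shape I have in mind (purely a sketch; LoggingListener is a made-up name, and it would sit inside a delegating NameResolver wrapping the DNS one):

import io.grpc.NameResolver;
import io.grpc.Status;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: logs each resolution result before forwarding it to the
// real listener, so address changes and refreshes become visible in logs.
final class LoggingListener extends NameResolver.Listener2 {
  private static final Logger logger = Logger.getLogger(LoggingListener.class.getName());
  private final NameResolver.Listener2 delegate;

  LoggingListener(NameResolver.Listener2 delegate) {
    this.delegate = delegate;
  }

  @Override
  public void onResult(NameResolver.ResolutionResult result) {
    logger.log(Level.FINEST, "Resolved addresses: {0}", result.getAddresses());
    delegate.onResult(result);
  }

  @Override
  public void onError(Status error) {
    logger.log(Level.FINEST, "Resolution error: {0}", error);
    delegate.onError(error);
  }
}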

archit-harness commented 3 days ago

Hi @ejona86 I am using MAX_CONNECTION_AGE = 30, as you suggested at the start.

I confirmed that we are getting the UNAVAILABLE errors only during pod stop.

Also, we do have a shutdown hook on the gRPC servers, but it is implemented like below:

Runtime.getRuntime().addShutdownHook(new Thread(() -> serviceManager.stopAsync().awaitStopped()));

So it might be happening asynchronously and thus not being honored? Do you recommend explicitly waiting for the server to shut down, as you shared in https://github.com/grpc/grpc-java/issues/11151#issuecomment-2204241930 ?
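
i.e. something like the below instead of the async stop (a sketch based on your earlier snippet; the timeouts are illustrative and `server` is the io.grpc.Server instance):

import io.grpc.Server;
import java.util.concurrent.TimeUnit;

// Sketch: explicit graceful shutdown in the hook, per the sequence above.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
  server.shutdown();  // stop accepting new RPCs and connections
  try {
    // Wait for in-flight RPCs to finish, then hard-cancel stragglers.
    if (!server.awaitTermination(30, TimeUnit.SECONDS)) {
      server.shutdownNow();
      server.awaitTermination(5, TimeUnit.SECONDS);
    }
  } catch (InterruptedException e) {
    server.shutdownNow();
    Thread.currentThread().interrupt();
  }
}));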

archit-harness commented 3 days ago

One more observation I would like to add: this is happening during scale-down of pods as well, and the exception is as below:

io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
    at io.harness.gitsync.HarnessToGitPushInfoServiceGrpc$HarnessToGitPushInfoServiceBlockingStub.getFile(HarnessToGitPushInfoServiceGrpc.java:986)
    at io.harness.gitsync.common.helper.GitSyncGrpcClientUtils.lambda$retryAndProcessExceptionV2$1(GitSyncGrpcClientUtils.java:44)
    at net.jodah.failsafe.Functions.lambda$resultSupplierOf$11(Functions.java:284)
    at net.jodah.failsafe.RetryPolicyExecutor.lambda$supply$0(RetryPolicyExecutor.java:63)
    at net.jodah.failsafe.Execution.executeSync(Execution.java:129)
    at net.jodah.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:379)
    at net.jodah.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:70)
    at io.harness.gitsync.common.helper.GitSyncGrpcClientUtils.retryAndProcessExceptionV2(GitSyncGrpcClientUtils.java:44)
    at io.harness.gitsync.scm.SCMGitSyncHelper.getFileByBranch(SCMGitSyncHelper.java:150)
    at io.harness.gitaware.helper.GitAwareEntityHelper.fetchEntityFromRemote(GitAwareEntityHelper.java:85)
    at io.harness.gitaware.helper.GitAwareEntityHelper$$FastClassBySpringCGLIB$$d8da3501.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    at org.springframework.aop.framework.CglibAopProxy.invokeMethod(CglibAopProxy.java:386)
    at org.springframework.aop.framework.CglibAopProxy.access$000(CglibAopProxy.java:85)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:704)
    at io.harness.gitaware.helper.GitAwareEntityHelper$$EnhancerBySpringCGLIB$$2e22fd54.fetchEntityFromRemote(<generated>)
    at io.harness.repositories.pipeline.PMSPipelineRepositoryCustomImpl.fetchRemoteEntity(PMSPipelineRepositoryCustomImpl.java:388)
    at io.harness.repositories.pipeline.PMSPipelineRepositoryCustomImpl.find(PMSPipelineRepositoryCustomImpl.java:321)
    at jdk.internal.reflect.GeneratedMethodAccessor564.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.springframework.data.repository.core.support.RepositoryMethodInvoker$RepositoryFragmentMethodInvoker.lambda$new$0(RepositoryMethodInvoker.java:289)
    at org.springframework.data.repository.core.support.RepositoryMethodInvoker.doInvoke(RepositoryMethodInvoker.java:137)
    at org.springframework.data.repository.core.support.RepositoryMethodInvoker.invoke(RepositoryMethodInvoker.java:121)
    at org.springframework.data.repository.core.support.RepositoryComposition$RepositoryFragments.invoke(RepositoryComposition.java:530)
    at org.springframework.data.repository.core.support.RepositoryComposition.invoke(RepositoryComposition.java:286)
    at org.springframework.data.repository.core.support.RepositoryFactorySupport$ImplementationMethodExecutionInterceptor.invoke(RepositoryFactorySupport.java:640)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.doInvoke(QueryExecutorMethodInterceptor.java:164)
    at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.invoke(QueryExecutorMethodInterceptor.java:139)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at io.opentelemetry.javaagent.instrumentation.spring.data.v1_8.SpringDataInstrumentationModule$RepositoryInterceptor.invoke(SpringDataInstrumentationModule.java:111)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:220)
    at jdk.proxy3/jdk.proxy3.$Proxy275.find(Unknown Source)
    at io.harness.pms.pipeline.service.PMSPipelineServiceImpl.getPipeline(PMSPipelineServiceImpl.java:536)
    at io.harness.pms.pipeline.service.PMSPipelineServiceImpl.getAndValidatePipeline(PMSPipelineServiceImpl.java:446)
    at io.harness.pms.pipeline.service.PMSPipelineServiceImpl.getAndValidatePipeline(PMSPipelineServiceImpl.java:413)
    at io.harness.pms.pipeline.resource.PipelineResourceImpl.getPipelineByIdentifier(PipelineResourceImpl.java:303)
    at io.harness.pms.pipeline.resource.PipelineResourceImpl$$EnhancerByGuice$$462297078.GUICE$TRAMPOLINE(<generated>)
    at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:74)
    at io.harness.accesscontrol.NGAccessControlCheckHandler.invoke(NGAccessControlCheckHandler.java:67)
    at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:75)
    at com.google.inject.internal.InterceptorStackCallback.invoke(InterceptorStackCallback.java:55)
    at io.harness.pms.pipeline.resource.PipelineResourceImpl$$EnhancerByGuice$$462297078.getPipelineByIdentifier(<generated>)
    at jdk.internal.reflect.GeneratedMethodAccessor1063.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:475)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:397)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:680)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
    at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
    at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:35)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:47)
    at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:41)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at io.harness.filter.HttpServiceLoopDetectionFilter.doFilter(HttpServiceLoopDetectionFilter.java:52)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:319)
    at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:273)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at com.codahale.metrics.jetty9.InstrumentedHandler.handle(InstrumentedHandler.java:313)
    at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
    at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:181)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.Server.handle(Server.java:516)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: internal-headless/x.x.x.x:<port>
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
    at io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:833)

At the same time, the IP appearing in the error belonged to a pod that was scaling down via HPA.