Open deenav opened 5 years ago
Attaching controller logs controller-logs.zip
@deenav can you share the client configuration? I would like to know how the Pravega endpoint is used in the client application. Thanks.
@adrianmo I was running the pravega-longevity medium-scale workload on my cluster using:
./tools/longevity_tests.py --image-repository=devops-repo.isus.emc.com:8116/pravega-longevity --image-tag=18.ed0bc5e-0.5.0-2151.2e8f0b7 --controller-uri tcp://10.100.200.126:9090 --test=medium-scale deploy
@deenav please try to reproduce the error using the Controller service name instead of the IP address in the --controller-uri
parameter.
I was able to reproduce a similar error yesterday, but it suddenly worked today and I have not been able to reproduce it again in the past few hours. It will be harder to debug if this error happens intermittently.
It seems that this is a load balancing problem, as @deenav suggests, because I have been looking into the controller logs and nothing suspicious shows up.
I was misled by messages like Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[abortStreamReaders], but that failed process had already been handled by another controller, so it is not the cause of our problem.
@adrianmo It seems pravega-longevity does not work with the Controller service name as --controller-uri. It throws a Name resolver io.pravega.client.stream.impl.ControllerResolverFactory$ControllerNameResolver@3f5a124d returned an empty list exception and does not start any I/O (stack trace below; a client-side sketch follows it).
Exception in thread "Thread-32" io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.grpc.StatusRuntimeException: UNAVAILABLE: Name resolver io.pravega.client.stream.impl.ControllerResolverFactory$ControllerNameResolver@3f5a124d returned an empty list
at io.pravega.common.util.Retry$RetryAndThrowBase.lambda$null$6(Retry.java:245)
at java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source)
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
at io.pravega.client.stream.impl.ControllerImpl$RPCAsyncCallback.onError(ControllerImpl.java:1016)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
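For reference, here is a minimal sketch of pointing the Pravega Java client at the Controller via a Kubernetes service DNS name; the service name, namespace, and scope below are illustrative placeholders, not the longevity tool's actual configuration:

```java
import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;

public class ControllerUriSmokeTest {
    public static void main(String[] args) {
        // Hypothetical service DNS name; substitute the actual Controller Service
        // name and namespace from the deployment under test.
        URI controllerUri = URI.create("tcp://pravega-pravega-controller.default.svc.cluster.local:9090");

        ClientConfig config = ClientConfig.builder()
                .controllerURI(controllerUri)
                .build();

        // The client's ControllerNameResolver resolves the host part of this URI via DNS.
        try (StreamManager streamManager = StreamManager.create(config)) {
            boolean created = streamManager.createScope("resolver-smoke-test");
            System.out.println("Scope created: " + created);
        }
    }
}
```

If the DNS lookup for the service name fails inside the test pod, the ControllerNameResolver has no addresses to hand to gRPC, which would produce exactly the "returned an empty list" failure above.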
One Controller pod is in an inactive state after deleting one of the two controllers.
Pravega version: 0.5.0-2236.5228e2d
Zookeeper Operator : pravega/zookeeper-operator:0.2.1
Pravega Operator: pravega/pravega-operator:0.3.2
If you delete the controller pod that is running properly, the other controller (which was stuck) starts taking the new I/O.
After investigating, I'm quite sure that the following is happening:
1- When the longevity process starts, it establishes multiple GRPC (HTTP2) connections to the Controller service, which are forwarded to the backend pods in a round robin fashion. Therefore, some connections are established with Controller 1, and some other with Controller 2.
2- When you restart Controller 1, all the connections to that pod are dropped and reestablished through the service, which forwards them to the only pod available, Controller 2. At this point, Controller 2 is handling all connections from the longevity process.
3- When Controller 1 comes back up, it does not receive any traffic because all connections are already established with Controller 2. This is why Controller 1 does not show any activity and seems idle.
4- Eventually, when Controller 2 faces an issue or is deleted, connections are reestablished with the service and forwarded to Controller 1. Then you start seeing activity again.
This happens because HTTP/2 multiplexes requests into a single TCP connection, as opposed to HTTP/1, which establishes a connection for each request.
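For illustration only, a minimal grpc-java sketch of what per-request (round-robin) balancing looks like on the client side, assuming a hypothetical headless Service that publishes one DNS record per controller pod; this is not Pravega's or the operator's actual wiring:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class RoundRobinControllerChannel {
    public static ManagedChannel create() {
        // Hypothetical headless Service (clusterIP: None): DNS returns one A record
        // per controller pod, so the gRPC DNS resolver sees every backend rather
        // than the single virtual IP of a regular Service.
        String target = "dns:///pravega-pravega-controller-headless.default.svc.cluster.local:9090";

        return ManagedChannelBuilder.forTarget(target)
                // Spread RPCs across all resolved backends instead of pinning
                // every request to the first connection that was established.
                .defaultLoadBalancingPolicy("round_robin")
                .usePlaintext()
                .build();
    }
}
```

With the default pick_first policy and a regular Service virtual IP, every RPC rides a single long-lived HTTP/2 connection, which is exactly the stickiness described above.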
Related documents:
Based on the current operator/controller code, the behavior exhibited by the controller is the expected behavior. With two controllers, say 'a' and 'b': if 'a' goes down, all new client connections are redirected by the load balancer to controller 'b', and later, when 'a' comes back up, there is no way to automatically shift those connections from 'b' to 'a' because the connections to 'b' are persistent. Currently no attempt to do this is made by the operator/controller, by design.
If such load balancing is a requirement, please file a new feature request for it.
@pbelgundi the behavior you are describing represents "failover" for High Availability. The problem here is w.r.t. load balancing gRPC traffic across the Controller nodes.
The traffic to Controller pods in Pravega Operator managed Kubernetes deployments is not load-balanced because of the load balancing mechanism chosen in the Operator: a Service of type LoadBalancer, which represents a layer 4 load balancer (L4 LB).
An L4 LB cannot load balance gRPC traffic, as pointed out by the references listed by @adrianmo above. Load balancing gRPC traffic requires request-level load balancing, which is supported by some Ingress controllers such as Ingress-nginx (which in turn spin up application-level/layer 7 load balancers).
Whether to use sticky sessions or load balance across all nodes is usually a policy decision represented as rules/configuration in a load balancer. In this case, the choice of the load balancer itself has restricted the options.
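To make the distinction concrete, here is a small Java probe (hypothetical service names, runnable only from inside the cluster) showing why an L4 Service hides the individual backends from the client: a regular ClusterIP/LoadBalancer Service resolves to a single virtual IP, whereas a headless Service would return one address per controller pod.

```java
import java.net.InetAddress;
import java.util.Arrays;

public class ServiceResolutionProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical names: the first is a regular Service (single virtual IP),
        // the second a headless Service (one A record per controller pod).
        String[] hosts = {
                "pravega-pravega-controller.default.svc.cluster.local",
                "pravega-pravega-controller-headless.default.svc.cluster.local"
        };
        for (String host : hosts) {
            System.out.println(host + " -> " + Arrays.toString(InetAddress.getAllByName(host)));
        }
    }
}
```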
It is not by design that gRPC load balancing was left out; the problem was simply realized later. @Adrian and I discussed this at length some time back, and he did some investigation and posted his findings above.
I just want to share the context. In my view too, this is an enhancement.
> @pbelgundi the behavior you are describing represents "failover" for High Availability. The problem here is w.r.t. load balancing gRPC traffic across the Controller nodes.
No. What I have described is not about "failover"; that is working as expected. This issue is about uneven load balancing after "failover". The current operator correctly load balances in a normal (non-failover) situation.
There are certainly better ways to do gRPC load balancing so that it also works after "failover". It's just that the current operator does not attempt to do this, and it needs to be taken up as a feature/enhancement.
@pbelgundi I'm not sure that just closing this issue is the right thing to do. You are right, this is an enhancement to be done, and probably not for 0.3.2. What's more, it is still unclear whether this needs to be fixed in the operator or in Pravega. But it would be great to keep this issue open for the moment (relabeling it), or replace it with another one, so that the knowledge accumulated on the problem so far is preserved instead of just closing it.
> @pbelgundi I'm not sure that just closing this issue is the right thing to do. You are right, this is an enhancement to be done, and probably not for 0.3.2. What's more, it is still unclear whether this needs to be fixed in the operator or in Pravega. But it would be great to keep this issue open for the moment (relabeling it), or replace it with another one, so that the knowledge accumulated on the problem so far is preserved instead of just closing it.
I'm fine with re-labelling too. This would definitely require a fix on the operator side.
Right now, by virtue of using gRPC, we load balance requests on a per-client basis. This is a consequence of the design choice to use gRPC, which uses HTTP/2 as its transport and multiplexes multiple requests over a persistent connection. So this design choice has the limitation that once a client connects to a controller, the connection is sticky until it is disrupted.
As an improvement, we should definitely load balance the "requests", and not just the connections, across the available controller instances. But that will require a broader discussion. We can keep this issue and continue that discussion in this context, but we should move it to 0.6 as a new feature.
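One mitigation that is often discussed for this kind of stickiness is capping the connection age on the server, so clients periodically reconnect through the Service and can land on a controller that rejoined after a restart. Below is a minimal grpc-java sketch of the mechanism, with made-up port and durations; it is not how the Pravega Controller currently configures its gRPC server:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

public class ConnectionAgingSketch {
    public static Server build(BindableService controllerService) {
        return NettyServerBuilder.forPort(9090)
                // Send GOAWAY on connections older than 5 minutes so clients reconnect
                // through the Service and get re-spread across controller pods...
                .maxConnectionAge(5, TimeUnit.MINUTES)
                // ...while giving in-flight RPCs 30 more seconds to complete cleanly.
                .maxConnectionAgeGrace(30, TimeUnit.SECONDS)
                .addService(controllerService)
                .build();
    }
}
```

This only smooths re-balancing over time; true per-request balancing would still need the client-side or L7 approaches mentioned above.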
Some products that offer load balancing capabilities for gRPC traffic:
https://github.com/jtattermusch/grpc-loadbalancing-kubernetes-examples
https://github.com/jtattermusch/grpc-loadbalancing-kubernetes-examples/blob/master/grpc_loadbalancing_kubernetes_slides.pdf
While running a longevity test in a PKS cluster with a moderate I/O (medium-scale) workload, I am observing time lapses in the controller logs whenever there is more than one controller in the PKS cluster.
Environment details: PKS / K8s, medium cluster.
Steps to reproduce:
1. Monitor the controller logs with the kubectl logs -f po/<controller-pod-name> command.
2. Restart one of the controllers with the kubectl exec -it <controller1-pod-name> reboot command.
3. After the restarted controller logs ControllerServiceStarter STARTING, it does not log anything further and does not serve read/write requests from the client.
4. It stays idle (until a reboot on the active controller), which makes the idle controller resume its services.
Snip of time lapses in controller log:
Problem location: My observation is that load balancing does not happen properly when a restarted/failed controller resumes operation while there is more than one controller in the cluster.