Open deenav opened 5 years ago
Attaching controller logs controller-logs.zip
@deenav can you share the client configuration? I would like to know how the Pravega endpoint is used in the client application. Thanks.
@adrianmo I was running the pravega-longevity medium-scale workload on my cluster using:
./tools/longevity_tests.py --image-repository=devops-repo.isus.emc.com:8116/pravega-longevity --image-tag=18.ed0bc5e-0.5.0-2151.2e8f0b7 --controller-uri tcp://10.100.200.126:9090 --test=medium-scale deploy
@deenav please try to reproduce the error using the Controller service name instead of the IP address in the --controller-uri
parameter.
I was able to reproduce a similar error yesterday, but it suddenly worked today and I have not been able to reproduce it again in the past few hours. It will be harder to debug if this error happens intermittently.
It seems that this is a load balancing problem, as @deenav suggests, because I have been looking into the controller logs and nothing suspicious shows up.
I was misled by messages like Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[abortStreamReaders], but that failed process had already been handled by another controller, so it is not the cause of our problem.
@adrianmo It seems pravega-longevity does not work with the Controller service name as --controller-uri. It throws a Name resolver io.pravega.client.stream.impl.ControllerResolverFactory$ControllerNameResolver@3f5a124d returned an empty list exception and does not start any I/O (stack trace below; a client-side sketch follows it).
Exception in thread "Thread-32" io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.grpc.StatusRuntimeException: UNAVAILABLE: Name resolver io.pravega.client.stream.impl.ControllerResolverFactory$ControllerNameResolver@3f5a124d returned an empty list
at io.pravega.common.util.Retry$RetryAndThrowBase.lambda$null$6(Retry.java:245)
at java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source)
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
at io.pravega.client.stream.impl.ControllerImpl$RPCAsyncCallback.onError(ControllerImpl.java:1016)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
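For reference, here is a minimal sketch of pointing the Pravega Java client at the Controller via a Kubernetes service DNS name; the service name, namespace, and scope below are illustrative placeholders, not the longevity tool's actual configuration:

```java
import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.admin.StreamManager;

public class ControllerUriSmokeTest {
    public static void main(String[] args) {
        // Hypothetical service DNS name; substitute the actual Controller Service
        // name and namespace from the deployment under test.
        URI controllerUri = URI.create("tcp://pravega-pravega-controller.default.svc.cluster.local:9090");

        ClientConfig config = ClientConfig.builder()
                .controllerURI(controllerUri)
                .build();

        // The client's ControllerNameResolver resolves the host part of this URI via DNS.
        try (StreamManager streamManager = StreamManager.create(config)) {
            boolean created = streamManager.createScope("resolver-smoke-test");
            System.out.println("Scope created: " + created);
        }
    }
}
```

If the DNS lookup for the service name fails inside the test pod, the ControllerNameResolver has no addresses to hand to gRPC, which would produce exactly the "returned an empty list" failure above.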
One Controller pod is in an inactive state after deleting one of the two controllers.
Pravega version: 0.5.0-2236.5228e2d
Zookeeper Operator : pravega/zookeeper-operator:0.2.1
Pravega Operator: pravega/pravega-operator:0.3.2
If you delete the controller pod that is running properly, the other controller (which was stuck) starts taking the new I/O.
After investigating, I'm quite sure that the following is happening:
1- When the longevity process starts, it establishes multiple GRPC (HTTP2) connections to the Controller service, which are forwarded to the backend pods in a round robin fashion. Therefore, some connections are established with Controller 1, and some other with Controller 2.
2- When you restart Controller 1, all the connections to that pod are dropped and reestablished through the service, which forwards them to the only pod available, Controller 2. At this point, Controller 2 is handling all connections from the longevity process.
3- When Controller 1 comes back up, it does not receive any traffic because all connections are already established with Controller 2. This is why Controller 1 does not show any activity and seems idle.
4- Eventually, when Controller 2 faces an issue or is deleted, connections are reestablished with the service and forwarded to Controller 1. Then you start seeing activity again.
This happens because HTTP/2 multiplexes requests into a single TCP connection, as opposed to HTTP/1, which establishes a connection for each request.
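For illustration only, a minimal grpc-java sketch of what per-request (round-robin) balancing looks like on the client side, assuming a hypothetical headless Service that publishes one DNS record per controller pod; this is not Pravega's or the operator's actual wiring:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class RoundRobinControllerChannel {
    public static ManagedChannel create() {
        // Hypothetical headless Service (clusterIP: None): DNS returns one A record
        // per controller pod, so the gRPC DNS resolver sees every backend rather
        // than the single virtual IP of a regular Service.
        String target = "dns:///pravega-pravega-controller-headless.default.svc.cluster.local:9090";

        return ManagedChannelBuilder.forTarget(target)
                // Spread RPCs across all resolved backends instead of pinning
                // every request to the first connection that was established.
                .defaultLoadBalancingPolicy("round_robin")
                .usePlaintext()
                .build();
    }
}
```

With the default pick_first policy and a regular Service virtual IP, every RPC rides a single long-lived HTTP/2 connection, which is exactly the stickiness described above.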
Related documents:
Based on the current operator/controller code, the behavior exhibited by the controller is the expected behavior. With two controllers, say 'a' and 'b': if 'a' goes down, all new client connections are redirected by the load balancer to controller 'b', and later, when 'a' comes back up, there is no way to automatically shift those connections from 'b' to 'a' because the connections to 'b' are persistent. Currently no attempt to do this is made by the operator/controller, by design.
If such load balancing is a requirement, please file a new feature request for it.
@pbelgundi the behavior you are describing represents "failover" for High Availability. The problem here is w.r.t. load balancing gRPC traffic across the Controller nodes.
The traffic to Controller pods in Pravega Operator managed Kubernetes deployments is not load-balanced because of the load balancing mechanism chosen in the Operator: a Service of type LoadBalancer, which represents a layer 4 load balancer (L4 LB).
An L4 LB cannot load balance gRPC traffic, as pointed out by the references listed by @adrianmo above. Load balancing gRPC traffic requires request-level load balancing, which is supported by some Ingress controllers such as Ingress-nginx (which in turn spin up application-level/layer 7 load balancers).
Whether to use sticky sessions or load balance across all nodes is usually a policy decision represented as rules/configuration in a load balancer. In this case, the choice of the load balancer itself has restricted the options.
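To make the distinction concrete, here is a small Java probe (hypothetical service names, runnable only from inside the cluster) showing why an L4 Service hides the individual backends from the client: a regular ClusterIP/LoadBalancer Service resolves to a single virtual IP, whereas a headless Service would return one address per controller pod.

```java
import java.net.InetAddress;
import java.util.Arrays;

public class ServiceResolutionProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical names: the first is a regular Service (single virtual IP),
        // the second a headless Service (one A record per controller pod).
        String[] hosts = {
                "pravega-pravega-controller.default.svc.cluster.local",
                "pravega-pravega-controller-headless.default.svc.cluster.local"
        };
        for (String host : hosts) {
            System.out.println(host + " -> " + Arrays.toString(InetAddress.getAllByName(host)));
        }
    }
}
```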
It is not by design that gRPC load balancing was left out; the problem was simply realized later. @Adrian and I discussed this at length some time back, and he did some investigation and posted his findings above.
I just want to share the context. In my view too, this is an enhancement.
> @pbelgundi the behavior you are describing represents "failover" for High Availability. The problem here is w.r.t. load balancing gRPC traffic across the Controller nodes.
No. What I have described is not about "failover"; that is working as expected. This issue is about uneven load balancing after "failover". The current operator correctly load balances in a normal (non-failover) situation.
There are certainly better ways to do gRPC load balancing so that it also works after "failover". It's just that the current operator does not attempt to do this, and it needs to be taken up as a feature/enhancement.
@pbelgundi I'm not sure that just closing this issue is the right thing to do. You are right, this is an enhancement to be done, and probably not for 0.3.2. What's more, it is still unclear whether this needs to be fixed in the operator or in Pravega. But it would be great to keep this issue open for the moment (relabeling it), or replace it with another one, so that the knowledge accumulated on the problem so far is preserved instead of just closing it.
> @pbelgundi I'm not sure that just closing this issue is the right thing to do. You are right, this is an enhancement to be done, and probably not for 0.3.2. What's more, it is still unclear whether this needs to be fixed in the operator or in Pravega. But it would be great to keep this issue open for the moment (relabeling it), or replace it with another one, so that the knowledge accumulated on the problem so far is preserved instead of just closing it.
I'm fine with re-labelling too. This would definitely require a fix on the operator side.
Right now, by virtue of using gRPC, we load balance requests on a per-client basis. This is a consequence of the design choice to use gRPC, which uses HTTP/2 as its transport and multiplexes multiple requests over a persistent connection. So this design choice has the limitation that once a client connects to a controller, the connection is sticky until it is disrupted.
As an improvement, we should definitely load balance the "requests", and not just the connections, across the available controller instances. But that will require a broader discussion. We can keep this issue and continue that discussion in this context, but we should move it to 0.6 as a new feature.
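One mitigation that is often discussed for this kind of stickiness is capping the connection age on the server, so clients periodically reconnect through the Service and can land on a controller that rejoined after a restart. Below is a minimal grpc-java sketch of the mechanism, with made-up port and durations; it is not how the Pravega Controller currently configures its gRPC server:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

public class ConnectionAgingSketch {
    public static Server build(BindableService controllerService) {
        return NettyServerBuilder.forPort(9090)
                // Send GOAWAY on connections older than 5 minutes so clients reconnect
                // through the Service and get re-spread across controller pods...
                .maxConnectionAge(5, TimeUnit.MINUTES)
                // ...while giving in-flight RPCs 30 more seconds to complete cleanly.
                .maxConnectionAgeGrace(30, TimeUnit.SECONDS)
                .addService(controllerService)
                .build();
    }
}
```

This only smooths re-balancing over time; true per-request balancing would still need the client-side or L7 approaches mentioned above.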
Some products that offer load balancing capabilities for gRPC traffic:
https://github.com/jtattermusch/grpc-loadbalancing-kubernetes-examples
https://github.com/jtattermusch/grpc-loadbalancing-kubernetes-examples/blob/master/grpc_loadbalancing_kubernetes_slides.pdf
While running a longevity test in a PKS cluster with a moderate I/O (medium-scale) workload, I am observing time lapses in the controller logs whenever there is more than one controller in the PKS cluster.
Environment details: PKS / K8s, medium cluster.
Steps to reproduce:
1. Monitor the controller logs with the kubectl logs -f po/<controller-pod-name> command.
2. Restart one of the controllers with the kubectl exec -it <controller1-pod-name> reboot command.
3. After the restarted controller logs ControllerServiceStarter STARTING, it does not log anything further and does not serve read/write requests from the client.
4. It stays idle (until a reboot on the active controller), which makes the idle controller resume its services.
Snip of time lapses in controller log:
Problem location: My observation is that load balancing does not happen properly when a restarted/failed controller resumes operation while there is more than one controller in the cluster.