Open PennyAndWang opened 5 years ago
Hi @PennyAndWang, that's a nice catch. I am not sure that running multiple coordinators in the same cluster is the right approach, as it might make the workers switch between coordinators and could lead to race conditions in memory allocation and other areas.
Apart from that, the reason it shows 4 workers is that in ClusterStatsResource we get the active node count and decrement it by 1 (assuming that there is exactly one active coordinator). We could replace that with something like 'Sets.difference(allNodes.getActiveNodes(), allNodes.getActiveCoordinators()).size();', which gives the effective active worker count.
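To illustrate the difference between the two counting strategies, here is a minimal sketch in plain Java. It is not the actual ClusterStatsResource code: the class name WorkerCount, both method names, and the use of String node identifiers are assumptions made for the example, and java.util sets stand in for Guava's Sets.difference.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Presto: compares the two ways of counting workers.
public class WorkerCount {
    // Current behaviour as described above: assume exactly one active coordinator.
    static long decrementByOne(Set<String> activeNodes) {
        return activeNodes.size() - 1;
    }

    // Proposed behaviour: remove every active coordinator from the active node set.
    static long activeWorkerCount(Set<String> activeNodes, Set<String> activeCoordinators) {
        Set<String> workers = new HashSet<>(activeNodes);
        workers.removeAll(activeCoordinators);
        return workers.size();
    }
}
```

With two coordinators and three workers registered, the first method reports 4 while the second reports 3, which matches the discrepancy described in this issue.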
@Praveen2112, thank you very much for your reply and your advice. I found the reason in the source code: "if (!isIncludeCoordinator) {activeNodes -= 1;}". I think the reported number of workers should be accurate, so I will submit a change.
@PennyAndWang in case of multiple coordinators, each coordinator should point to localhost as its discovery server. Workers should connect to only a single discovery server. For example, this could be achieved via coordinator DNS binding or by attaching a dedicated coordinator network interface (e.g. an ENI in AWS).
@sopel39, thank you very much for your reply. Let me explain our actual setup. In our company, we deployed a standalone discovery server. The coordinator and the workers both register with that discovery server. The coordinator is not a discovery service node and is set to "discovery-server.enabled=false". Moreover, there is a layer of proxy on the client and server to implement high availability for the coordinator. When we need to restart the coordinator, we use the proxy to switch user traffic to the new coordinator so that users are not affected. As a result, there can be multiple coordinators in a cluster at the same time.
@Praveen2112, I want to change the code to "long activeNodes = Sets.difference(nodeManager.getNodes(NodeState.ACTIVE),nodeManager.getAllNodes().getActiveCoordinators()).size();", but that raises a new problem. If a coordinator is set to "node-scheduler.include-coordinator=true", i.e. the coordinator is also a compute node, then the above code is still not accurate. I tried to solve this using the information from the discovery server, but found that the discovery server does not report whether a coordinator is also a compute node. So now I am a bit confused. Maybe this is not a big problem for the community, but it still has some impact on us. By the way, I apologize if this issue bothers you.
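One possible way to reason about the include-coordinator case is sketched below. It is an illustration under stated assumptions, not a proposed patch: the class SchedulableNodeCount, its method, and the String node identifiers are all hypothetical. The sketch only uses the local coordinator's own include-coordinator setting, because, as noted above, the discovery server does not expose that setting for other coordinators.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Presto.
public class SchedulableNodeCount {
    // Counts non-coordinator nodes, then adds the local coordinator back
    // only if it is active and configured to run work itself.
    // Assumption: other coordinators' include-coordinator settings are unknown,
    // so they are treated as non-compute nodes.
    static long count(Set<String> activeNodes,
                      Set<String> activeCoordinators,
                      String localCoordinator,
                      boolean localIncludeCoordinator) {
        Set<String> schedulable = new HashSet<>(activeNodes);
        schedulable.removeAll(activeCoordinators);
        if (localIncludeCoordinator && activeNodes.contains(localCoordinator)) {
            schedulable.add(localCoordinator);
        }
        return schedulable.size();
    }
}
```

This still undercounts if a second coordinator also schedules work on itself, which is exactly the gap described above; closing it would require each coordinator to publish that setting somewhere other nodes can read it.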
> The coordinator is not a discovery service node and is set to "discovery-server.enabled=false". Moreover, there is a layer of proxy on the client and server to implement high availability for the coordinator.
I don't think you can implement coordinator HA with just a single discovery server instance. The dispatcher (https://github.com/prestosql/presto/pull/95) might help here, but it's still in review.
Is this fixed, @PennyAndWang?