performancecopilot / ansible-pcp

Ansible roles for the Performance Co-Pilot toolkit
https://pcp.io
MIT License

The elasticsearch role should NOT assume `localhost` for the elasticsearch service #29

Open portante opened 2 years ago

portante commented 2 years ago

We need to enable the use case where PCP metrics are gathered from an Elasticsearch (or OpenSearch) instance that is not local to the PMDA, and where gathering metrics may require HTTPS with certificates or bearer tokens.
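To make the request concrete, here is a minimal Python sketch of what targeting a non-local endpoint over HTTPS with a bearer token might involve. It is not the PMDA's actual code. The function names, the host `es.example.com`, and the token value are assumptions made up for illustration; `_cluster/health` is a standard Elasticsearch/OpenSearch endpoint.

```python
import ssl
import urllib.request
from typing import Optional


def build_health_request(base_url: str,
                         token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the cluster health endpoint.

    base_url is the (possibly remote) cluster endpoint; token is an
    optional bearer token. Both are hypothetical parameters.
    """
    url = base_url.rstrip("/") + "/_cluster/health"
    headers = {"Accept": "application/json"}
    if token:
        # Only send credentials when a token was configured.
        headers["Authorization"] = "Bearer " + token
    return urllib.request.Request(url, headers=headers)


def build_tls_context(ca_file: Optional[str] = None,
                      cert_file: Optional[str] = None,
                      key_file: Optional[str] = None) -> ssl.SSLContext:
    """Create a verifying TLS context, optionally with a private CA
    bundle and a client certificate (all paths are hypothetical)."""
    ctx = ssl.create_default_context(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The request would then be sent with something like `urllib.request.urlopen(req, context=ctx, timeout=10)`. With no token and an `http://localhost:9200` URL, the same code covers the current local-only behaviour.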

natoscott commented 2 years ago

@portante what are the situations where you would not want to run the PMDA on the same machine? PCP should always be installed on all hosts forming part of any distributed system, as performance problems can originate anywhere. In that case, it's always the best deployment option (more efficient, simpler installs), and it seems we should be enforcing this (as we are now) so as not to add network load while sampling. It also obviously keeps the roles simpler if we don't have to add more variables, certificate handling, and so on.

portante commented 2 years ago

There are a few reasons.

The statistics Elasticsearch returns are for the whole cluster, that is, all nodes in the cluster. Some of the queries are very involved when you have lots of indices. So having every node in an Elasticsearch cluster gather all the same metrics can actually cause problems for the cluster itself.

So if I have a 20 node elasticsearch cluster this would be a huge problem.

Typically, you'd have those metrics collected only from the "master" nodes, since most installations only have a small number of master nodes relative to the total number of nodes in the cluster. But even that is not great, because typically folks deploy with 3 masters for availability.

But again, the elasticsearch metrics PCP collects are not for the host on which they are collected, but for the elasticsearch cluster itself.

The metrics are not really collected locally. While the elasticsearch API in use might hit a "localhost" end-point, elasticsearch in turn will send out a flood of queries to all the hosts to return the information requested. So no network load is really being saved.

An elasticsearch admin would want to monitor the health of the cluster from outside the cluster. That is, we would not want load on one member of the cluster to prevent metrics from being gathered. So while each node of an elasticsearch cluster would have PMDAs for other sub-systems, a "client" node would be set up to participate in the cluster and service the metrics requests, while knowing how to communicate properly with all cluster members.

But burning a whole client node to do that is a bit of a waste, so having an external PMDA target one or more client nodes (often placed behind a load-balancing service like haproxy or nginx) gives us the lowest API load for gathering metrics on the cluster.

If PCP had a way for the PMDA to attribute metrics to something other than a host name, then ideally we'd have an archive per elasticsearch cluster.

We run 3 elasticsearch clusters in our environment: Elasticsearch V1, OpenSearch 1.2.4 (Elasticsearch V7 equiv), and a second OpenSearch 1.2.4 cluster just for the logs from the infrastructure nodes providing the services.

So ideally, I'd have one PMDA which could be configured to gather all the metrics from each cluster.
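As a rough sketch of what "one PMDA, several clusters" could look like as configuration, the Python below keys each target by a cluster name rather than a host name. The `ClusterTarget` type, the cluster names, and the URLs are all hypothetical and only mirror the three clusters described above. Nothing here reflects an existing PCP or role interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ClusterTarget:
    """One monitored cluster (hypothetical structure)."""
    name: str                    # label for the cluster, not a host name
    base_url: str                # client node or load balancer in front of it
    token: Optional[str] = None  # optional bearer token


def health_urls(targets: List[ClusterTarget]) -> Dict[str, str]:
    """Map each cluster name to the health URL a single collector would poll."""
    urls: Dict[str, str] = {}
    for target in targets:
        if target.name in urls:
            # Names identify clusters, so they must be unique.
            raise ValueError(f"duplicate cluster name: {target.name}")
        urls[target.name] = target.base_url.rstrip("/") + "/_cluster/health"
    return urls


# Illustrative targets only; the hosts are made up.
TARGETS = [
    ClusterTarget("es-v1", "https://es-v1.example.com:9200"),
    ClusterTarget("opensearch-main", "https://os-main.example.com:9200/"),
    ClusterTarget("opensearch-infra-logs", "https://os-infra.example.com:9200"),
]
```

Keying by cluster name means each cluster is polled once from outside, instead of once per member node.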

natoscott commented 2 years ago

Makes sense, thanks @portante