olivere / elastic

Deprecated: Use the official Elasticsearch client for Go at https://github.com/elastic/go-elasticsearch
https://olivere.github.io/elastic/
MIT License

AWS endpoint with varying IP address #1091

Open dwickstrom opened 5 years ago

dwickstrom commented 5 years ago

Please use the following questions as a guideline to help me answer your issue/question without further inquiry. Thank you.

Which version of Elastic are you using?

[x] elastic.v6 (for Elasticsearch 6.x)

Please describe the expected behavior

Hello 👋 We're trying to use this library with an AWS cluster of 3 nodes, specifying the endpoint hostname from AWS as a single entry in the hosts key of the library config file. Ideally, the client would detect when the IP address changes, re-resolve the hostname, and retry the request, so that no requests are dropped during the re-provisioning phase.

Please describe the actual behavior

Requests will fail during the provisioning phase and then, in our case after about 15 minutes, the client will heal itself and requests stop failing.

Because AWS does not expose the node IPs on the /_nodes endpoint, these are my thoughts so far:

With sniffing disabled, we see that a single-node connection is never marked as dead (MarkAsDead is skipped), due to https://github.com/olivere/elastic/blob/60d62e5b2d1c728d7cbbeb7ed9a284303ea4acd4/client.go#L1204-L1209

With sniffing enabled it's not going to work either, because sniffing can't be done: AWS only exposes the load balancer IP, so the client won't be able to detect any other nodes: https://github.com/olivere/elastic/blob/60d62e5b2d1c728d7cbbeb7ed9a284303ea4acd4/client.go#L964-L978

Any steps to reproduce the behavior?

  1. Instantiate a new client, setting the AWS endpoint as a single host entry in the config
  2. Trigger cluster re-provisioning in AWS, as described here: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html#es-managedomains-configuration-changes

olivere commented 5 years ago

Hmm... if I understand it correctly, with AWS you should simply disable sniffing and health checks, as AWS does the load-balancing for you, and use the hostname provided by AWS as a single endpoint.
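
In code, that's roughly the following; consider it a minimal sketch (the endpoint URL is a placeholder, and you may need to adjust the import path for the elastic version you use):

package main

import (
    "log"

    "github.com/olivere/elastic"
)

func main() {
    // AWS does the load-balancing behind its endpoint, so keep the client dumb:
    // no sniffing, no health checks, just the single AWS hostname.
    client, err := elastic.NewClient(
        elastic.SetURL("https://my-domain.eu-central-1.es.amazonaws.com"), // placeholder
        elastic.SetSniff(false),
        elastic.SetHealthcheck(false),
    )
    if err != nil {
        log.Fatal(err)
    }
    _ = client
}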

What I don't understand is why the *http.Client won't find the new IP address when it changes. It should simply use the hostname, and the resolver should return the new IP. Unless there's some caching going on, that should simply work... as long as the hostname is the same.

I'm sorry if I misunderstood; I'm not an AWS customer.

dwickstrom commented 5 years ago

Thank you for responding so quickly. Your suggestion to disable both sniffing and healthchecks sounds good and I'm trying it in a moment.

What I don't understand is why the *http.Client won't find the new IP address when it changes. It should simply use the hostname, and the resolver should return the new IP. Unless there's some caching going on, that should simply work... as long as the hostname is the same.

Yes, and even with healthchecks & sniffing disabled I'm guessing that this problem will appear. I'll post my findings here soon.

dwickstrom commented 5 years ago

Alright, thank you! That seems to have solved the problem 🎈 Not sure why though 🤔

I was hoping to learn something from this, so I figure I should give some more context on what I was going through. The way we are set up is that we have a cluster of 3 master nodes that we connect to through a single endpoint, the one provided through the AWS console. Initially I thought of that AWS endpoint address as likely pointing to a load balancer, but after a while I realised that this isn't the case. Instead it cycles randomly, resolving to the IP of any of the nodes.

And so the problem that I have been trying to resolve happens when the cluster is re-provisioned. This is what I think happens during that phase:

  1. The number of nodes is doubled
  2. The data from the first set of nodes is migrated over to the new set of nodes
  3. Once the data is migrated, the endpoint starts resolving to the addresses of the new set of cluster nodes
  4. When the migration is complete, the old nodes are shut down one by one

Here's what I don't understand: after step 3, why would the healthcheck requests start failing when healthchecks are enabled, and yet the normal requests not fail when healthchecks are disabled?

olivere commented 5 years ago

Hmm... let's see.

First of all, the whole idea of sniffing and health checks is only necessary because in the early days of ES, load-balancing was done client-side. If you have a server-side solution, which I think is the right solution, you shouldn't need to do any of these things. Just let the server do the right thing and keep the client dumb.

Now, sniffing is the process of initially and periodically finding the list of nodes in the connected cluster. Let's say you initially have a 1-node cluster and use elastic to connect to that cluster with a URL. Then elastic will use the URL to find all nodes in the cluster (1 node only) via the Cluster State API. It will then throw away the initial URL and use the IPs/hostnames reported by the cluster API. Once in a while, this process is re-executed to find new nodes in the cluster that were eventually added by the admin. So, eventually, elastic will have a full list of IPs/hostnames to connect to and will use them via round-robin. Notice there are a few edge cases, like what to do if we end up with an empty list of nodes for some reason. But let's try to keep it simple.

Health checks serve another purpose. They periodically check the list of nodes and manage the individual state of each node. E.g. if elastic tried to send a request to a node that didn't respond, that node is marked as dead and no longer used. However, that could just be a blip in the network, so the health check runs periodically to eventually mark such nodes as alive again. Again, there are some edge cases.
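
For reference, both of these periodic processes are configurable on the client. A rough sketch of the relevant options (the intervals and the function name are just example values, not recommendations):

import (
    "time"

    "github.com/olivere/elastic"
)

func newSniffingClient(url string) (*elastic.Client, error) {
    return elastic.NewClient(
        elastic.SetURL(url),
        elastic.SetSniff(true),                           // periodic node discovery via the cluster API
        elastic.SetSnifferInterval(15 * time.Minute),     // how often the node list is refreshed
        elastic.SetHealthcheck(true),                     // periodic liveness checks of the known nodes
        elastic.SetHealthcheckInterval(60 * time.Second), // how often the known nodes are pinged
    )
}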

I currently don't see why one would disable sniffing but keep health checks enabled. So maybe they should be disabled as well, automatically, when sniffing is disabled.

dwickstrom commented 5 years ago

I currently don't see why one would disable sniffing but keep health checks enabled. So maybe they should be disabled as well, automatically, when sniffing is disabled.

That sounds reasonable to me.

In any case, it might be a good idea to put this into the AWS section of the wiki: don't use either sniffing or healthchecks.

olivere commented 5 years ago

I changed the docs in the Wiki and advised to disable both sniffing and health checks for AWS Elasticsearch Service.

dwickstrom commented 5 years ago

Great, thanks for helping out 🥇

iandees commented 5 years ago

I'm running into the same problem as David, even with healthcheck and sniff turned off. @dwickstrom do you remember if you changed anything on the underlying HTTP client instance maybe?

dwickstrom commented 5 years ago

Hi @iandees, no, I didn't change anything on the HTTP client. Lately, however, there has been some issue with this again. Back in May, the way I tested it was by toggling some parameter in the cluster settings to trigger a cluster "rollover". Recently, however, when AWS themselves triggered an Elasticsearch upgrade on their side, that "rollover" did not go well: clients were not able to connect without intervention, just like the incidents I had ~6 months ago.

olivere commented 5 years ago

Maybe there's still a problem. Reopening.

olivere commented 5 years ago

There was a change quite recently that addresses an issue on AWS ES, particularly with nodes changing IPs. I don't know if this has anything to do with it. https://github.com/olivere/elastic/pull/1125

g-wilson commented 4 years ago

Hi all, resurrecting this thread to share some more info. We're seeing this issue as well. After some pretty thorough testing I can replicate the issue, and I don't think the issue is with this library.

AWS ES uses DNS based load-balancing to resolve the hostname to the ES nodes, it's not an EC2-style load balancer.

If an HTTP client is used which uses keep-alive connections (http.DefaultClient does by default), and your volume of requests is high enough that the idle timeout is never reached, the connection will not be re-established.

This means that when AWS rotates the nodes and changes the DNS records, an application using this library is none the wiser; it won't do another DNS lookup until the connections are left idle and then terminated.

Eventually this library does recognise that requests are failing and resets everything; however, this causes a fairly significant interruption of service.

This issue is described well here https://github.com/golang/go/issues/23427
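
To illustrate the mechanism (a sketch, not something we actually ship; names and values are made up): the bluntest client-side workaround is to hand elastic an *http.Client whose transport doesn't keep connections alive at all, so every request dials fresh and therefore resolves DNS again, at the cost of a new connection (and TLS handshake) per request.

import (
    "net/http"
    "time"

    "github.com/olivere/elastic"
)

func newAWSClient(endpoint string) (*elastic.Client, error) {
    transport := &http.Transport{
        // No keep-alive pool: each request opens a new connection and
        // re-resolves the AWS hostname, side-stepping the stale-connection
        // problem described above. Heavy-handed and slower, but effective.
        DisableKeepAlives: true,
    }
    return elastic.NewClient(
        elastic.SetURL(endpoint),
        elastic.SetSniff(false),
        elastic.SetHealthcheck(false),
        elastic.SetHttpClient(&http.Client{
            Transport: transport,
            Timeout:   30 * time.Second, // arbitrary overall request timeout
        }),
    )
}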

olivere commented 4 years ago

Thanks for reporting your findings, @g-wilson.

Sovietaced commented 3 years ago

AWS ES uses DNS based load-balancing to resolve the hostname to the ES nodes, it's not an EC2-style load balancer.

In this case it seems like clients would benefit from sniffing.

olivere commented 3 years ago

@Sovietaced I'm not sure that's correct. Sniffing is a process by which the client library asks the ES cluster (not the DNS) for the IP addresses of the nodes, then uses those and watches for changes; that's effectively client-side LB. In the case of DNS-based LB, the ES cluster usually doesn't know about nor update its internal IP addresses. Hence, I think, disabling sniffing and healthchecks is the right way to use Elastic on AWS. Again, I'm not an active user of Elastic on AWS ES.

The problem, though, is that Go itself reuses connections (and thus the resolved IP addresses) for a while, and doesn't resolve DNS for each and every request, hence the reference to https://github.com/golang/go/issues/23427.

Sovietaced commented 3 years ago

@olivere The Java library has the same problem. It resolves an IP address from the ES cluster domain name and caches that data node's IP address indefinitely. What we notice is that if the node behind the cached IP is no longer a data node, our applications are essentially broken (receiving 503s) until we restart them and they get a new IP address from the AWS ES cluster DNS.

This is obviously a pretty terrible user experience that seems ripe for the use of sniffing.

Sovietaced commented 3 years ago

This is obviously a pretty terrible user experience that seems ripe for the use of sniffing.

I ended up testing sniffing with the AWS ES cluster and it appears that the /_nodes/http?pretty=true API does not even include http info about the nodes so sniffing doesn't work.

olivere commented 3 years ago

Interesting. Maybe we should accommodate that and at least log a warning.

I will have to test this out on AWS ES.

wingsofovnia commented 3 years ago

Here is a comparison of how AWS ES response differs from a normal ES deployment: https://github.com/elastic/elasticsearch-js/issues/1178#issuecomment-621918104

One way to mitigate this might be a custom sniffer that does an nslookup instead of GET /_nodes:

root@shell:/# nslookup aws-elasticsearch-domain.eu-central-1a.es.amazonaws.com

Server: a.a.a.a
Address: b.b.b.b#...

Non-authoritative answer:
Name: aws-elasticsearch-domain.eu-central-1a.es.amazonaws.com
Address: x.x.x.x # Node IP 1

Name: aws-elasticsearch-domain.eu-central-1a.es.amazonaws.com
Address: y.y.y.y # Node IP 2
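
In Go, the lookup half of such a sniffer is just a DNS query. A rough sketch (note that the library has no public hook, as far as I can tell, to swap node URLs on a live client, and that hitting nodes by raw IP will likely trip TLS hostname verification, so this only illustrates the lookup itself):

import (
    "fmt"
    "net"
)

// resolveNodeIPs does in code what the nslookup above shows: it returns all
// A-record IPs currently behind the AWS ES domain, formatted as URLs.
func resolveNodeIPs(domain string) ([]string, error) {
    ips, err := net.LookupHost(domain)
    if err != nil {
        return nil, err
    }
    urls := make([]string, 0, len(ips))
    for _, ip := range ips {
        urls = append(urls, fmt.Sprintf("https://%s", ip))
    }
    return urls, nil
}
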
olivere commented 3 years ago

Thanks for the links. Very helpful.

Sovietaced commented 3 years ago

For what it's worth, we ended up writing our own custom sniffer and it appears to work well. I forced a blue/green deployment of an AWS ES cluster and watched the IP addresses flip with no downtime.

I realize this is a Go library but folks may find this generally useful. This is the basic logic for a periodic task that runs in the background. Note: This approach depends on having a DNS cache TTL set.

The following code is in Kotlin:

import java.net.InetAddress
import java.net.UnknownHostException
import org.apache.http.HttpHost
import org.elasticsearch.client.Node

// Runs inside our periodic background task. `host` is the HttpHost built from
// the AWS-provided cluster domain name, `restClient` is the low-level
// Elasticsearch RestClient, and AwsSnifferException is our own exception type.
val addresses: List<InetAddress> = try {
    // host.hostName is the cluster domain name provided by AWS
    InetAddress.getAllByName(host.hostName).asList()
} catch (e: UnknownHostException) {
    throw AwsSnifferException("Failed to resolve addresses for ${host.hostName}", e)
}

logger.debug("Sniffed addresses: $addresses")

if (addresses.isEmpty()) {
    logger.warn("No nodes to set")
} else {
    // Generate new hosts with the resolved address swapped in. Retain port/scheme.
    val nodes = addresses
        .map { HttpHost(it.hostAddress, host.port, host.schemeName) }
        .map { Node(it) }

    logger.debug("Calculated nodes: $nodes")

    // Point the low-level REST client at the freshly resolved set of nodes
    restClient.setNodes(nodes)
}

chrisharrisonkiwi commented 3 years ago

I'm also running into the exact same issue with AWS. Any Elasticsearch modification or automated action that results in the nodes being reassigned seems to trigger the issue for around 15 minutes (with both sniffing and healthchecks turned off).

Is there an easy way with this library to force a reconnection to the cluster, maybe? It might be nice to have a client.Reconnect() option for the event that no nodes are available. I guess I could run client.Stop() and then get a new connection using elastic.NewClient() and see if the new connection has correctly mapped nodes, etc.

-- edit I tried the new client idea and it seemed to work. But it's a bit of a sledgehammer on a nail approach.

g-wilson commented 3 years ago

I'm also running into the exact same issue with AWS.

Instead of doing a full reconnect / new client, you can call the CloseIdleConnections method on the *http.Transport that you pass to the client yourself.

I'm not proud of this, but we do that on a 15 second interval and it works a treat 🤦‍♂️
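
Roughly like this, if anyone wants a starting point (a sketch of the idea; the interval and names are just what we happen to use, and the *http.Transport must be the same one handed to the elastic client via elastic.SetHttpClient):

import (
    "net/http"
    "time"
)

// closeIdleConnectionsLoop drops pooled keep-alive connections every interval,
// forcing subsequent requests to dial again and pick up fresh DNS records.
// Close the done channel to stop the loop.
func closeIdleConnectionsLoop(transport *http.Transport, interval time.Duration, done <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            transport.CloseIdleConnections()
        case <-done:
            return
        }
    }
}

// e.g. go closeIdleConnectionsLoop(transport, 15*time.Second, done)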

chrisharrisonkiwi commented 3 years ago

I'm also running into the exact same issue with AWS.

Instead of doing a full reconnect / new client, you can call the CloseIdleConnections method on the *http.Transport that you pass to the client yourself.

I'm not proud of this, but we do that on a 15 second interval and it works a treat 🤦‍♂️

Yup this works also. A little bit cleaner than the fresh client approach I guess.

olivere commented 3 years ago

I've been looking into this and am experimenting with an additional elastic.SetCloseIdleConnections(true|false) configuration option for elastic.NewClient. When enabled, the PerformRequest method will automatically close idle connections in the underlying HTTP transport whenever it finds a dead node. This should ensure that the client picks up the new IP address whenever the AWS ES cluster reconfigures itself during any of the configuration changes mentioned above.
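
If it lands as currently drafted, usage would look roughly like this (a sketch, same imports as the earlier examples in this thread; treat the option as experimental until it's actually released):

client, err := elastic.NewClient(
    elastic.SetURL("https://my-domain.eu-central-1.es.amazonaws.com"), // placeholder endpoint
    elastic.SetSniff(false),
    elastic.SetHealthcheck(false),
    // Close idle connections in the underlying transport whenever a dead
    // node is detected, so the next request re-resolves the AWS hostname.
    elastic.SetCloseIdleConnections(true),
)
if err != nil {
    log.Fatal(err)
}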

If some of you could look into this and give it a thumbs up, #1507 might land in one of the next releases.