hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Support server-side hierarchical lookups using optional `parent` DC configuration property #1159

Closed bacoboy closed 8 years ago

bacoboy commented 9 years ago

While I know that issue #154 is related and addresses some kind of fallback processing when a service isn't found in the queried datacenter, I'm looking for something different which I don't think has been discussed yet.

In the current implementation, the logic looks something like this:

if service request DC.nil? // DC not specified in query?
  service = service.myDC // If so, append my DC
end
if service.DC found: // Look it up specifically
  return service.DC  // Got one
else
  return NOT FOUND   // Sorry Charlie!
end

The proposed implementation in #154 pushes HOW the lookup should fall back into the query issued (meaning the client would need to know), something like this:

if service request DC.nil? // DC not specified in query
  if service.myDC exists: // See if I have one locally (this part is the same as above)
    return service.myDC
  else
    return NOT FOUND
  end
else
  if service.DC=XXXX matches something: // Client passes specific DC or pattern to match
    return service.XXXXX // Found it
  else
    return NOT FOUND // Sorry charlie!
  end
end

What I am proposing is managing the fallback logic server-side in the consul configuration, by creating a parent relationship between the consul clusters (known via the less chatty WAN) that the client doesn't need to know anything about. It would use the WAN relationship and an optional server-side configuration property called parent. This is the fallback DC to use if there is no local match and no DC was specified in the query. You keep asking up the chain until something is found (or not). In pseudocode:

if service request DC.nil? // DC not specified
  if service.THIS_DC exists // check locally
    return service.THIS_DC  // found locally
  else
    if parent property set?
      return lookup(service.PARENT, WAN)  // ask parent
    else
      return NOT FOUND // current logic for compatibility
    end
  end
else
  if service.DC=XXXX matches something:  // asked for something specific
    return service.XXXXX  // return if I know about it
  else
    return NOT FOUND // you asked for specific and it doesn't exist
  end
end

The lookup(service.PARENT, WAN) call means: do a pass-through call to the configured parent DC. This seems like it would scale much better and would handle geographic configurations such as this:

                       +------------+                                                                   
 +---------------+     |            |  parent=ap-southeast-1                                            
 |Client in China+---> | cn-north-1 |                                                                   
 +---------------+     |            |                                                                   
                       +-------+----+                                +-----------+                      
                               |                    parent=us-east-1 |           |       +-------------+
                               v                                     | eu-west-1 | <-----+Client in LON|
                                                                     |           |       +-------------+
                       +----------------+                            +----+------+                      
+--------------+       |                | parent=us-west-1                |                             
|Client in Asia+-----> | ap-southeast-1 |                                 |                             
+--------------+       |                |                                 |                             
                       +--------+-------+                                 |                             
                                |                                         |                             
                                |                                         |                             
                                v                                         v                             

                         +-----------+                              +-----------+                       
+-------------+          |           | parent=us-east-1             |           |     +-------------+   
|Client in SFO+------->  | us-west-1 |                              | us-east-1 | <---+Client in NYC|   
+-------------+          |           | +--------------------------> |           |     +-------------+   
                         +-----------+                              +-----------+                       

In this example, I'm trying to serve customers in London, China, and the US.
I certainly don't want people in London calling "local" services in Asia; I'd run those in eu-west-1. Chances are there are services I can only run centrally (say a service that backs the central inventory DB -- in this example us-east-1). Let's say that I don't have a license to sell things in China, so I move those services to Asia, but I serve static content out of China so my customers don't have to go through China's firewall processing.

In this way, I use GSLB load balancing to find "close" entry points to the site, but once in the system, if I can't find what I need "close" I keep going UP until I find what I need. If I can't find it ANYWHERE, THEN return a NOT FOUND.

Clearly it is assumed people don't create loops in their configurations...
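
The parent-chain lookup described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not Consul code: the `PARENTS` map mirrors the diagram, while the `CATALOG` contents, the service names, and the `resolve` function are hypothetical. The sketch also guards against a misconfigured loop by remembering which datacenters it has already asked.

```python
from typing import Dict, Optional, Set

# Hypothetical per-datacenter "parent" settings, copied from the diagram above.
PARENTS: Dict[str, str] = {
    "cn-north-1": "ap-southeast-1",
    "ap-southeast-1": "us-west-1",
    "us-west-1": "us-east-1",
    "eu-west-1": "us-east-1",
}

# Hypothetical catalog: datacenter -> services registered there.
CATALOG: Dict[str, Set[str]] = {
    "cn-north-1": {"static-content"},
    "ap-southeast-1": {"storefront"},
    "us-west-1": {"storefront"},
    "eu-west-1": {"storefront"},
    "us-east-1": {"storefront", "inventory"},
}


def resolve(service: str, local_dc: str, dc: Optional[str] = None) -> Optional[str]:
    """Return the datacenter that answers the lookup, or None for NOT FOUND."""
    if dc is not None:
        # The client asked for a specific DC: no fallback, same as today.
        return dc if service in CATALOG.get(dc, set()) else None

    seen: Set[str] = set()
    current: Optional[str] = local_dc
    # Walk up the parent chain until the service is found, the chain ends,
    # or a datacenter repeats (a loop in the configuration).
    while current is not None and current not in seen:
        seen.add(current)
        if service in CATALOG.get(current, set()):
            return current
        current = PARENTS.get(current)
    return None
```

With this data, a client in China asking for `inventory` without a DC would be answered by `us-east-1` after three hops, while the same request with an explicit `dc="cn-north-1"` would return NOT FOUND.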

I believe this could be done quicker than #154 since the API wouldn't change and the semantics are the same if parent isn't configured. It also keeps decision logic off the client since they shouldn't care what the fallback plan should be -- they just want an answer...

Looking back on my notes this is also related to #208 so referencing here for completeness...

Thoughts?

armon commented 9 years ago

@bacoboy This is an interesting idea! We are working on a new feature for Consul that allows for richer lookup logic when resolving services, and this notion of a linear resolution order could definitely fit in there. That mechanism will be more flexible than the meta-DC and much easier than modifying logic in the existing APIs. The basic idea is to create APIs for creating custom DNS endpoints with dynamic behavior, things like this. I'm going to tag this as an enhancement so we keep this use case in mind as we are working on that feature.

bacoboy commented 9 years ago

Yea, depending on who you talk to, some people favor dumb clients (1 request/1 response). Others like clients to decide by altering the query (or making multiple calls). This clearly falls into the first category as it puts the DC fallback logic into server-side configuration easily managed via chef/puppet/etc...

bacoboy commented 8 years ago

Even with the new tomography features in 0.6, I think this feature still has its place where specific fallback is desired. However, I'd be interested in your thoughts @armon...

slackpad commented 8 years ago

@bacoboy did you have a chance to look at prepared queries? You can use network tomography to select the next best N datacenters, or you can give an explicit list of fallback datacenters, or both.

https://www.consul.io/docs/agent/http/query.html

bacoboy commented 8 years ago

Yes, but in this case I'm not looking to create a nearness relationship based on network transport speed, but more of a specific fallback chain. It also removes the logic from the client where it asked for 1 thing and how the fallback occurs isn't a concern of the client. Using the query would put more logic in the client than I want for this specific use case (I'd rather manage the fallback relationship out of band from the client using something like chef, etc).

slackpad commented 8 years ago

Hi @bacoboy if you leave NearestN unset you can give a specific list of fallback datacenters in Datacenters - you aren't required to use nearness at all.

Also, clients don't typically create their own queries. Queries are created once, and then clients just get the id of the query to execute. The fields in the link above are for defining a query, but you don't need anything other than the id to execute it. The client can just look up <id>.query.consul via DNS or make an HTTP request to fetch the results; it won't be exposed to any details of the fallback logic or any other parts of the query. You can alter existing queries on the fly to change the behavior without any changes to your clients, or you could register new queries and give clients the new query id.

armon commented 8 years ago

@bacoboy Our goal is that prepared queries would be the solution to this, in a more generic way. As @slackpad said, you can use the tomography for a "zero touch" failover configuration, but you can also specify the specific fallback order if you care to.

bacoboy commented 8 years ago

I agree that the queries allow control from the client, but in cases where I don't want the client to know anything other than the local consul agent (because updating a zillion client configurations would be bad), the crux of this request is to move this fallback logic to the consul agent configuration.

Yes, tomography allows server-side fallback if you want "closeness" to be your fallback mechanism, but that's not what I want. I want a server-side way of saying which way to go if not found. @armon you said:

but you can also specify the specific fallback order if you care to

Are you referring to the client side query again or is there a configuration server-side I'm unaware of to specify fallback? I looked again, but didn't see anything.

slackpad commented 8 years ago

Hi @bacoboy I think you might be misunderstanding how prepared queries work. You define the query one time and it's stored on the servers. Clients just execute the query by name so they don't know anything about how the query is defined. It works like this:

  1. You register a new query with the servers by calling https://www.consul.io/docs/agent/http/query.html#general - note that you can define a list of datacenters in the Failover section.
  2. You give your query a name, or use the ID that's returned when you complete step 1.
  3. All you share with your clients is the name or ID of the query. They then look up <id>.query.consul using DNS, or they execute the query using https://www.consul.io/docs/agent/http/query.html#execute over HTTP.

The clients don't have any idea what the query does or how it's configured (remember they don't have to post any of the information you gave the servers in the first step, they just use the ID you gave them). If you change the query's setup later using https://www.consul.io/docs/agent/http/query.html#specific then any client that executes again with that ID will get the new configuration. You don't have to update the clients at all.

Please let me know if this helps, or if you have any more questions about how these work. If I understand what you are looking for, I think it sounds really close.
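
As a rough sketch of the three steps above: the Python below builds a query definition with an explicit failover list, the request that would register it, and the URL a client would later hit to execute it. It assumes a local agent at `http://127.0.0.1:8500` and the `/v1/query` and `/v1/query/<name or id>/execute` HTTP endpoints; the helper names (`build_query`, `register_request`, `execute_url`) are made up for illustration, and nothing is actually sent over the network.

```python
import json
from typing import Dict, List
from urllib.request import Request

# Assumption: a Consul agent listening locally on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def build_query(name: str, service: str, fallback_dcs: List[str]) -> Dict:
    """Step 1: the definition stored on the servers, with an explicit failover order."""
    return {
        "Name": name,
        "Service": {
            "Service": service,
            "Failover": {"Datacenters": list(fallback_dcs)},
        },
    }


def register_request(definition: Dict) -> Request:
    """Build (but do not send) the POST that would register the query."""
    return Request(
        f"{CONSUL_ADDR}/v1/query",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def execute_url(name_or_id: str) -> str:
    """Step 3: all a client needs is the name or ID, nothing about the failover rules."""
    return f"{CONSUL_ADDR}/v1/query/{name_or_id}/execute"
```

Passing the result of `register_request(...)` to `urllib.request.urlopen` would perform the registration. The point of the sketch is that the failover list lives only in the definition, while the execute URL carries just the name or ID.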

bacoboy commented 8 years ago

You are correct, this is functionally equivalent, but unless you can wildcard the service name (it doesn't appear to be a regex -- and a regex would be expensive I'm sure), if I have 1000 services, I have to inject 1000 nearly identical queries AND have the additional overhead of doing this for new services being added and cleaning up unused queries for decommissioned services.

If you look back at my original suggestion, it is a simple extension for default behavior similar to when you don't specify a dc in the query. Prepared queries are fine grained service level fallback rules -- which you can implement for my use case if I want to connect every single dot. My proposal is for datacenter level fallback rules. Again looking at the current functionality:

if service request DC.nil? // DC not specified in query?
  service = service.myDC // If so, append my DC
end
if service.DC found: // Look it up specifically
  return service.DC  // Got one
else
  return NOT FOUND   // Sorry Charlie!
end

And the addition of an optional parent configuration property:

if service request DC.nil? // DC not specified
  if service.THIS_DC exists // check locally
    return service.THIS_DC  // found locally
  else
    if parent property set?
      return lookup(service.PARENT, WAN)  // ask parent
    else
      return NOT FOUND // current logic for compatibility
    end
  end
else
  if service.DC=XXXX matches something:  // asked for something specific
    return service.XXXXX  // return if I know about it
  else
    return NOT FOUND // you asked for specific and it doesn't exist
  end
end

Here I set 1 rule, 1 time per DC. I'd still be able to use prepared queries if I need finer control on a per-service level.

slackpad commented 8 years ago

@bacoboy ok I understand the difference for the case where there are many, many services and you have good parity between DCs and it makes sense to fallback queries for any service.

slackpad commented 8 years ago

We've got a new extension to prepared queries landing soon that will allow this type of behavior in the form of prepared query templates - https://github.com/hashicorp/consul/pull/1764. You'll be able to define a template prepared query that matches multiple (and potentially all) services within a datacenter and lets you apply prepared query logic to them.

Here's an example query that you could register in cn-north-1 to get the fallback ordering as shown in the diagram above. Note that the Name prefix is empty, so it'll match any service queried:

{
  "Name": "",
  "Template": {
    "Type": "name_prefix_match"
  },
  "Service": {
    "Service": "${name.full}",
    "Failover": {
      "Datacenters": ["ap-southeast-1", "us-west-1", "us-east-1"]
    }
  }
}

Once this was configured in cn-north-1, then looking up *.query.consul would try to resolve locally first and then fall back to the listed datacenters. See the PR for more details.
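
To make the template behavior concrete, here is a small Python simulation of how an empty-prefix `name_prefix_match` template could expand for a queried name. It is a sketch of the idea only, not Consul's implementation: `render` and `lookup_order` are hypothetical helpers, and the only interpolation handled is `${name.full}`.

```python
from typing import Dict, List, Optional

# The template from the example above, as a Python dict.
TEMPLATE: Dict = {
    "Name": "",
    "Template": {"Type": "name_prefix_match"},
    "Service": {
        "Service": "${name.full}",
        "Failover": {"Datacenters": ["ap-southeast-1", "us-west-1", "us-east-1"]},
    },
}


def render(template: Dict, queried_name: str) -> Optional[Dict]:
    """Expand the template for queried_name, or return None if the prefix does not match."""
    if not queried_name.startswith(template["Name"]):
        return None
    service = template["Service"]
    return {
        # ${name.full} stands for the full name the client asked for.
        "Service": service["Service"].replace("${name.full}", queried_name),
        "Datacenters": list(service["Failover"]["Datacenters"]),
    }


def lookup_order(template: Dict, queried_name: str, local_dc: str) -> List[str]:
    """Datacenters tried in order: the local one first, then the failover list."""
    rendered = render(template, queried_name)
    if rendered is None:
        return []
    return [local_dc] + rendered["Datacenters"]
```

Because the prefix is the empty string, every name matches, so a single template registered in cn-north-1 covers all services with the same fallback chain.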

slackpad commented 8 years ago

Closing this out - prepared query templates allow you to prefix match service names (down to an empty prefix that matches any service with a single query). This shipped in 0.6.4 so I think we are good here. Please let me know if you have any questions.