demmer commented 6 years ago

Background and Motivation

Currently vitess components in a topology are allocated into one or more cells, which typically correspond to a data center in a physical deployment. For a given shard, it is possible to have replica tablets in multiple cells, while the master tablet is in any one of the cells. The vtgate query routing logic is also aware of this cell topology and will route replica queries to a tablet in the same cell as the vtgate, and by design will only cross cells when routing to a master.

However in many real world scenarios there are topological considerations below the level of a data center. For example, in AWS EC2, hosts are provisioned in a "region" (i.e. data center) and an "availability zone" (i.e. a building/rack within the given data center). As such, there are demonstrable differences in performance, latencies, cost, and fault tolerance between connections that stay inside the same availability zone versus those that cross the availability zone boundary. Other cloud deployments and physical topologies likely share the same characteristics in which the network interconnect between hosts in the same rack / building / cluster can be more performant than between other pairs of hosts.

Proposal

(Updated as per discussion below)

~The proposal is to augment the vttablet and vtgate configurations to optionally include a "sub-cell" identifier alongside the existing cell identifier. The sub-cell of each vttablet would be communicated to the vtgates via the normal healthcheck protocol. Then, when selecting a replica for routing queries, the vtgate would prefer a vttablet in its same sub-cell as opposed to round robin among all the tablets in the cell. This way the vitess topology would be made aware of the performance considerations in the real world deployment and could more predictably exploit the connections inside the same AZ.~

The proposal is to augment the CellInfo configuration with an optional Region. Then each vtgate is configured with cells_to_watch that include all cells in the local region. The routing logic is then modified to enable cross-cell routing for reads to tablets in other regions.

Because it may not be practical to always have a replica tablet in a given ~sub-cell~ cell, it is important that this logic be implemented as a routing preference and not a constraint. In other words, if there is no replica in the same ~sub-cell~ cell as a vtgate, the vtgate should route it to any other available replica in the ~cell as it does today~ region, and should only fail the read if there are no healthy shards in the same region.

michael-berlin commented 6 years ago

In case of EC2 you should treat the "availability zone" as Vitess cell and not the "region". (This is also how we would use Vitess cells internally i.e. we would run multiple Vitess cells in the same region.)

Would this avoid introducing this "sub-cell" layer?

demmer commented 6 years ago

@michael-berlin We discussed that idea internally -- because replica queries do not cross cells (by design), it requires us to have an available replica in every availability zone. That's why the sub-cell would be a preference and not a constraint.

alainjobart commented 6 years ago

We could also go the other way, and add the concept of 'region', group of cells, and if a cell doesn't have any good replica, we go to the region's replicas.

Note this would help us internally a lot more to look at it this way, and would be easier for us to roll out internally...

On Mon, Nov 6, 2017 at 9:23 AM, Michael Demmer notifications@github.com wrote:

@michael-berlin https://github.com/michael-berlin We discussed that idea internally -- because replica queries do not cross cells (by design), it requires us to have an available replica in every availability zone. That's why the sub-cell would be a preference and not a constraint.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/youtube/vitess/issues/3355#issuecomment-342221478, or mute the thread https://github.com/notifications/unsubscribe-auth/AEaVMkT7EDJO2O6BZ-o35ikPkwUPViQOks5sz0B6gaJpZM4QTk6A .

demmer commented 6 years ago

@alainjobart That could work as well and I can see some logic to that. but at first glance that seems like a more intrusive and potentially error prone approach to me for our deployment.

From what I can tell, to do this with the region approach would require something like:

Add the new Region configuration option to all vtgates / vttablets and restart everything with the new setting. Since all components are still in the same cell this should be no change to the behavior.
Create N new cells, one for each AWS availability zone (or rack or whatever).
Reconfigure each of the vtgates with the cell now set to the current AZ. Update the -cells_to_watch setting to include the original cell that still has all the tablets as well as the N new cells. At this point there are no tablets in the same cell as the vtgate, so it would always bubble up to the region.
Reprovision each tablet to put it into the right cell. Ideally this would be just a configuration change and a restart, but looking through the code I don't have 100% confidence that things will "just work" if the same tablet comes up again in a new cell.

In contrast, if we went with the sub-cell approach, then I believe all that's required is to restart all the vtgate / vttablet components one by one with the added sub-cell configuration. Since the point of the new option is just to optimize the replica reads, there is much less risk of breaking something since even if something is misconfigured during the rollout, the worst that can happen is that we'd have sub-optimal read performance, while there is more orchestration needed (including a full redeployment of all tablets) in the region approach.

Can you confirm my understanding is correct here?

alainjobart commented 6 years ago

Let me add a few things here:

I think we want to find the best target architecture, the migration strategy should not be the deciding factor here (it's important, and if there is no possible migration, maybe we're doing something wrong! but I don't think that's the case)
The 'cell' concept is also used for topology server: we support one topology server per cell now (along with the global one). What makes sense in this cell/sub-cell or region/cell world? Inside Google, we'd need it at the lowest level. So if we went with region/cell, and kept the topology service with the cell, that would work best for us. I also think it's the most flexible, as multiple locations topology services can co-exist in one topo server, it's possible to put all sub-cell topo services in one topo service. So all in all, I think for this, keeping the topology at the lowest level works best, since then we can support both a topo service per cell, or per region.
do we want to keep one of region or cell optional? I don't think so. If distinction is unused, they should both be set to the same value, probably.
(using region/cell concepts): should the cell names be unique across all regions? Right now the cell name is used in tablet alias, so it has to be. Also used for topology service configuration, again has to be unique. So I think so.

So my vote would be for region (higher level than cell). I agree with your assessment on the transition, it's complicated, because your tables are now per-region, they would need to move to per-cell. but your plan sounds good though.

What do you think?

@sougou @michael-berlin any opinion?

bbeaudreault commented 6 years ago

I think HubSpot would also be interested in this. cc @lumost @acharis

We also run our cells cross-AZ currently, which is non-ideal for read replicas. I don't think having a cell per AZ makes sense for us either, similarly to @demmer. I also agree it would be more work to do the region change, but it does sound cleaner TBH.

demmer commented 6 years ago

@alainjobart Thanks for the thorough comments.

You make a very good point -- we should err towards the side of the best final architecture as long as there is some viable migration path.
Speaking on behalf of Slack, we would not want to run a separate topology service at the lowest level. We in fact currently run a single consul topology for both the vitess global and local spheres (though obviously with different paths in consul). I didn't fully consider this in the migration plan above -- it seems like this would also require setting up a topo service in each local cell (or sharing the installation as we do).
My original thought was that cell would be required and sub-cell would be optional. This was partially driven off easy of migration for existing deployments, since no-one would be forced to change their configurations when the feature was introduced. If we go with your proposal for Region and Cell then I agree they should be required but of course allow them to be set to the same thing.
My instinct is yes that all names should be unique. In fact it seems like the names would have to be unique in order to allow deployments (like ours) to share the same topo service in the various cells. I see no clear benefit to allow for non-unique cell names and I think it would just make things more confusing.

With all this in mind, my vote is still for Cell / Sub-Cell as originally proposed.

In closing, I think that Sub-Cell is a terrible name, and something like Zone would be better.

alainjobart commented 6 years ago

One comment first (and using region/cell naming): if the topo server is per cell, it's trivial to share one in all the cells of a region (and in fact, the paths used in the topo service usually already include the cell). But if the topo server is per region, it's impossible to have a per-cell one if needed. So to keep the flexibility, we'd want the topo service per cell, I think.

Sugu and I talked a bit more about this, and explored a possible solution: add an optional 'Region' string field in the CellInfo record. Then keep vtgate the way it is, with 'cells to watch'. Then when vtgate receives a health check from a tablet (and it's not from the same cell as vtgate), it looks up the CellInfo record, and sees which region it's in (with a cache). If in the same region as the vtgate cell, it uses the tablet as a backup if no local tablets are found. We think it's fairly straightforward to configure (at first, it's not configured, but when you want to use the feature, you just add the 'region' settings, and it works).

@demmer with that proposed solution, since we change the cell definition under you, you would need the migration you mention. :(

demmer commented 6 years ago

But if the topo server is per region, it's impossible to have a per-cell one if needed. So to keep the flexibility, we'd want the topo service per cell, I think That's certainly true, and as you allude it probably tips the overall best solution more towards Region/Cell as opposed to Cell/Zone.

Sugu and I talked a bit more about this, and explored a possible solution: add an optional 'Region' string field in the CellInfo record. That seems like a good plan to me.

@demmer with that proposed solution, since we change the cell definition under you, you would need the migration you mention. :( Well, as they say... no good deed goes unpunished.

When all is said and done I agree that Region/Cell is probably the better approach for the project as a whole, even though Cell/Zone is more immediately convenient for Slack's deployment.

Another idea to throw out there would be to add both the Region and the Zone concepts to the system. Admittedly that is a lot of layers of topology configuration, but besides the immediately applicable convenience benefits for Slack's deployment, I would imagine that there might be topological organization inside the same data center (e.g. the physical rack) where it would make sense to have information exposed to the vitess components, but where you wouldn't want to go through the trouble of setting up a topo service.

derekperkins commented 6 years ago

👍 for Region + Zone. I realize that the original choice of "Cell" was probably to keep things agnostic, but the reality is that most users of Vitess are going to be running in the cloud. Having familiar terms is going to set up users to succeed without having to drill down into the meaning of "Cell", especially if Vitess can set the right defaults for choosing the best read replicas to hit.

demmer commented 6 years ago

@derekperkins For the record I agree with you that Cell is confusing concept, but my last comment was actually that vitess would have all three constructs:

Region (encompasses one or more Cells)
Cell (what it is today, one or more Zones)
Zone (the smallest unit)

derekperkins commented 6 years ago

If Vitess was going to use all three, I would flip the last two. Region and Zone have specific meanings, so that third designation that could refer to a datacenter rack would make more sense IMO as a cell. Having a cell be some designation between a region and a zone would be confusing.

I'm not trying to overly complicate things, but in the spirit of @alainjobart's "I think we want to find the best target architecture," making Vitess concepts easier to comprehend will go a long way towards driving adoption.

demmer commented 6 years ago

Given how widespread the Cell terminology and concept is in the code, it seems quite complicated to remove it from the lexicon (and codebase) altogether.

So, in the spirit of moving this forward, I retract my earlier opinion and agree to drop the Zone / Sub-Cell idea altogether and instead add Region as a construct that represents one or more cells, as suggested by @alainjobart.

This would mean that:

Cell remains as the smallest unit of organization.
Region is an optional additional layer of grouping for related cells.
Each cell has either a fully distinct local topo service or a unique path hierarchy in a shared topo service.
vtgate would be configured with -cells_to_watch to include all cells in the topology (as it would in a multi-region setup)
vtgate's routing logic is modified so that replica reads prefer the local cell but are allowed to cross to other cells in the same region.

Open questions / design decisions:

How should the region be configured / stored / communicated?

My original thought was that we would configure each vtgate / vttablet with a region as part of the command line flags. Currently vtgate is configured with a -cell so I presume we would add a required -region parameter for this change. From what I can tell, vttablet gets its cell configuration via the -tablet_path argument which is required to be of the form cell-id. For consistency if we go this route, I'd say we should a required -region flag instead of overloading the tablet path argument. This would also require including the region in the health check protocol so the vtgates would know the region of each tablet.

The proposal that @alainjobart mentioned above would be to add it to be an optional element of the CellInfo, which would require enhancing the AddCellInfo / UpdateCellInfo vtctl methods accordingly to enable the region to be set. One challenge here is that there doesn't seem to be any support for watching the cell info and there is a comment that explicitly states the cell info isn't designed to be changed at runtime. I think that means that if a change is made, all the vtgates would need to be restarted to pick it up.

Of the two I think putting it in the CellInfo makes more sense and is less disruptive.

Does the routing change need to be controlled by a flag (e.g. something like -allow_cross_cell_same_region_replica_reads`. My intuition is no -- we make this the new default behavior and users can control whether or not they want this based on how they lay out the components into cells / regions.

sjmudd commented 6 years ago

Just a comment from my side. I currently have "pools" of db servers to allow me to isolate traffic between apps and their backend database servers. Normally a backend database is in a single pool but in some cases it may be in more than one pool. I also use "data centres" or "zones" too so am not quite sure if the naming is confusing me or you are looking for something different.

I guess the ideal situation is that apps talk via vtgate to a backend server "as close as possible" but in practice I also have situations where the backend servers in some systems are not located even in the same datacentre and as suggested by @demmer I also don't want to always have to deploy backend database servers just to suit the app.

Latency between one area and another may not be the issue sometimes and inevitably when talking to a master in a multi "zone" db setup the latency to the master is given by it's location: it can't be close to all app servers.

So I would favour vtgate (as this should be close to the app) being able to "route" to some keyspaces which may not be in the same zone/cell etc, and possibly the appropriate weighting to determine that preference need to be configurable as they may change over time or for maintenance you want to move to a backend different location while leaving the apps running happily.

Even in a single cell I currently have different "pools" and this allows me to optionally isolate traffic from one app from traffic from another one. Sougou has said that this concern over isolating apps is less necessary in Vitess due to the way it works and the protection it provides but I do see an advantage to having this isolation. (even if it may not be necessary).

I'm currently using the MySQL interface so probably to support multiple pools I'd need to have more than one vtgate running or maybe allow vtgate to be able to listen for connections on different ports so that it can "identify" the intended target based on the keyspace etc. Having each vtgate configured separately may be easier even if it requires more proceses to be configured.

So I'm not sure if they comments help in any way: clearly there seems to be a desire to support a more flexible approach which has worked well for YouTube for some time if that's possible. I'm not far enough in the evaulation to be sure if this is needed yet but will follow this issue carefully.

demmer commented 6 years ago

@alainjobart / @sougou We've started working on this implementation as part of Slack's fork.

@inexplicable will be taking care of the implementation using the approach of adding region to CellInfo -- I tried "assigning" this issue to him but the autocomplete doesn't work. Perhaps that's because he's not a collaborator on the github project?

Also when you get a chance, please comment on my open question 2 above -- whether or not we need a flag to control the routing behavior.

sougou commented 6 years ago

A quick update on this issue based on discussions:

No additional flag should be needed to preserve old behavior. The absence of a specified region for cells should result in old behavior.
This feature interacts with the algorithm that chooses non-lagged replicas. A hard-coded behavior should be chosen. for now In the future, all this must be factored into a user-specifiable policy file that will allow finer control on this behavior.

inexplicable commented 6 years ago

in progress now: https://github.com/youtube/vitess/pull/3460

alainjobart commented 6 years ago

@inexplicable pointed out some corner case behavior: if the current vtgate cell has no tablet for a keyspace/shard, there will be no SrvKeyspace for that cell, and we will fail even before we try finding the tablets.

Which got me thinking of a different error case: if the destination cell has a different SrvKeyspace that the vtgate cell (because of an on-going horizontal resharding), we might not only use the wrong shards, but also hit unavailable tablets.

So all in all I'm thinking we're missing one piece of the puzzle here, is that we need to re-resolve keyspace_id to shard if we need to hit a different cell. That resolving happens at a higher level in vtgate. That's where we use the SvrKeyspace to see the shard map.

One way to solve this would be to, instead of returning a tablet in a different cell, returning a specific error. Then have the higher level resolver in vtgate find another cell in the same region, and use that to resolve (reading that SrvKeyspace) and try to route to it. This probably requires a bit more detailed design than this, but I think this could work.

@sougou what do you think?

alainjobart commented 6 years ago

In more details, the code here: https://github.com/youtube/vitess/blob/master/go/vt/vtgate/resolver.go#L135 already does a retry, if the error returned by the underlying level is retryable. We could leverage that, and also keep track of which cells work or don't work. So when we re-resolve, we could re-resolve using a blacklist of cells that don't have tablets. The flow would look like:

try the regular cell
no tablets -> return a specific error
higher level catches that error, backlists the current cell
try again with a blacklist of cells that contains the current cell. Need to re-resolve in the possible destination cell.
try to find tablets again.

sougou commented 6 years ago

I had a chat with @alainjobart about this. We agree that the short term fix will be to retry at a higher level.

In reality, this feature request has stretched out the current design beyond its limit. So, we need a new design/model. What we brainstormed was along these lines:

Mapping keyspace ids to shards is not a meaningful function any more. This function should be evolved to map keyspace ids to cell+shards, and this function should take into account the health of the vttablets underneath.
The function also has to take into account any resharding in progress and correspondingly find the correct serving types.
The design has to work correctly for l2 vtgates. Although this is a currently unused server, it's an important architectural component to guarantee our ability to scale indefinitely to "thousands" of shards.
The design has to allow us to define and implement meaningful load balancing policies.

@alainjobart has agreed to write up a design doc for this.

alainjobart commented 6 years ago

This PR has the design doc: https://github.com/youtube/vitess/pull/3496

Note it's more of a description of what we do now. It also happens to have a couple proposed changes that would make things so much cooler.

rafael commented 5 years ago

This is done now. Going to close this issue.

vitessio / vitess

proposal: cell/region topology configuration to improve replica query performance #3355

Background and Motivation

Proposal