github / glb-director

GitHub Load Balancer Director and supporting tooling.

Packets sent to primary when both primary and secondary destinations are unhealthy #77

Open brian-pane opened 5 years ago

brian-pane commented 5 years ago

While doing some testing, @ravisinghsfbay and @robloxrob and I ran into a case where glb-director forwards ingress packets to a known unhealthy destination. We're interested in submitting a patch to fix this case, but first I'm hoping to get a quick confirmation that the root-cause analysis below is correct.

The test setup was:

I verified that glb-healthcheck detected the failed healthchecks and generated a forwarding_table.checked.bin that properly identified the affected servers as state=1 and health=0.

From a quick read through the code, I think the problem is in the construction of the forwarding table in glb-director/cli/main.c. There's code that checks for the case where the primary is unhealthy and the secondary is healthy, but if both the primary and the secondary are unhealthy, that entry in the forwarding table ends up with two unhealthy destinations. In the data plane, if an incoming packet hashes to that entry, glb-director sends it to the (unhealthy) primary with a GUE wrapper that lists the (unhealthy) secondary.
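
For illustration, here's a rough sketch of the selection behaviour I'm describing (the entry layout and the server_health() helper are stand-ins, not the actual cli/main.c code):

#include <stdint.h>

int server_health(uint32_t server_ip);  /* hypothetical healthcheck lookup */

/* Hypothetical per-entry health handling, approximating the behaviour
 * described above. */
struct table_entry {
    uint32_t primary;
    uint32_t secondary;
};

static void apply_health(struct table_entry *e)
{
    int primary_ok = server_health(e->primary);
    int secondary_ok = server_health(e->secondary);

    if (!primary_ok && secondary_ok) {
        /* swap so the healthy server is tried first */
        uint32_t tmp = e->primary;
        e->primary = e->secondary;
        e->secondary = tmp;
    }
    /* if both are unhealthy, nothing changes: the packet still goes to the
     * unhealthy primary with the unhealthy secondary as the only fallback. */
}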

Assuming that analysis is correct, the first idea that springs to mind is to increase the number of destination IPs stored in each routing table slot from 2 to a configurable N. That wouldn't prevent the problem altogether, but it would reduce its probability as N increases.
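
To make that concrete, one possible shape for the change (purely illustrative, with hypothetical names) would replace the fixed primary/secondary pair with an ordered array of candidates:

#include <stdint.h>

/* Hypothetical N-way entry; GLB_MAX_HOPS would become a build-time or
 * config-file setting rather than the fixed pair used today. */
#define GLB_MAX_HOPS 4

struct table_entry_n {
    uint32_t num_hops;            /* how many slots are populated */
    uint32_t hops[GLB_MAX_HOPS];  /* candidate servers, ordered by preference */
};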

theojulienne commented 5 years ago

@brian-pane :wave: Thanks for writing this issue! Yes, your understanding here is exactly correct. We discussed a bit of the design around this in our blog post announcing the director - we've historically found that multiple-machine failure was rare enough that having 2 servers per hash entry was enough (though obviously more would be better). Indeed, we always send packets to the "most healthy" of the 2 servers in the hope that one succeeds, but in the case where both are down the primary is selected and that may not work.

We initially did all this with a pattern that was less extensible and didn't support more than 2 servers, but by the open source release it got to a point where everything except the forwarding table supports N servers rather than just 2. The only limitation to N is really the size allowable for the encapsulation GUE packet - on our networks we run with jumboframes but internet traffic is <1500, so there's plenty of room for more. Ideally in this case, the list of servers would be ordered, then all unhealthy servers would be moved to the end of the list, with the highest priority healthy server getting first dibs, followed by any other healthy servers, finally followed by unhealthy servers in order (in the hope they might actually still be working and complete connections).
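
As a sketch of that ordering (hypothetical names, not the real code), the reorder is just a stable partition: healthy servers keep their priority order at the front, unhealthy ones move to the back:

#include <stdint.h>

#define GLB_MAX_HOPS 8  /* illustrative upper bound on hops per entry */

int server_health(uint32_t server_ip);  /* stand-in for the healthcheck lookup */

/* Stable partition: healthy hops first (in priority order), then unhealthy
 * hops (also in order), so the least likely to work are tried last. */
static void order_hops(uint32_t *hops, uint32_t num_hops)
{
    uint32_t ordered[GLB_MAX_HOPS];
    uint32_t n = 0;

    for (uint32_t i = 0; i < num_hops; i++)
        if (server_health(hops[i]))
            ordered[n++] = hops[i];

    for (uint32_t i = 0; i < num_hops; i++)
        if (!server_health(hops[i]))
            ordered[n++] = hops[i];

    for (uint32_t i = 0; i < num_hops; i++)
        hops[i] = ordered[i];
}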

We've also talked about adding failure domain tags to servers so we can avoid 2 machines in the same row sharing a failure domain (for example, the same rack). This makes such a correlated failure less likely, but doesn't change N. It adds some complexity around how fill/drain works and makes the weights per server a little less random, though, so we haven't implemented a fully complete version yet.
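
Very roughly (field names hypothetical, just to show the idea), row construction would pick a secondary whose failure-domain tag differs from the primary's whenever possible:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical server record with a failure-domain tag (e.g. rack id). */
struct server {
    uint32_t ip;
    uint32_t failure_domain;
};

/* Prefer a secondary in a different failure domain than the primary;
 * fall back to any other server if none exists. `primary` is assumed to
 * point into `pool`. */
static const struct server *pick_secondary(const struct server *primary,
                                           const struct server *pool, size_t n,
                                           size_t start)
{
    for (size_t i = 0; i < n; i++) {
        const struct server *cand = &pool[(start + i) % n];
        if (cand != primary && cand->failure_domain != primary->failure_domain)
            return cand;
    }
    for (size_t i = 0; i < n; i++) {
        const struct server *cand = &pool[(start + i) % n];
        if (cand != primary)
            return cand;
    }
    return NULL;
}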

I'd be :100: to patches supporting changing the size of the rows in the forwarding table, as long as it's configurable and there's generally a migration path (maybe versioning the forwarding table file or similar).
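
For example (purely illustrative, not the current on-disk layout), a file header with an explicit version and hop-count field would let old and new directors agree on how to parse the table:

#include <stdint.h>

/* Illustrative versioned header: readers check `version` and parse either
 * the fixed 2-hop entry layout or the variable `max_hops` layout. */
struct fwd_table_file_header {
    uint8_t  magic[4];    /* identifies the file type */
    uint32_t version;     /* bumped when the entry layout changes */
    uint32_t max_hops;    /* hop slots per entry in the newer layout */
    uint32_t num_tables;
};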

pavantc commented 4 years ago

@theojulienne wondering if it would be simpler if a row had a status and, in the case of an entire row being dead, we bumped the connection to the next available row (defining 'next row' as a modulo-max-rows operation so that it's circular)? I.e., have a static rule of originally-indexed-row + 1 when a row has failed?

I see that reviving the dead nodes/dead row will be trickier with the above though.

theojulienne commented 4 years ago

I see that reviving the dead nodes/dead row will be trickier with the above though.

Yea, that's the problem - if you send it to a wider range of nodes without always falling back to them or having a quiesce period (which adds more complexity), you end up causing all connections to disconnect when machines become healthy again. That's why increasing the table size to N entries and always sending packets with those N alternatives (ordered based on chance of success) is probably the simplest.

pavantc commented 4 years ago

@theojulienne It would still be an availability question. It's up for debate whether to solve this problem at the cost of extra complexity or to return an error (5XX) to the user; it depends on how often we face this. An evaluation of the extra cost is also required.

The following is a possibility if we are already able to determine when we are starting new connections AND when we are terminating a connection at the director:

Let's say we had this:

struct glb_fwd_config_content_table_entry {
    struct {
        // static
        uint32_t primary;
        uint32_t secondary;
    };
    struct {
        // dynamic
        uint32_t state; // valid states: AVAILABLE, DEGRADED, UNAVAILABLE, READY
        uint32_t connection_count;
    };
} __attribute__((__packed__));

Then,

struct glb_fwd_config_content_table_entry *table;
uint32_t row_index = hash(s_ip, s_port, d_ip, d_port) & GLB_FMT_TABLE_HASHMASK;
if (table[row_index].state == UNAVAILABLE) {
    /* entire row is dead: spill over to the next row (circular) */
    row_index = (row_index + 1) % table_size;
}

/* On new connection */
table[row_index].connection_count++; // atomic

/* On connection termination */
table[row_index].connection_count--; // atomic
if (table[row_index].connection_count == 0) {
    // Good time to promote the adjacent row from READY to AVAILABLE.
    // NOTE: Since the director is threaded, there is a little more to this,
    // but not delving into it yet. We could dive deep if you like the idea.
    uint32_t promote_index = (row_index > 0 ? row_index - 1 : table_size - 1);
    if (table[promote_index].state == READY) {
        table[promote_index].state = AVAILABLE;
    }
}

The movement from UNAVAILABLE to READY would likely be handled by a thread within the director interfacing with the healthcheck. We can discuss the state machine if you think this is interesting. I am hoping this doesn't need locking and can be managed through atomics. There is still a window of failure between a health state change and the corresponding update to the table, bounded by the polling interval of the healthcheck thread.
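
As a minimal sketch of the kind of atomic transition I have in mind (C11 atomics, names hypothetical):

#include <stdatomic.h>
#include <stdint.h>

enum row_state { AVAILABLE, DEGRADED, UNAVAILABLE, READY };

/* The healthcheck-facing thread moves a row from UNAVAILABLE to READY only
 * if no other thread changed the state in the meantime; no lock required.
 * Returns 1 on success, 0 if the state was no longer UNAVAILABLE. */
static int mark_row_ready(_Atomic uint32_t *state)
{
    uint32_t expected = UNAVAILABLE;
    return atomic_compare_exchange_strong(state, &expected, READY);
}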

theojulienne commented 4 years ago

The following is a possibility if we are already able to determine when we are starting new connections AND when we are terminating a connection at the director:

The glb-director is stateless: it doesn't know when connections are created or destroyed and doesn't maintain any TCP connection state, which is part of the design allowing BGP route changes to direct connections to different glb-director nodes with zero impact. This is discussed a bit in "why not LVS" and "how no state is stored on director".

I think adding extra servers to the forwarding table rows is a significantly simpler change, in that it only adds extra potential hops and doesn't change any of the design constraints of the system, and it buys all the benefits you'd potentially get from un-isolating rows (essentially, adding more potential servers to work around failures).
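
As a sketch of what I mean by "extra potential hops" (types are stand-ins, not the real GUE header layout): the data plane just takes the ordered list for the row, uses the first hop as the tunnel destination, and carries the rest as fallbacks:

#include <stdint.h>
#include <string.h>

#define GLB_MAX_HOPS 4  /* illustrative */

/* Stand-in for the encapsulation metadata carried with each packet. */
struct encap_hops {
    uint32_t next_hop;                     /* most-preferred (ideally healthy) server */
    uint32_t num_alternates;
    uint32_t alternates[GLB_MAX_HOPS - 1]; /* remaining hops, in order */
};

/* Assumes num_hops >= 1 and the hops array is already ordered by preference. */
static void fill_encap(struct encap_hops *encap,
                       const uint32_t *hops, uint32_t num_hops)
{
    encap->next_hop = hops[0];
    encap->num_alternates = num_hops - 1;
    memcpy(encap->alternates, &hops[1],
           (num_hops - 1) * sizeof(uint32_t));
}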

ravisinghsfbay commented 4 years ago

Hi Theo, the statelessness is a pretty strong design aspect that ought to be preserved. So, totally agree.

I think adding extra servers to the forwarding table rows is a significantly simpler change, in that it only adds extra potential hops and doesn't change any of the design constraints of the system.

I’ve already got this done and tested manually. I'm in the process of updating the unit tests; following that I'll create a PR for this.

Regards, Ravi

pavantc commented 4 years ago

The following is a possibility if we are already able to determine when we are starting new connections AND when we are terminating a connection at the director:

The glb-director is stateless, it doesn't know when connections are created or destroyed and doesn't maintain any TCP connection state.

The rest is moot :)

My bad. I was excited, then blinded by what seemed to me like a simple solution of using the next row in the forwarding table. I wrongly assumed that the director knows when new connections are being made (with no further state preservation), so that they could be directed at the secondary (which becomes the new primary) when the primary is drained, and I missed that the redirection is managed via iptables forwarding at the proxies themselves.