When fixing the complexity of EVPN MAC mobility handling, the way destinations are indexed in the routing table changed from RD+ETAG+MAC+IP to RD+MAC only. This is incorrect per the BGP EVPN RFC. It works in most cases because, when an IP is present, virtually all EVPN implementations announce two paths: one with the IP and one without. This way route announcements are balanced and pose no issues.
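To make the difference between the two indexing schemes concrete, here is a minimal sketch. It assumes a simplified, hypothetical `macIPRoute` type and two hypothetical key functions (`fullKey`, `shortKey`); none of these names come from GoBGP, and the key formats are made up for illustration only.

```go
package main

import "fmt"

// macIPRoute is a hypothetical, simplified stand-in for an EVPN MAC/IP
// advertisement. It is NOT GoBGP's actual type.
type macIPRoute struct {
	RD   string // route distinguisher
	ETag uint32 // Ethernet tag
	MAC  string
	IP   string // empty when the route carries no IP address
}

// fullKey indexes a destination by RD+ETAG+MAC+IP (illustrative format).
func fullKey(r macIPRoute) string {
	return fmt.Sprintf("%s|%d|%s|%s", r.RD, r.ETag, r.MAC, r.IP)
}

// shortKey indexes a destination by RD+MAC only (illustrative format).
func shortKey(r macIPRoute) string {
	return fmt.Sprintf("%s|%s", r.RD, r.MAC)
}

// countDestinations returns how many distinct table entries the given
// routes occupy under the given key function.
func countDestinations(routes []macIPRoute, key func(macIPRoute) string) int {
	seen := make(map[string]struct{})
	for _, r := range routes {
		seen[key(r)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// The same MAC announced twice: without and with an overlay IP.
	routes := []macIPRoute{
		{RD: "65000:1", ETag: 0, MAC: "aa:bb:cc:dd:ee:ff"},
		{RD: "65000:1", ETag: 0, MAC: "aa:bb:cc:dd:ee:ff", IP: "192.0.2.10"},
	}
	fmt.Println("RD+ETAG+MAC+IP destinations:", countDestinations(routes, fullKey))
	fmt.Println("RD+MAC destinations:", countDestinations(routes, shortKey))
}
```

Under these assumptions, the two announcements stay separate destinations with the full key, while they share a single destination with the RD+MAC key.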
Issues arise when GoBGP is connected to multiple peers announcing the same routes (read: route reflectors), at a high rate, with lots of routes (hundreds of thousands), and when multiple paths exist for the same MAC (e.g. with and without an overlay IP address). The issue does not appear if any of these four conditions is false.
In that situation, processing ends up racy and, over time, some routes go missing due to concurrent updates. Such missing routes have been observed in a production setup with:
- hundreds of thousands of routes
- tens of updates per second
- four route reflectors
With this setup, we ended up with a handful of routes missing (usually 10 to 20) after a few days of runtime.
This commit reverts the custom tableKey implementation introduced previously and goes back to using the plain String view of the prefix. Note that this is suboptimal performance-wise, but it is correct.
Fixes: c393f43 ("evpn: fix quadratic evpn mac-mobility handling")
Sorry for introducing this bug in the first place.