Open scottlaird opened 4 years ago
Comments on my previous routes-not-propagating bug had questions about next-hop entries. I'm not sure how those are supposed to work with IPv6, but I don't see any v6 next hops in either DB.
$ redis-dump -d 0 -y
...
"ROUTE_TABLE:2001:470:e959:eeee::/64": {
"expireat": 1595699202.551358,
"ttl": -0.001,
"type": "hash",
"value": {
"ifname": "Ethernet116",
"nexthop": "fe80::9a03:9bff:fe77:95e6"
}
},
...
$ redis-dump -d 0 -y | grep fe80::9a03:9bff:fe77:95e6
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
"nexthop": "fe80::9a03:9bff:fe77:95e6"
$ redis-dump -d 1 -y | grep fe80::9a03:9bff:fe77:95e6
$
By comparison, my IPv4 route next-hops have NEIGH_TABLE entries in DB 0 and ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY entries in DB 1. So perhaps this is an issue with fpmsyncd not generating neighbor entries for v6?
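For anyone who wants to repeat that comparison without grepping by hand, here is a rough sketch that works on the parsed redis-dump output for DB 0. It assumes neighbor keys look like NEIGH_TABLE:<ifname>:<ip> and that ECMP routes store comma-separated, position-matched nexthop and ifname lists; both are my assumptions about the APPL DB layout, and routes_missing_neighbors is a made-up helper name, not part of any SONiC tool.

```python
def routes_missing_neighbors(appl_db):
    """Map each route prefix to the (ifname, nexthop) pairs that have no NEIGH_TABLE entry.

    appl_db is the dict obtained by parsing `redis-dump -d 0 -y` output as JSON.
    """
    # Assumed key layout: "NEIGH_TABLE:<ifname>:<ip>". Splitting at most twice
    # keeps the colons inside an IPv6 address intact.
    neighbors = set()
    for key in appl_db:
        if key.startswith("NEIGH_TABLE:"):
            _, ifname, ip = key.split(":", 2)
            neighbors.add((ifname, ip))

    missing = {}
    for key, entry in appl_db.items():
        if not key.startswith("ROUTE_TABLE:"):
            continue
        prefix = key[len("ROUTE_TABLE:"):]
        fields = entry.get("value", {})
        # Assumed: ECMP routes carry comma-separated, position-matched lists.
        hops = fields.get("nexthop", "").split(",")
        ifaces = fields.get("ifname", "").split(",")
        unresolved = [(i, h) for i, h in zip(ifaces, hops)
                      if h and (i, h) not in neighbors]
        if unresolved:
            missing[prefix] = unresolved
    return missing
```

Feeding it the dump above should list 2001:470:e959:eeee::/64 with its fe80:: next hop. Routes whose next hop is not a real neighbor address may show up as well, so treat the result as a hint rather than a verdict.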
I turned swssloglevel for fpmsyncd up to DEBUG, and added a new route:
# ip route add 2001:470:e959:dddd::/64 via fe80::9a03:9bff:fe77:95e6 dev Ethernet116
# grep fpmsyncd /var/log/syslog
Jul 25 17:54:28.569127 sw100 DEBUG bgp#fpmsyncd: :> select: enter
Jul 25 17:55:04.534839 sw100 DEBUG bgp#fpmsyncd: :- onRouteMsg: Receive new route message dest ip prefix: 2001:470:e959:dddd::/64
Jul 25 17:55:04.534839 sw100 DEBUG bgp#fpmsyncd: :- onRouteMsg: RouteTable set msg: 2001:470:e959:dddd::/64 fe80::9a03:9bff:fe77:95e6 Ethernet116
Jul 25 17:55:04.534893 sw100 DEBUG bgp#fpmsyncd: :< select: exit
Jul 25 17:55:04.576124 sw100 DEBUG bgp#fpmsyncd: :- main: Pipeline flushed
Jul 25 17:55:04.576124 sw100 DEBUG bgp#fpmsyncd: :> select: enter
That's coming from https://github.com/Azure/sonic-swss/blob/master/fpmsyncd/routesync.cpp, and it looks okay. It has the right next hop and interface. The APPL DB shows the correct ROUTE_TABLE entry and no entry for the next hop.
I then turned up orchagent logging, and there are a bunch of these:
Jul 25 17:59:18.435766 sw100 INFO swss#orchagent: :- addRoute: Failed to get next hop fe80::9a03:9bff:fe77:95e6@Ethernet116 for 2001:470:e959:dddd::/64
So, it looks like orchagent is more or less doing the right thing, and something upstream (fpmsyncd or neighsyncd?) is screwing up.
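To see whether every failure points at the same kind of next hop, the orchagent messages can be tallied. This is a small sketch that relies only on the message format in the log line quoted above; count_unresolved_next_hops is a hypothetical helper, and the pattern would need adjusting if the message wording differs between builds.

```python
import re
from collections import Counter

# Matches the "Failed to get next hop <ip>@<ifname> for <prefix>" message
# exactly as it appears in the syslog excerpt above.
FAILED_NEXTHOP = re.compile(
    r"addRoute: Failed to get next hop (?P<nexthop>[^@\s]+)@(?P<ifname>\S+) for (?P<prefix>\S+)"
)

def count_unresolved_next_hops(lines):
    """Count failures per (nexthop, ifname) pair across syslog lines."""
    counts = Counter()
    for line in lines:
        match = FAILED_NEXTHOP.search(line)
        if match:
            counts[(match["nexthop"], match["ifname"])] += 1
    return counts
```

Run over /var/log/syslog line by line, this would show at a glance whether all of the unresolved next hops are fe80:: addresses.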
I ran redis -d 0 monitor and added yet another test route, and here's what showed up:
1595700332.809192 [0 unix:/var/run/redis/redis.sock] "EVALSHA" "6875900592cdd1621c6191fe038ec3b29775aa13" "4" "ROUTE_TABLE_CHANNEL" "ROUTE_TABLE_KEY_SET" "_ROUTE_TABLE:2001:470:e959:cccc::/64" "_ROUTE_TABLE:2001:470:e959:cccc::/64" "G" "2001:470:e959:cccc::/64" "nexthop" "fe80::9a03:9bff:fe77:95e6" "ifname" "Ethernet116"
1595700332.809306 [0 lua] "SADD" "ROUTE_TABLE_KEY_SET" "2001:470:e959:cccc::/64"
1595700332.809384 [0 lua] "HSET" "_ROUTE_TABLE:2001:470:e959:cccc::/64" "nexthop" "fe80::9a03:9bff:fe77:95e6"
1595700332.809437 [0 lua] "HSET" "_ROUTE_TABLE:2001:470:e959:cccc::/64" "ifname" "Ethernet116"
1595700332.809478 [0 lua] "PUBLISH" "ROUTE_TABLE_CHANNEL" "G"
1595700332.809687 [2 unix:/var/run/redis/redis.sock] "HGETALL" "COUNTERS:oid:0x150000000004cd"
1595700332.809730 [0 unix:/var/run/redis/redis.sock] "EVALSHA" "88270a7c5c90583e56425aca8af8a4b8c39fe757" "3" "ROUTE_TABLE_KEY_SET" "ROUTE_TABLE:" "ROUTE_TABLE_DEL_SET" "8192" "_"
1595700332.809777 [0 lua] "SPOP" "ROUTE_TABLE_KEY_SET" "8192"
1595700332.809849 [0 lua] "SREM" "ROUTE_TABLE_DEL_SET" "2001:470:e959:cccc::/64"
1595700332.809871 [0 lua] "HGETALL" "_ROUTE_TABLE:2001:470:e959:cccc::/64"
1595700332.809897 [0 lua] "HSET" "ROUTE_TABLE:2001:470:e959:cccc::/64" "nexthop" "fe80::9a03:9bff:fe77:95e6"
1595700332.809944 [0 lua] "HSET" "ROUTE_TABLE:2001:470:e959:cccc::/64" "ifname" "Ethernet116"
1595700332.809985 [0 lua] "DEL" "_ROUTE_TABLE:2001:470:e959:cccc::/64"
I think that's okay; I don't see any code in https://github.com/Azure/sonic-swss-common/blob/master/common/producerstatetable.cpp or any of the Lua that goes with it that knows about NEIGH_TABLE, or cares about v4 vs v6. So that probably all falls to neighsyncd.
I suspect that the problem is here: https://github.com/Azure/sonic-swss/blob/a9479e646649e67d28d4afba395ab16c8907e7c7/neighsyncd/neighsync.cpp#L76, where it explicitly ignores v6 link-local neighbors. OSPFv3 explicitly uses link-local neighbors (as per the RFC). The pull request that added that line (Azure/sonic-swss#1065) mentions "some current limitations with handling link-local neighbors" but doesn't provide any details or an issue link.
Does anyone have context on this?
That makes this a duplicate of Azure/sonic-utilities#430, which is ~1.5 years old. Is there a plan for how best to approach this?
This issue might not be OSPF specific.
Looks like your neighbors are IPv6 link-local addresses. Are you able to confirm that this issue only happens with link-local neighbors?
I ran into the same problem, and found that it happens because SONiC ignores link-local addresses when processing the kernel's neighbor table, so neighbors with link-local addresses are not pushed down. As a result, a learned IPv6 route cannot find its next hop and fails to be configured.
nl_addr2str(rtnl_neigh_get_dst(neigh), ipStr, MAX_ADDR_SIZE);

/* Ignore IPv6 link-local addresses as neighbors */
if (family == IPV6_NAME && IN6_IS_ADDR_LINKLOCAL(nl_addr_get_binary_addr(rtnl_neigh_get_dst(neigh))))
    return;

/* Ignore IPv6 multicast link-local addresses as neighbors */
if (family == IPV6_NAME && IN6_IS_ADDR_MC_LINKLOCAL(nl_addr_get_binary_addr(rtnl_neigh_get_dst(neigh))))
    return;
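To make it concrete which addresses those two checks drop, here is a minimal Python sketch that mirrors the standard meaning of the two macros (fe80::/10 for IN6_IS_ADDR_LINKLOCAL, multicast with scope 2 for IN6_IS_ADDR_MC_LINKLOCAL). It is an illustration only, not the neighsyncd code, and is_ignored_by_filter is a made-up name.

```python
import ipaddress

def is_ignored_by_filter(addr_str):
    """Return True if the address would be skipped by the two checks quoted above."""
    addr = ipaddress.ip_address(addr_str)
    if addr.version != 6:
        # Both checks are guarded by the IPv6 family test.
        return False
    # IN6_IS_ADDR_LINKLOCAL: unicast link-local, i.e. inside fe80::/10.
    if addr.is_link_local:
        return True
    # IN6_IS_ADDR_MC_LINKLOCAL: multicast (ff00::/8) whose 4-bit scope
    # field, the low nibble of the second byte, is 2 (link-local scope).
    if addr.is_multicast and (addr.packed[1] & 0x0F) == 0x02:
        return True
    return False
```

With this, is_ignored_by_filter("fe80::9a03:9bff:fe77:95e6") is true, which matches the next hop from the report never getting a neighbor entry, while a global address such as 2001:470:e959:eeee::1 is not filtered.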
Description
Kernel IPv6 routes aren't consistently propagating from the APPL_DB to the ASIC_DB. This is seen on two different devices running recent(ish) Jenkins builds.
I'm using OSPFv3 to propagate IPv6 routes. They're appearing in the kernel just fine:
And they're in the APPL DB:
But they're not in the ASIC DB:
There is nothing useful in /var/log/syslog or /var/log/swss.
Looking deeper, there are only 4 IPv6 routes in the ASIC DB:
For comparison, ip -6 addr show | wc gives 66 lines, although a few of those are ECMP routes.
Oddly, the default route in the ASIC DB (::/0) isn't actually correct, either. Here's what's in the ASIC DB:
That's listed as SAI_PACKET_ACTION_DROP. However, the kernel has a default route:
The APPL DB version matches the kernel:
I'm not sure where the default drop is coming from.
Steps to reproduce the issue:
ip route add 2001:470:e959:ffff::/64 via fe80::ba6a:97ff:fe8a:7168 dev Ethernet120
redis-dump -d 0 -y | grep ffff
redis-dump -d 1 -y | grep ffff
Describe the results you received:
No route for 2001:470:e959:ffff::/64 in the ASIC DB.
Describe the results you expected:
One route for 2001:470:e959:ffff::/64 in the ASIC DB.
Additional information you deem important (e.g. issue happens only occasionally):
Output of show version:
Attach debug file sudo generate_dump: sonic_dump_sw100_20200725_060652.tar.gz