Expected Behavior
The routes set by policy-based routing should not interfere with each other.
Current Behavior
In a Kernel2VXLAN2Kernel use case, some routing tables belonging to policy rules are missing. The sequence of events:

1. The NSC container is collocated with other application containers in a pod.
2. The NSC application is killed. The connection remains open, since data-path healing is disabled.
3. When the NSC container restarts, a new connection is established by the forwarder; two parallel connections now coexist for the same data path.
4. After the 10-minute timeout, the 'ghost' connection is closed by the forwarder, and at this point the routing table belonging to the related routing policy is flushed.
5. The other, sane connection remains, but the missing routes are never restored: the forwarder stores the policies keyed by connection ID and believes they are already set in the application pod's namespace (see the sketch below).

Most probably this would not happen if data-path healing were enabled, or if the NSC reused the connection ID requested from nsmgr, but the forwarder should tolerate these deviations.
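Below is a minimal, runnable Go sketch of the suspected bookkeeping flaw, not the actual forwarder code: the connection IDs, table numbers, and the assumption that both connections program the same routing tables are illustrative only.

package main

import "fmt"

// forwarder models only the bookkeeping in question: which routing
// tables have been programmed, keyed by connection ID.
type forwarder struct {
	appliedPolicies map[string][]int // connection ID -> routing table IDs
	kernelTables    map[int]bool     // table ID -> routes present in kernel
}

// request programs the policy tables unless the connection ID is already
// known - in that case it assumes the tables are still in place.
func (f *forwarder) request(connID string, tables []int) {
	if _, ok := f.appliedPolicies[connID]; ok {
		return // "already set into the application pod's namespace"
	}
	f.appliedPolicies[connID] = tables
	for _, t := range tables {
		f.kernelTables[t] = true
	}
}

// close flushes the tables that were recorded for the given connection.
func (f *forwarder) close(connID string) {
	for _, t := range f.appliedPolicies[connID] {
		delete(f.kernelTables, t)
	}
	delete(f.appliedPolicies, connID)
}

func main() {
	f := &forwarder{appliedPolicies: map[string][]int{}, kernelTables: map[int]bool{}}

	f.request("conn-old", []int{1, 2}) // original NSC connection
	f.request("conn-new", []int{1, 2}) // restarted NSC: new ID, same data path

	f.close("conn-old")                // ghost connection closed after the timeout
	f.request("conn-new", []int{1, 2}) // no-op: conn-new is "already set"

	fmt.Println("recorded for conn-new:", f.appliedPolicies["conn-new"]) // [1 2]
	fmt.Println("tables in kernel:", f.kernelTables)                     // map[] - flushed
}

The surviving connection still records its tables as applied while the kernel state is gone, which matches the empty routing tables in the output below.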
The rules and routing configuration in the application pod:
# ip rule
0: from all lookup local
32762: from 214.14.132.66 lookup 4
32763: from 214.14.132.65 lookup 3
32764: from 214.14.131.113 lookup 2
32765: from 214.14.131.114 lookup 1
32766: from all lookup main
32767: from all lookup default
bash-4.4# ip route show table all
default via 172.16.16.1 dev nsm-1 table 3 onlink
default via 172.16.16.1 dev nsm-1 table 4 onlink
default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0 scope link
172.16.1.0/24 dev nsm-0 proto kernel scope link src 172.16.1.12
172.16.16.0/24 dev nsm-1 proto kernel scope link src 172.16.16.12
.....

Note that the rules reference tables 1-4, yet no routes for tables 1 and 2 appear in the output above; these empty tables are the missing routing tables described in Current Behavior.
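As a quick way to spot this condition programmatically, here is a small Go sketch that lists the IPv4 policy rules in the current network namespace and flags tables that are referenced by a rule but contain no routes, as tables 1 and 2 do above. It assumes the github.com/vishvananda/netlink module, must run on Linux inside the affected pod's namespace, and only reads state.

package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

func main() {
	rules, err := netlink.RuleList(netlink.FAMILY_V4)
	if err != nil {
		panic(err)
	}
	for _, r := range rules {
		if r.Table >= 253 { // skip the built-in local/main/default tables
			continue
		}
		routes, err := netlink.RouteListFiltered(netlink.FAMILY_V4,
			&netlink.Route{Table: r.Table}, netlink.RT_FILTER_TABLE)
		if err != nil {
			panic(err)
		}
		if len(routes) == 0 {
			fmt.Printf("rule %d (from %v) -> table %d has no routes\n",
				r.Priority, r.Src, r.Table)
		}
	}
}

Run against the state above, this would report tables 1 and 2.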
Context
Failure Logs
The last request from the killed NSC:
The request for the new connection:
The close after timeout: