oxidecomputer / maghemite

A routing stack written in Rust.

mg-ddm sometimes does not populate illumos routes #59

Closed jmpesp closed 1 year ago

jmpesp commented 1 year ago

Testing in the Canada region, I have four "gimlets". To reproduce this issue, I start the sled-agent on all four, and prevent RSS from happening. mg-ddm-verify reports that all sleds have received all bootstrap prefix advertisements correctly:

jwm@fancyfeast:~/mg-ddm-verify$ cat sleds.json
[
  {"name": "dinnerbone", "ip": "10.0.0.4"},
  {"name": "kibblesnbits", "ip": "10.0.0.5"},
  {"name": "gravytrain", "ip": "10.0.0.6"},
  {"name": "frostypaws", "ip": "10.0.0.7"}
]
jwm@fancyfeast:~/mg-ddm-verify$ ./target/debug/mg-ddm-verify
missed directions:
jwm@fancyfeast:~/mg-ddm-verify$

However, mg-ddm has not installed routes for all of those prefixes: some machines have received a bootstrap address prefix advertisement but have no corresponding route in the global zone (GZ):

james@gravytrain:~$ /opt/oxide/mg-ddm/ddmadm get-prefixes
Destination               Next Hop                   Path
fdb0:8061:5f11:ab31::/64  fe80::8261:5fff:fe11:ab30  oxz_switch frostypaws
fdb0:1b:21c1:ffe0::/64    fe80::8261:5fff:fe11:ab30  oxz_switch dinnerbone
fdb0:1b:21c1:fcda::/64    fe80::8261:5fff:fe11:ab30  oxz_switch kibblesnbits

james@gravytrain:~$ netstat -rn -f dst:fdb0:8061:5f11:ab31::/64

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
fdb0:8061:5f11:ab31::/64    fe80::8261:5fff:fe11:ab30   UG      1       0 ixgbe0 

james@gravytrain:~$ netstat -rn -f dst:fdb0:1b:21c1:ffe0::/64
james@gravytrain:~$ 

james@gravytrain:~$ netstat -rn -f dst:fdb0:1b:21c1:fcda::/64

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
fdb0:1b:21c1:fcda::/64      fe80::8261:5fff:fe11:ab30   UG      1       0 ixgbe0

In this scenario, ping packets from dinnerbone to gravytrain's bootstrap address are not answered because gravytrain has no route back to dinnerbone's bootstrap prefix (fdb0:1b:21c1:ffe0::/64):

james@dinnerbone:~$ ipadm | grep bootstrap6
bootstrap0/bootstrap6 static ok         fdb0:1b:21c1:ffe0::1/64

james@gravytrain:~$ ipadm | grep bootstrap6
bootstrap0/bootstrap6 static ok         fdb0:1b:21c1:fd24::1/64

james@gravytrain:~$ ping fdb0:1b:21c1:ffe0::1
ping: sendto No route to host

james@dinnerbone:~$ ping fdb0:1b:21c1:fd24::1
no answer from fdb0:1b:21c1:fd24::1

This only happens intermittently. The mg-ddm service log on gravytrain says:

james@gravytrain:~$ cat $(svcs -L mg-ddm)
[ Apr 26 19:18:50 Disabled. ]
[ Apr 26 19:18:50 Rereading configuration. ]
[ Apr 26 19:18:50 Enabled. ]
[ Apr 26 19:18:50 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/mg-ddm/pkg/ddm_method_script.sh &"). ]
[ Apr 26 19:18:50 Method "start" exited with status 0. ]
Apr 26 19:18:50.312 INFO [0] sm initialized with addr fe80::21b:21ff:fec1:fd24 on if index 3
Apr 26 19:18:50.312 INFO [0] sm initialized with addr fe80::8:20ff:fe2c:6d90 on if index 4
Apr 26 19:18:50.313 INFO admin: listening on [::]:8000
Apr 26 19:18:50.752 WARN [4] admin event in solicit state: Announce({Ipv6Prefix { addr: fdb0:1b:21c1:fd24::, len: 64 }})
Apr 26 19:18:50.752 WARN [3] admin event in solicit state: Announce({Ipv6Prefix { addr: fdb0:1b:21c1:fd24::, len: 64 }})
Apr 26 19:19:04.313 INFO [3] nbr is fe80::8261:5fff:fe11:ab30@oxz_switch transit
Apr 26 19:19:04.313 INFO [3] exchange: listening on [fe80::21b:21ff:fec1:fd24]:56797
Apr 26 19:19:04.313 INFO waiting for exchange server to start
Apr 26 19:19:04.580 INFO sending 1 routes to illumos
Apr 26 19:19:04.581 INFO removing 0 routes from illumos
Apr 26 19:19:06.566 WARN [3] exchange pull: timeout error: deadline has elapsed
Apr 26 19:19:06.842 INFO sending 1 routes to illumos
Apr 26 19:19:06.842 INFO removing 0 routes from illumos
Apr 26 19:19:08.817 WARN [3] exchange pull: timeout error: deadline has elapsed
Apr 26 19:19:08.818 INFO sending 4 routes to illumos
Apr 26 19:19:08.819 ERRO [3] add system route: set route: io error File exists (os error 17)
Apr 26 19:19:08.819 INFO removing 0 routes from illumos

leftwo commented 1 year ago

I believe I'm seeing the same issue on the dogfood rack. I have two sleds and they can't ping each other over the bootstrap6 interface.

rcgoodfellow commented 1 year ago

This is happening on the dogfood rack when routers come and go.

rcgoodfellow commented 1 year ago

I believe the File exists messages were a red herring and the actual issue has been solved by #66 and #61. Closing for now. Can reopen if this occurs again.

rcgoodfellow commented 1 year ago

This happened again today, and I think I now see the cause.

The early return at line 370 below will prevent any routes that follow in the routes list from being added to the kernel.

https://github.com/oxidecomputer/maghemite/blob/043744da65a016a214c9eaa78e65931a785e6690/ddm/src/sys.rs#L354-L375

So while the File exists error was a red herring in terms of the route that caused the error, it was the cause of subsequent routes not making it to the kernel.
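
To illustrate the failure mode described above, here is a minimal sketch, not the actual code in ddm/src/sys.rs. `FakeKernel`, `add_routes_early_return`, and `add_routes_continue` are hypothetical names invented for this example, and the in-memory set only stands in for the illumos routing table. It shows how returning on the first "File exists" error skips every route later in the list, and one possible alternative that records the error and keeps going.

```rust
use std::collections::BTreeSet;
use std::io;

/// Hypothetical stand-in for the kernel routing table (not the real sys.rs API).
#[derive(Default)]
pub struct FakeKernel {
    routes: BTreeSet<String>,
}

impl FakeKernel {
    /// Install a route. A duplicate is rejected with AlreadyExists,
    /// mirroring the "File exists (os error 17)" line in the log.
    pub fn add_route(&mut self, dest: &str) -> io::Result<()> {
        if !self.routes.insert(dest.to_string()) {
            return Err(io::Error::from(io::ErrorKind::AlreadyExists));
        }
        Ok(())
    }

    pub fn has_route(&self, dest: &str) -> bool {
        self.routes.contains(dest)
    }
}

/// Problematic shape: `?` returns on the first error, so every route
/// after the failing one is never handed to the kernel.
pub fn add_routes_early_return(k: &mut FakeKernel, routes: &[&str]) -> io::Result<()> {
    for &r in routes {
        k.add_route(r)?;
    }
    Ok(())
}

/// One possible alternative (an assumption, not necessarily the fix that
/// was merged): log the error, count it, and continue with the rest.
pub fn add_routes_continue(k: &mut FakeKernel, routes: &[&str]) -> usize {
    let mut failed = 0;
    for &r in routes {
        if let Err(e) = k.add_route(r) {
            eprintln!("add system route {r}: {e}");
            failed += 1;
        }
    }
    failed
}
```

With a kernel that already holds the first prefix in the list, `add_routes_early_return` returns an error and installs nothing further, while `add_routes_continue` reports one failure and still installs the remaining prefixes. The ordering of prefixes here is illustrative; the log above does not show the order mg-ddm used.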