oxidecomputer / maghemite

A routing stack written in Rust.
Mozilla Public License 2.0

mg-ddm sometimes does not populate illumos routes #59

Closed jmpesp closed 1 year ago

jmpesp commented 1 year ago

Testing in the Canada region, I have four "gimlets". To reproduce this issue, I start the sled-agent on all four, and prevent RSS from happening. mg-ddm-verify reports that all sleds have received all bootstrap prefix advertisements correctly:

jwm@fancyfeast:~/mg-ddm-verify$ cat sleds.json
  {"name": "dinnerbone", "ip": ""},
  {"name": "kibblesnbits", "ip": ""},
  {"name": "gravytrain", "ip": ""},
  {"name": "frostypaws", "ip": ""}
jwm@fancyfeast:~/mg-ddm-verify$ ./target/debug/mg-ddm-verify
missed directions:

However, mg-ddm has failed to install routes for some of those prefixes: some machines have received a bootstrap address prefix advertisement but have no corresponding route in the global zone (GZ):

james@gravytrain:~$ /opt/oxide/mg-ddm/ddmadm get-prefixes
Destination               Next Hop                   Path
fdb0:8061:5f11:ab31::/64  fe80::8261:5fff:fe11:ab30  oxz_switch frostypaws
fdb0:1b:21c1:ffe0::/64    fe80::8261:5fff:fe11:ab30  oxz_switch dinnerbone
fdb0:1b:21c1:fcda::/64    fe80::8261:5fff:fe11:ab30  oxz_switch kibblesnbits

james@gravytrain:~$ netstat -rn -f dst:fdb0:8061:5f11:ab31::/64

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
fdb0:8061:5f11:ab31::/64    fe80::8261:5fff:fe11:ab30   UG      1       0 ixgbe0 

james@gravytrain:~$ netstat -rn -f dst:fdb0:1b:21c1:ffe0::/64

james@gravytrain:~$ netstat -rn -f dst:fdb0:1b:21c1:fcda::/64

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
fdb0:1b:21c1:fcda::/64      fe80::8261:5fff:fe11:ab30   UG      1       0 ixgbe0
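
The comparison above (prefixes reported by `ddmadm get-prefixes` versus routes reported by `netstat -rn`) amounts to a set difference. A minimal sketch of that check, assuming the two lists have already been collected as strings; the function name `missing_routes` is hypothetical and is not part of mg-ddm or mg-ddm-verify:

```rust
use std::collections::BTreeSet;

/// Returns the advertised prefixes that have no matching installed route.
/// Hypothetical helper for illustration only.
fn missing_routes<'a>(advertised: &[&'a str], installed: &[&'a str]) -> Vec<&'a str> {
    // Collect the installed destinations into a set for fast lookup.
    let installed: BTreeSet<&str> = installed.iter().copied().collect();
    // Keep only the advertised prefixes that are absent from the set.
    advertised
        .iter()
        .copied()
        .filter(|p| !installed.contains(p))
        .collect()
}
```

Fed the three prefixes from the `ddmadm` output and the two destinations that `netstat` returned on gravytrain, this would report `fdb0:1b:21c1:ffe0::/64` as the only missing route.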

In this scenario, pings from dinnerbone to gravytrain's bootstrap address go unanswered because gravytrain has no route back to dinnerbone's bootstrap prefix (fdb0:1b:21c1:ffe0::/64):

james@dinnerbone:~$ ipadm | grep bootstrap6
bootstrap0/bootstrap6 static ok         fdb0:1b:21c1:ffe0::1/64

james@gravytrain:~$ ipadm | grep bootstrap6
bootstrap0/bootstrap6 static ok         fdb0:1b:21c1:fd24::1/64

james@gravytrain:~$ ping fdb0:1b:21c1:ffe0::1
ping: sendto No route to host

james@dinnerbone:~$ ping fdb0:1b:21c1:fd24::1
no answer from fdb0:1b:21c1:fd24::1

This only happens intermittently. The mg-ddm service log on gravytrain says:

james@gravytrain:~$ cat $(svcs -L mg-ddm)
[ Apr 26 19:18:50 Disabled. ]
[ Apr 26 19:18:50 Rereading configuration. ]
[ Apr 26 19:18:50 Enabled. ]
[ Apr 26 19:18:50 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/mg-ddm/pkg/ddm_method_script.sh &"). ]
[ Apr 26 19:18:50 Method "start" exited with status 0. ]
Apr 26 19:18:50.312 INFO [0] sm initialized with addr fe80::21b:21ff:fec1:fd24 on if index 3
Apr 26 19:18:50.312 INFO [0] sm initialized with addr fe80::8:20ff:fe2c:6d90 on if index 4
Apr 26 19:18:50.313 INFO admin: listening on [::]:8000
Apr 26 19:18:50.752 WARN [4] admin event in solicit state: Announce({Ipv6Prefix { addr: fdb0:1b:21c1:fd24::, len: 64 }})
Apr 26 19:18:50.752 WARN [3] admin event in solicit state: Announce({Ipv6Prefix { addr: fdb0:1b:21c1:fd24::, len: 64 }})
Apr 26 19:19:04.313 INFO [3] nbr is fe80::8261:5fff:fe11:ab30@oxz_switch transit
Apr 26 19:19:04.313 INFO [3] exchange: listening on [fe80::21b:21ff:fec1:fd24]:56797
Apr 26 19:19:04.313 INFO waiting for exchange server to start
Apr 26 19:19:04.580 INFO sending 1 routes to illumos
Apr 26 19:19:04.581 INFO removing 0 routes from illumos
Apr 26 19:19:06.566 WARN [3] exchange pull: timeout error: deadline has elapsed
Apr 26 19:19:06.842 INFO sending 1 routes to illumos
Apr 26 19:19:06.842 INFO removing 0 routes from illumos
Apr 26 19:19:08.817 WARN [3] exchange pull: timeout error: deadline has elapsed
Apr 26 19:19:08.818 INFO sending 4 routes to illumos
Apr 26 19:19:08.819 ERRO [3] add system route: set route: io error File exists (os error 17)
Apr 26 19:19:08.819 INFO removing 0 routes from illumos

leftwo commented 1 year ago

I believe I'm seeing the same issue on the dogfood rack. I have two sleds, and they can't ping each other over the bootstrap6 interface.

rcgoodfellow commented 1 year ago

This is happening on the dogfood rack when routers come and go.

rcgoodfellow commented 1 year ago

I believe the File exists messages were a red herring and the actual issue has been solved by #66 and #61. Closing for now. Can reopen if this occurs again.

rcgoodfellow commented 1 year ago

This happened again today, and I think I now see the cause.

The early return at line 370 below will prevent any routes that follow in the routes list from being added to the kernel.


So while the File exists error was a red herring in terms of the route that caused the error, it was the cause of subsequent routes not making it to the kernel.
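
To illustrate the failure mode described above, here is a minimal sketch. It is not the maghemite code: `add_system_route`, `add_routes_early_return`, and `add_routes_continue` are hypothetical names, and the kernel routing table is modeled as a plain `Vec<String>`. The first loop returns on the first error, so one `File exists` failure stops every later route from being added. The second loop records the error and keeps going, which is one possible way to avoid the problem.

```rust
use std::io;

/// Hypothetical stand-in for installing one route in the kernel.
/// Fails with `AlreadyExists` (the "File exists" error) if the route is present.
fn add_system_route(kernel: &mut Vec<String>, prefix: &str) -> io::Result<()> {
    if kernel.iter().any(|p| p == prefix) {
        return Err(io::Error::from(io::ErrorKind::AlreadyExists));
    }
    kernel.push(prefix.to_string());
    Ok(())
}

/// Failure mode: `?` returns on the first error, so routes later in the
/// list are never attempted.
fn add_routes_early_return(kernel: &mut Vec<String>, routes: &[&str]) -> io::Result<()> {
    for r in routes {
        add_system_route(kernel, r)?;
    }
    Ok(())
}

/// Alternative: collect each error and continue with the remaining routes.
fn add_routes_continue(kernel: &mut Vec<String>, routes: &[&str]) -> Vec<(String, io::Error)> {
    let mut errors = Vec::new();
    for r in routes {
        if let Err(e) = add_system_route(kernel, r) {
            errors.push((r.to_string(), e));
        }
    }
    errors
}
```

With a kernel table that already holds the first route in the list, `add_routes_early_return` leaves the table unchanged and returns the error, while `add_routes_continue` installs the remaining routes and reports a single error. Whether an already-existing route should be treated as an error at all is a separate design question that this sketch does not settle.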