oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Marking a sled non_provisionable causes live connections of existing instances in opte table to be stuck in SYN_RCVD state #5873

Open askfongjojo opened 5 months ago

askfongjojo commented 5 months ago

I ran into this issue after marking two sleds non_provisionable (more context in #5872). At the time, iperf3 had a number of TCP connections established between instances on different sleds. I noticed a large number of retransmits involving instances running on the sleds marked non_provisionable. I looked at the opte TCP layer entries and saw that all the TCP connections to the impacted instances were in the SYN_RCVD state:

BRM42220051 # opteadm dump-tcp-flows -p opte7
FLOW                                    STATE     HITS  SEGS IN  SEGS OUT  BYTES IN  BYTES OUT
TCP:172.30.0.20:5200:172.30.0.10:59562  SYN_RCVD  10    3        8         432       592
TCP:172.30.0.20:5203:172.30.0.10:36768  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5203:172.30.0.10:38560  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5203:172.30.0.10:39284  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5203:172.30.0.10:39528  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5203:172.30.0.10:41718  SYN_RCVD  14    5        10        720       740
TCP:172.30.0.20:5203:172.30.0.10:44424  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5203:172.30.0.10:58624  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5205:172.30.0.13:32896  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5205:172.30.0.13:38618  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5205:172.30.0.13:41132  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5205:172.30.0.13:43102  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5205:172.30.0.13:59594  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5205:172.30.0.13:60384  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5207:172.30.0.16:34954  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5207:172.30.0.16:36654  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5207:172.30.0.16:38958  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5207:172.30.0.16:39322  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5207:172.30.0.16:48312  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5207:172.30.0.16:60504  SYN_RCVD  18    7        12        1008      888

Stopping and starting the instances, or marking the sleds provisionable again, did not unstick or clear these entries.
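
For anyone wanting to check a dump like the one above without reading it row by row, here is a minimal sketch that tallies flows by state. It assumes the output has been captured as text and that every flow row has the seven whitespace-separated columns shown above, with IPv4 endpoints. The names `parse_tcp_flows` and `count_by_state` are hypothetical helpers, not part of opteadm or OPTE.

```python
from collections import Counter

# Three rows copied from the dump above, plus its header line.
SAMPLE = """\
FLOW                                    STATE     HITS  SEGS IN  SEGS OUT  BYTES IN  BYTES OUT
TCP:172.30.0.20:5200:172.30.0.10:59562  SYN_RCVD  10    3        8         432       592
TCP:172.30.0.20:5203:172.30.0.10:36768  SYN_RCVD  18    7        12        1008      888
TCP:172.30.0.20:5205:172.30.0.13:32896  SYN_RCVD  18    7        12        1008      888
"""


def parse_tcp_flows(dump):
    """Parse captured dump text into a list of dicts, one per flow row.

    Assumption: rows look like the ones in this issue. Anything else
    (header, shell prompt, blank lines, non-IPv4 flows) is skipped.
    """
    flows = []
    for line in dump.splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[0].startswith("TCP:"):
            continue
        # The flow key is TCP:<addr>:<port>:<addr>:<port> in the dump above.
        parts = fields[0].split(":")
        if len(parts) != 5:
            continue  # not the IPv4 shape this sketch understands
        _, first_ip, first_port, second_ip, second_port = parts
        flows.append({
            "first": (first_ip, int(first_port)),
            "second": (second_ip, int(second_port)),
            "state": fields[1],
            "hits": int(fields[2]),
            "segs_in": int(fields[3]),
            "segs_out": int(fields[4]),
            "bytes_in": int(fields[5]),
            "bytes_out": int(fields[6]),
        })
    return flows


def count_by_state(flows):
    """Return a Counter mapping each TCP state to its number of flows."""
    return Counter(flow["state"] for flow in flows)


if __name__ == "__main__":
    for state, count in count_by_state(parse_tcp_flows(SAMPLE)).items():
        print(state, count)
```

Running the same tally on dumps taken a few minutes apart would show whether the SYN_RCVD entries are draining or staying put.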

askfongjojo commented 5 months ago

The TCP connections were removed after a period of inactivity rather than being left behind indefinitely. As such, feel free to close this as a duplicate of #5872 if there are no other concerns.