The problem
Occasionally in production, when a new cluster is brought up (via a VIP-based bootstrap VM), we get a DNAT_EXISTS error that prevents the VIP from being set up, which makes the VM inaccessible.
The cause
Log analysis suggests that not all neighboring NAT entries are removed via gRPC, even though the whole VM (NAT and interface) has been removed. As a result, the VNI data is reset without being fully cleaned up.
Further information
@guvenc confirmed through metalnet log analysis that the VNI gets unsubscribed before all neighbor removal messages are received. This ordering is not a design problem in itself; only the VNI cleanup needs to be fixed.
This bug only happens when a VIP is created on a VNI previously occupied by a NAT that was removed in the way described above, because only VIP creation can fail with DNAT_EXISTS. This is also why OSC only observes it during cluster bootstrapping.
Proposed solution
Add DNAT entry cleanup to VNI reset.
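To illustrate the idea, here is a minimal toy model in Python. It is a sketch, not the actual implementation: the class, the method names, the table layout, and the assumption that a neighboring NAT entry leaves a DNAT entry keyed by (VNI, IP) are all hypothetical and chosen only to show why a VNI reset without DNAT cleanup could lead to DNAT_EXISTS on a later VIP creation.

```python
# Toy model only: every name here is hypothetical and does not mirror the real code base.
DNAT_EXISTS = "DNAT_EXISTS"
OK = "OK"


class ToyDataplane:
    def __init__(self):
        # (vni, ip) -> target; stands in for the DNAT table (assumed layout)
        self.dnat = {}
        # vni -> set of IPs with a neighboring NAT entry (assumed layout)
        self.neigh_nat = {}

    def add_neighbor_nat(self, vni, ip, target):
        # Assumption: a neighboring NAT entry also occupies a DNAT slot
        self.neigh_nat.setdefault(vni, set()).add(ip)
        self.dnat[(vni, ip)] = target

    def create_vip(self, vni, ip, target):
        # VIP creation fails if the DNAT slot is still occupied
        if (vni, ip) in self.dnat:
            return DNAT_EXISTS
        self.dnat[(vni, ip)] = target
        return OK

    def reset_vni(self, vni, cleanup_dnat):
        # Per-VNI bookkeeping is dropped in both variants
        self.neigh_nat.pop(vni, None)
        if cleanup_dnat:
            # Proposed behaviour: also drop every DNAT entry of this VNI
            for key in [k for k in self.dnat if k[0] == vni]:
                del self.dnat[key]


def vip_result_after_reset(cleanup_dnat):
    dp = ToyDataplane()
    dp.add_neighbor_nat(100, "10.0.0.1", "fd00::1")
    # The VNI is reset before the neighbor removal message arrives
    dp.reset_vni(100, cleanup_dnat=cleanup_dnat)
    return dp.create_vip(100, "10.0.0.1", "fd00::2")
```

In this model, `vip_result_after_reset(False)` reproduces the stale-entry failure, while `vip_result_after_reset(True)` lets the VIP be created. The cleanup only touches entries of the VNI being reset, so other VNIs are unaffected.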