ironcore-dev / dpservice

DPDK based fast Dataplane / L3 router / SDN enabler, installable on compute nodes / SmartNICs
Apache License 2.0

VIP creation blocked by previously removed NAT #573

Closed: PlagueCZ closed this issue 3 months ago

PlagueCZ commented 3 months ago

The problem
Sometimes in production, when a new cluster is being brought up (via a VIP-based bootstrap VM), we get a DNAT_EXISTS error that prevents the VIP from being set up, which makes the VM inaccessible.

The cause
Log analysis suggests that when a whole VM (NAT and interface) is removed, not all of its neighboring NAT entries are removed via gRPC. The VNI data is then reset without being fully cleaned up.

Further information
Confirmed by @guvenc in metalnet log analysis: the VNI gets unsubscribed before all neighbor removal messages are received. This ordering is not a problem by design; only the VNI cleanup needs to be fixed.

This bug only happens when a VIP needs to be created on a VNI previously occupied by a NAT that was removed in the way described above, as only VIP creation can end in a DNAT_EXISTS error. This is also why OSC only observes it during cluster bootstrapping.

Proposed solution
Add DNAT entry cleanup to VNI reset.
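
To illustrate the failure and the proposed fix, here is a minimal toy model in C. It is a sketch only and not dpservice code: the table layout, all function names (`neighnat_add`, `vip_create`, `vni_reset`, `replay`) and the `purge_dnat` switch are hypothetical. It also assumes that a neighbor NAT announcement leaves a DNAT entry behind and that the new VIP reuses the same VNI and address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: every name below is hypothetical, not dpservice's API. */
#define MAX_DNAT 16

enum status { OK = 0, DNAT_EXISTS, TABLE_FULL };

struct dnat_entry {
    bool     used;
    uint32_t vni;
    uint32_t ip;   /* address the DNAT rule matches on */
};

static struct dnat_entry dnat_table[MAX_DNAT];

static struct dnat_entry *dnat_find(uint32_t vni, uint32_t ip)
{
    for (size_t i = 0; i < MAX_DNAT; i++)
        if (dnat_table[i].used && dnat_table[i].vni == vni && dnat_table[i].ip == ip)
            return &dnat_table[i];
    return NULL;
}

static enum status dnat_insert(uint32_t vni, uint32_t ip)
{
    for (size_t i = 0; i < MAX_DNAT; i++) {
        if (!dnat_table[i].used) {
            dnat_table[i] = (struct dnat_entry){ .used = true, .vni = vni, .ip = ip };
            return OK;
        }
    }
    return TABLE_FULL;
}

/* Assumption: a neighbor NAT announcement installs a DNAT entry and
 * silently reuses an existing one. */
enum status neighnat_add(uint32_t vni, uint32_t ip)
{
    return dnat_find(vni, ip) ? OK : dnat_insert(vni, ip);
}

/* In this model, VIP creation is the only call that reports DNAT_EXISTS. */
enum status vip_create(uint32_t vni, uint32_t ip)
{
    return dnat_find(vni, ip) ? DNAT_EXISTS : dnat_insert(vni, ip);
}

/* VNI reset. With purge_dnat set it also drops every DNAT entry of the VNI
 * (the proposed fix); without it, stale entries survive the reset.
 * Returns the number of entries removed. */
size_t vni_reset(uint32_t vni, bool purge_dnat)
{
    size_t removed = 0;

    if (!purge_dnat)
        return 0;
    for (size_t i = 0; i < MAX_DNAT; i++) {
        if (dnat_table[i].used && dnat_table[i].vni == vni) {
            dnat_table[i].used = false;
            removed++;
        }
    }
    return removed;
}

/* Replays the reported sequence: a neighbor NAT is announced, the VM goes
 * away and the VNI is reset before the neighbor removal arrives, then a
 * VIP is created on the same VNI. */
enum status replay(bool with_fix)
{
    const uint32_t vni = 100, ip = 0x0a000001; /* arbitrary example values */

    memset(dnat_table, 0, sizeof(dnat_table));
    enum status st = neighnat_add(vni, ip);
    assert(st == OK);
    (void)st;
    vni_reset(vni, with_fix);
    return vip_create(vni, ip);
}
```

In this model, `replay(false)` returns `DNAT_EXISTS` because the stale entry survives the reset, while `replay(true)` returns `OK`. Purging by VNI during reset does not depend on the order in which neighbor removal messages arrive, which matches the observation that the early unsubscribe is acceptable by design.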