networkservicemesh / sdk-vpp

Apache License 2.0

Tap server fails if kernel interfaces already exist #315

Closed Bolodya1997 closed 7 months ago

Bolodya1997 commented 3 years ago

Expected Behavior

Tap chain element shouldn't fail if kernel interfaces already exist.

Current Behavior

The tap chain element starts failing if it has already created kernel interfaces and a Request arrives again for the same NSMgr with a different Connection.Id.

Steps to Reproduce

  1. Client Requests NSM.
  2. The Request succeeds in the Forwarder and starts returning.
  3. The Request fails with a timeout.
  4. Client Requests NSM again with another Request (but with the same id on path segment 0).
  5. Tap server fails with VPPApiError: netlink error (-145)
  6. Steps 4-5 repeat for as long as the Client is running.

Failure Logs

NSMgr logs VPP Forwarder logs

Bolodya1997 commented 3 years ago

This is actually related to https://github.com/networkservicemesh/sdk/issues/1020, but it can probably be solved in some other way on the tap chain element side.

edwarnicke commented 3 years ago

@Bolodya1997 is this resolved by https://github.com/networkservicemesh/sdk/pull/1014 ?

Bolodya1997 commented 3 years ago

@edwarnicke It looks like there is the following issue:

  1. Client performs a Request, it reaches the Forwarder as id-1 - the Forwarder creates a tap interface named id-client and responds with it.
  2. The Request times out before the response reaches the Client - at this point no one will call Close on the Forwarder (https://github.com/networkservicemesh/sdk/issues/1020).
  3. Client performs a Request, it reaches the Forwarder as id-2 - the Forwarder tries to create a tap interface named id-client and fails because an interface with that name already exists.
  4. [3] repeats on every subsequent Request until the Forwarder cleans up the tap interface on timeout.

So networkservicemesh/sdk#1014 doesn't solve this issue.

The actual problem here is that both id-1 and id-2 are requesting the same tap interface. Normally this shouldn't happen, because:

  1. If the Client restarts, it fetches the old path, so the Request reaches the Forwarder as id-1.
  2. If the Endpoint dies, id-1 gets Closed during healing, so id-2 is OK.
  3. If the NSMgr restarts, the Client restores the Connection with the old path, so the Request reaches the Forwarder as id-1.

So the problem here is the following: even if we reuse the existing id-client interface for id-2, the timeout will eventually fire for id-1 and delete this interface. So we either need to close id-1 somehow without waiting for the timeout, or introduce some kind of refcount(?) for the tap interface.
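
To make the refcount option concrete, here is a minimal sketch, assuming the tap chain element could consult a tracker before creating or deleting the kernel interface. `tapRefs`, `Acquire` and `Release` are hypothetical names and are not part of sdk-vpp. It tracks a set of connection ids per interface name rather than a bare counter, so a refresh from the same id does not inflate the count.

```go
package main

import "sync"

// tapRefs is a hypothetical tracker of which connection ids use
// which kernel interface name.
type tapRefs struct {
	mu    sync.Mutex
	users map[string]map[string]struct{} // interface name -> set of connection ids
}

func newTapRefs() *tapRefs {
	return &tapRefs{users: make(map[string]map[string]struct{})}
}

// Acquire registers connID as a user of the interface. It returns true
// only for the first user, i.e. when the caller has to create the interface.
func (r *tapRefs) Acquire(name, connID string) (create bool) {
	r.mu.Lock()
	defer r.mu.Unlock()

	set, ok := r.users[name]
	if !ok {
		set = make(map[string]struct{})
		r.users[name] = set
	}
	set[connID] = struct{}{}
	return !ok
}

// Release drops connID from the users of the interface. It returns true
// only when the last user is gone, i.e. when the caller has to delete it.
func (r *tapRefs) Release(name, connID string) (remove bool) {
	r.mu.Lock()
	defer r.mu.Unlock()

	set, ok := r.users[name]
	if !ok {
		return false
	}
	if _, held := set[connID]; !held {
		return false
	}
	delete(set, connID)
	if len(set) > 0 {
		return false
	}
	delete(r.users, name)
	return true
}
```

With this in place, id-2 would reuse the interface created for id-1, and the later timeout for id-1 would call `Release` and get `false`, leaving the interface alone until id-2 is closed as well. It does not address whether the state VPP holds for id-1 is still valid for id-2, which would need separate handling.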

Thoughts?

glazychev-art commented 7 months ago

We don't see this problem anymore.