Closed: ondrej-fabry closed this issue 3 years ago.
Actually that's 500ms with WaitDone off. It's 3s with WaitDone on.
Also... I am dumbfounded that it takes 150ms to write 13 keys into the keystore in memory...
> Also... I am dumbfounded that it takes 150ms to write 13 keys into the keystore in memory...
Why do you assume it's only writing to memory? I see VPP interfaces, routes... being configured there. Basically all the non-PENDING `CREATE` operations are doing some action in the dataplane.
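The distinction above can be sketched in a few lines. This is a hypothetical illustration, not the scheduler's actual code: operations flagged `NOOP IS-PENDING` in the transaction log are deferred because a dependency is missing, while the other `CREATE` operations actually call into the dataplane.

```go
package main

import "fmt"

// op models an operation as it appears in the KVScheduler transaction log.
// The names mirror the log output; the dispatch logic is a simplification.
type op struct {
	key     string
	pending bool // NOOP IS-PENDING: a dependency is missing, nothing executed yet
}

// execute splits operations into those performed now (dataplane work)
// and those deferred until their dependencies appear.
func execute(ops []op) (executed, deferred []string) {
	for _, o := range ops {
		if o.pending {
			deferred = append(deferred, o.key) // revisited in a later transaction
			continue
		}
		executed = append(executed, o.key) // would issue a dataplane request here
	}
	return
}

func main() {
	ex, def := execute([]op{
		{key: "config/vpp/v2/interfaces/client", pending: true},
		{key: "config/linux/interfaces/v2/interface/client", pending: false},
	})
	fmt.Println("executed:", ex)
	fmt.Println("deferred:", def)
}
```

So the 147ms of NB#0 is not just key-store writes: it includes every non-pending operation's dataplane call.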
> `0 - NBTransaction took 147ms`

The reported time for the NB transaction is 147ms.
I presumed, since we have a host of SB transactions configuring actual things, that the NB transaction is simply writing the desired config to the key store.
I think you are telling me that the `CREATE:` operations are doing actual work SB as part of the NB transaction in that 147ms?
Presumably the `CREATE [NOOP IS-PENDING]` and `CREATE [PROPERTY]` are not doing work as part of the NB transaction (though presumably the IS-PENDING ones will later).
What about the `CREATE [DERIVED]`?
> What about the CREATE [DERIVED] ?
Those are likely doing some dataplane requests as well, DERIVED from some NB model, e.g. setting the VRF table for an interface, adding IPs to an interface, adding an interface to a bridge domain, etc.
More detailed info in docs: Understanding the KV Scheduler transaction log
> I presumed since we have a host of SB transactions configuring actual things, that the NB transaction is simply writing to the key store for desired config.
Slightly cleaner gist: https://gist.github.com/edwarnicke/bb7c65e20d870036a2c9b3df8226b0a0 (suppressed the line prefixes with the info about it being vppagent logs, so it's just showing the actual logs from vppagent)
- NB transactions are done from clients (gRPC, KVDB..)
- SB transactions are simply events/notifications from the dataplane
OK... got it.
So when I read a line like:
`1 - SBNotification took 100µs`
Is that 100µs from when? From when the config request was made?
> Is that 100µs from when? From when the config request was made?
It's always since start of transaction, so when KVScheduler started processing it.
OK... so the vpp-related notifications are pretty quick (on the order of a few hundred µs), the kernel transactions are slow (on the order of 24ms), and the initial sending of configs (NB#0) is slow (on the order of 147ms).
> OK... so the vpp related notifications are pretty quick (order of a low hundreds of µs) and the kernel transactions are slow (order of 24ms), and the initial sending of configs (NB#0) is running slow (order of 147ms).
Not really.. it's not about vpp/kernel notifications, and actually not about the input values of transactions at all. This is about the executed operations in the transactions:

- `PENDING` - creating the vpp AF_PACKET interface, taking a little bit of time..

The things I can't reconcile are:

- Why, if NB#0 in the WaitDone == false case is 147ms, is it still taking ~500ms for me to get a gRPC return?
- Why, if NB#0 plus all the SB notifications are on the order of ~200ms, is it taking 3 seconds for me to get a gRPC return in the case where WaitDone == true?
Yes, this is the same observation I summarized in the original post. From a quick look into the code I have not noticed anything weird, so I will need to investigate further and test this locally...
Incidentally... if you are interested in recreating what I'm working with, you can simply check out:
https://github.com/networkservicemesh/cmd-forwarder-vppagent
and run
`docker run --privileged --rm $(docker build -q --target test .)`
(Note: the above is just using the older v3.1.0, i.e. without WaitDone even being available.)
From this part of the log it seems that most of the time was spent waiting until the linux host-ifname event was received.
```
time="2020-10-08 20:02:09.55478" level=info msg="Transaction 1 successful!" loc="kvscheduler/txn_process.go(405)" logger=kvscheduler
+======================================================================================================================+
| #2 - SBNotification |
+======================================================================================================================+
* transaction arguments:
    - seqNum: 2
    - type: SB Notification
    - values:
        - key: linux/interface/host-name/client-b0518374
          val: <EMPTY>
time="2020-10-08 20:02:09.97031" level=debug msg="publish add to ifplugin_ifstate{{client-b0518374-ba46-402b-9aba-5d37cd6f8584 false false vpp-interface-index} 0xc00087d310}" loc="mem/inmemory_name_mapping.go(277)" logger=vpp-ifplugin.if-descriptor
o----------------------------------------------------------------------------------------------------------------------o
* executed operations (2020-10-08 20:02:09.947 +0000 UTC -> 2020-10-08 20:02:09.97 +0000 UTC, dur: 24ms):
    1. CREATE [OBTAINED]:
        - key: linux/interface/host-name/client-b0518374
        - value: <EMPTY>
    2. CREATE [WAS-PENDING]:
        - key: config/vpp/v2/interfaces/client-b0518374-ba46-402b-9aba-5d37cd6f8584
        - value: { name:"client-b0518374-ba46-402b-9aba-5d37cd6f8584" type:AF_PACKET enabled:true rx_modes:{mode:ADAPTIVE} afpacket:{host_if_name:"client-b0518374"} }
    3. CREATE [DERIVED NOOP IS-PENDING]:
        - key: vpp/interface/client-b0518374-ba46-402b-9aba-5d37cd6f8584/rx-modes
        - value: { name:"client-b0518374-ba46-402b-9aba-5d37cd6f8584" type:AF_PACKET rx_modes:{mode:ADAPTIVE} }
x----------------------------------------------------------------------------------------------------------------------x
| #2 - SBNotification took 24ms |
x----------------------------------------------------------------------------------------------------------------------x
time="2020-10-08 20:02:09.97063" level=info msg="Transaction 2 successful!" loc="kvscheduler/txn_process.go(405)" logger=kvscheduler
```
EDIT: Okay, I think I found the cause... 💥 there is some forced delay for linux notifications: https://github.com/ligato/vpp-agent/blob/master/plugins/linux/ifplugin/descriptor/watcher.go#L42
I will figure out how to get rid of it, either by ignoring the inter-stages or something, but not today.. getting too late for me. I will let you know next week.
Interesting... for my own edification... how do I reconcile the 426ms difference between:

```
time="2020-10-08 20:02:09.55478" level=info msg="Transaction 1 successful!" loc="kvscheduler/txn_process.go(405)" logger=kvscheduler
```

and

```
time="2020-10-08 20:02:09.97063" level=info msg="Transaction 2 successful!" loc="kvscheduler/txn_process.go(405)" logger=kvscheduler
```

with

```
x----------------------------------------------------------------------------------------------------------------------x
| #2 - SBNotification took 24ms |
x----------------------------------------------------------------------------------------------------------------------x
```
@edwarnicke because there is time between transactions.. and if there is no transaction in the queue.. there is nothing to do..
> EDIT: Okay, I think I found the cause... 💥 there is some forced delay for linux notifications: https://github.com/ligato/vpp-agent/blob/master/plugins/linux/ifplugin/descriptor/watcher.go#L42
Woohoo! That looks like exactly the thing :)
> I will figure out how to get rid of it, either by ignoring the inter-stages or something, but not today.. getting too late for me. I will let you know next week.
I appreciate your patience working through this with me. Enjoy a well deserved rest :)
@edwarnicke Just merged PR #1742, which should fix this; please test the latest image and let me know the outcome.
Closing due to inactivity. The issue has already been fixed. If you continue to have problems, please open a new issue.
Related issue: #1732
Problem
Updating config via Configurator Update with WaitDone enabled takes almost 500ms, which seems too long: booting into a functional state takes much longer than acceptable, causing connection issues or dropped packets.

Original report from Ed Warnicke
> Question… in prep to checkout the new ‘return when done’ stuff, I was measuring some timing for the existing ‘return before done’ stuff. Any idea why the following takes slightly over 1s for the GRPC to return: https://gist.github.com/edwarnicke/ae3ee6f1509f2272935632bdc7d5fd49
> It is sent with the fullResync flag BTW
> But 1s feels long
> (especially since it's in the ‘return before done’ mode)
> commitServer time for vppagent update: 1.028992136s
> I’ve been testing your new WaitTillDone and that bumps it up to 3s… which feels really really high
> OK… so looking through the logs, it looked like part of the delay might have been waiting for vppagent to connect to vpp. So I put a sleep between starting vppagent and vpp and trying my request… that reduced the time down to 613ms for the update with WaitDone == true, and 150ms with WaitDone == false. Any idea why it's taking ~500ms to complete a super super super simple transaction?
> or for that matter, why writing 13 keys to a memory local data store is taking 150ms?
> By way of comparison, the non-vppagent time in the chain, which involves going from Client -> Forwarder -> NSE (two GRPC calls) and a fair bit of processing in Client and a fair bit of processing (including connection establishment and TLS authentication) in each is only 40ms.

Details
The total execution of the 4 transactions is just ~250ms, so the rest must be the time it takes for notifications to arrive ~or perhaps some sleeps in loops~.
Transaction log
```
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] +======================================================================================================================+
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] | #0 - NB Transaction FullResync |
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] +======================================================================================================================+
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] * transaction arguments:
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] - seqNum: 0
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] - type: NB Transaction, Full Resync
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] - values:
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] - key: config/linux/interfaces/v2/interface/client-196a9529-38ae-4a0a-bc6c-ed4e81c03887
Oct 8 14:34:26.189 [INFO] [cmd:vpp-agent] val: { name:"client-196a9529-38ae-4a0a-bc6c-ed4e81c03887" type:VETH namespace:
```

Full agent log
Cause
There seems to be a hard-coded delay for pushing the linux notifications to the KV scheduler: https://github.com/ligato/vpp-agent/blob/master/plugins/linux/ifplugin/descriptor/watcher.go#L42
Solution
Remove forced delay for linux interface notifications.