networkservicemesh / deployments-k8s


Memif interfaces have high latency #8602

Open · d-uzlov opened 1 year ago

d-uzlov commented 1 year ago

Expected Behavior

Latency should be consistent, regardless of the type of interface.

Current Behavior

When using the memif mechanism, there are occasional latency spikes.

See my comment for more details.

Steps to Reproduce

Here is the branch that contains instructions to run tests:

https://github.com/d-uzlov/deployments-k8s/tree/perf-test-1/perf-test

This branch also contains instructions for creating visual graphs, both for the results already included there and for results you generate by running the tests yourself.

Context

Initially I used a bare-metal machine from the Packet provider, but I was then able to replicate the issue in a VM on my local PC, so the issue appears to be hardware-independent.

For testing I was using virtual kind cluster(s).
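
For reference, a minimal kind configuration along these lines should be enough to reproduce the setup (this is a sketch for illustration; the exact cluster layout is described in the linked perf-test branch):

```yaml
# Hypothetical minimal kind cluster config for running the tests.
# The actual configuration used in the perf-test branch may differ.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker   # two workers so client and endpoint can land on different nodes
```

It can be created with `kind create cluster --config <file>`.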

d-uzlov commented 1 year ago

Graphs

The results are arranged roughly in the order that I tested things.

Each test case is represented here by a single graph. I captured many graphs from each test, but the results seem pretty consistent; there is not much variation.

![2023-02-13_19-17-42](https://user-images.githubusercontent.com/36223296/218463199-fc1739a7-9ea9-44bc-b99f-afba51aebb53.png)
![2023-02-13_19-19-26](https://user-images.githubusercontent.com/36223296/218463207-28f70a07-f1ad-4103-bf98-89dbb704a8fb.png)
![2023-02-13_20-05-17](https://user-images.githubusercontent.com/36223296/218466100-52bd1af9-fb74-43c9-9e8b-f6610488d965.png)
![2023-02-13_19-22-31](https://user-images.githubusercontent.com/36223296/218463243-3d1b692b-067c-4367-8409-7daff83931c8.png)
![2023-02-13_19-25-33](https://user-images.githubusercontent.com/36223296/218463392-e06ca7fa-c21b-42fb-a12f-f0af192ebe1b.png)
![2023-02-13_19-27-34](https://user-images.githubusercontent.com/36223296/218463404-046abd41-bf45-4b4b-851a-c8e1788b80e0.png)
![2023-02-13_19-32-36](https://user-images.githubusercontent.com/36223296/218463410-cc29b02f-053a-4546-bde1-8a18a90f8ad9.png)
![2023-02-13_19-34-25](https://user-images.githubusercontent.com/36223296/218463425-ac045647-d332-494b-b7b6-19f536fed697.png)

My notes

There seems to be a latency spike around 50 ms.

This latency spike is the same across these test scenarios:

Both the vl3 and nse-composition examples use the memif mechanism for communication between user endpoints.

Vl3 DNS seems to affect the results, but disabling it doesn't solve the issue. By the way, there is a recent update that seems to fix vl3 DNS latency, but I was testing a few commits before it was merged.

There are latency spikes even on unrelated interfaces if a pod has a connection to some vl3 network (see the "k8s service but pod has vl3 interface" graph).

The kernel2kernel example is not affected. The pure k8s service scenario is also not affected (when the pod is not connected to any memif interface). These two scenarios make me think that my testing method is valid and that the issue really is caused by memifs.

d-uzlov commented 1 year ago

Also, I forgot to mention that I have already tried disabling the CPU limit for the forwarder and switching all of the VPP interfaces into polling mode. This increases throughput and decreases typical query latency, but it doesn't affect these slow queries at all.

d-uzlov commented 1 year ago

Also, today I tested this with pure VPP, without NSM. I found something similar with VPP ping: typical ping latency is 0.1 ms, but periodically the latency spikes 10x, up to 1.1 ms. However, even a 10x spike is still 50 times lower than what I see with NSM.

I have also tried manually setting up 2 kernel interfaces connected via memif. There I get a typical latency of 2 ms, with a maximum latency of about 4 ms, so there are no measurable issues there.

d-uzlov commented 1 year ago

I have tried running a shorter and a longer test and see the same behavior: every 100 ms there is a delay of 50 ms.

A run with a duration of 0.5 s consistently got exactly 5 slow queries; a run with a duration of 20 s consistently got exactly 200 slow queries.

Graphs:

![2023-02-16_14-49-14](https://user-images.githubusercontent.com/36223296/219301571-23ea000c-2cfd-46df-a7c3-f4df7a1e34fa.png)
![2023-02-16_14-49-59](https://user-images.githubusercontent.com/36223296/219301598-13654882-0f88-4736-8c6e-4ecd10673c51.png)

d-uzlov commented 1 year ago

It seems I have found a way to fix the issue.

I tried disabling VPP statistics gathering in the forwarder, and I no longer see the consistent 50 ms delays. I have yet to test this change on a proper machine. On my dev machine I see very few spurious long queries, but these are most likely caused by the many other programs running on this machine; it's nothing like the consistent "50 ms every 100 ms" pattern I saw in previous tests. It should be noted that I switched all interfaces into polling mode for this test, so it likely affected the throughput, but the latency issue doesn't seem to be related to the interface RX mode.

Graphs:

![2023-02-16_14-54-53](https://user-images.githubusercontent.com/36223296/219302962-c95e91a2-db3a-4790-bde3-41f53301a926.png)

d-uzlov commented 1 year ago

To disable the stats I just inserted `return` statements at the very first line of these functions:

https://github.com/networkservicemesh/sdk-vpp/blob/36ad9b74b35754a640dad9a37c76c2b882be2d31/pkg/networkservice/stats/common.go#L41
https://github.com/networkservicemesh/sdk-vpp/blob/36ad9b74b35754a640dad9a37c76c2b882be2d31/pkg/networkservice/stats/common.go#L74

For a proper fix we will obviously need a better solution.

d-uzlov commented 1 year ago

I think I have found the root cause of the issue, and it's not related to vpp statistics at all.

Earlier I wrote that disabling CPU limits doesn't affect the issue:

> Also, I forgot to mention that I have already tried disabling the CPU limit for the forwarder and switching all of the VPP interfaces into polling mode.

Turns out, that was wrong. My mistake was that I only changed the limits on the forwarder, but in a vl3 network the forwarder is not the main component: vl3 traffic is managed by the VPP instance running inside the vl3 NSE.

When I disabled the limits on both the forwarder and the vl3 NSE, the latency issue was completely solved.


It seems the latency under CPU limits is a Kubernetes issue. By default, k8s enforces CPU limits in 100 ms periods. If the CPU limit is set to 500m, as we currently have in the vl3 NSE yaml, then in each 100 ms period the NSE is allowed to run for 50 ms in total (summed over all threads in the pod). As soon as the 50 ms budget is spent, all threads in the pod are stopped until the end of the current 100 ms period.
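
To make the arithmetic concrete, here is a sketch of the relevant part of the spec (the exact fields and values in the actual vl3 NSE yaml may differ):

```yaml
# Illustration only: how the current CPU limit maps to CFS throttling.
resources:
  limits:
    cpu: 500m   # 0.5 CPU * 100 ms period = 50 ms of runtime per period;
                # once this budget is spent, all threads in the pod are
                # throttled until the next period starts, which matches the
                # "50 ms delay every 100 ms" pattern in the graphs above
```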

It seems to me that it would be wise to disable CPU limits on both the forwarder and the vl3 NSE, and probably on any other deployment that runs a VPP instance.
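
A minimal sketch of that change, assuming we keep a CPU request so the scheduler still reserves capacity (the concrete values here are examples, not the ones from the real manifests):

```yaml
# Hypothetical resources section for forwarder-vpp / vl3 NSE with the CPU
# limit removed to avoid CFS throttling.
resources:
  requests:
    cpu: 500m        # keep a request so scheduling behavior is preserved
  limits:
    memory: 512Mi    # a memory limit can stay; just omit limits.cpu
```

Because the CPU request stays in place, this should only change the throttling behavior, not how pods are packed onto nodes.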