networkservicemesh / deployments-k8s


Scalability Testing: System NSM testing #1015

Open Bolodya1997 opened 3 years ago

Bolodya1997 commented 3 years ago

Test plan

Use cases:

  1. Client requests Local Endpoint.
  2. Client requests Remote Endpoint.

Test scenarios:

  1. Client requests Local Endpoint.
    • Setup
      1. Start NSM (Registry, NSMgr, Forwarder) on a single node.
      2. Start E endpoints implementing N network services on the same node.
    • Test
      1. Start C clients each requesting R network services on the same node.
  2. Client requests Remote Endpoint.
    • Setup
      1. Start NSM (Registry, NSMgr, Forwarder) on 2 nodes.
      2. Start E endpoints implementing N network services on the second node.
    • Test
      1. Start C clients each requesting R network services on the first node.

Tasks

  1. Investigate how to run an MD integration test with variable parameters, so that we can plot how the time, CPU, and memory usage of NSM components during the test depend on those parameters (a rough sketch follows the task list). Estimation: 3d
  2. Create an MD integration test for test scenario (1) with E, N, C, R as variable parameters and measure the time, CPU, and memory usage of all NSM components during the test. Estimation: 2d
  3. Create an MD integration test for test scenario (2) with E, N, C, R as variable parameters and measure the time, CPU, and memory usage of all NSM components during the test. Estimation: 1d
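A minimal sketch of how such a parameterized run could be driven, written here as a Go-style test rather than the markdown-based suites this repo actually uses; the parameter values and the deploy/measure helpers are hypothetical placeholders:

```go
package scale_test

import (
	"fmt"
	"testing"
)

// params mirrors the E, N, C, R variables from the test plan.
type params struct {
	Endpoints int // E
	Services  int // N
	Clients   int // C
	Requests  int // R: network services requested by each client
}

func TestScalabilityLocal(t *testing.T) {
	cases := []params{
		{Endpoints: 1, Services: 1, Clients: 1, Requests: 1},
		{Endpoints: 1, Services: 1, Clients: 15, Requests: 1},
		{Endpoints: 1, Services: 5, Clients: 3, Requests: 1},
	}
	for _, p := range cases {
		p := p
		t.Run(fmt.Sprintf("E=%d_N=%d_C=%d_R=%d", p.Endpoints, p.Services, p.Clients, p.Requests), func(t *testing.T) {
			// Hypothetical helpers: generate manifests from p, apply them,
			// wait for the connections, then pull metrics for the test window.
			deployEndpoints(t, p)
			deployClients(t, p)
			collectMetrics(t, p)
		})
	}
}

func deployEndpoints(t *testing.T, p params) { /* apply E endpoint manifests implementing N services */ }
func deployClients(t *testing.T, p params)   { /* apply C client manifests, each requesting R services */ }
func collectMetrics(t *testing.T, p params)  { /* query Prometheus for the test window and save plots */ }
```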

Estimation

6d

d-uzlov commented 3 years ago

Here is my progress so far:

I have an automated test with the following scenario:

  1. Deploy NSM
  2. Deploy Prometheus
  3. Set test params (numbers E, N, C from the issue description)
  4. Generate configs for clients and endpoints
  5. Deploy everything
  6. Wait for all connections to be established successfully
  7. Wait 30 seconds to make sure Prometheus has scraped the pods
  8. Delete everything
  9. Wait 30 seconds more
  10. Get data from Prometheus for test time window and generate plots
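For step 10, a minimal sketch of pulling the test-window data from Prometheus (Go, using the prometheus/client_golang API client); the Prometheus address, the nsm-system namespace and the exact PromQL query are assumptions, not necessarily what the test uses:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes Prometheus is reachable here (e.g. via `kubectl port-forward`).
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	prom := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Test time window: from deployment start to shortly after cleanup.
	end := time.Now()
	start := end.Add(-10 * time.Minute)

	// Per-pod CPU usage of NSM components (cAdvisor metric); the namespace
	// label is an assumption about where NSM is deployed.
	query := `sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="nsm-system"}[1m]))`

	result, warnings, err := prom.QueryRange(ctx, query, v1.Range{
		Start: start,
		End:   end,
		Step:  15 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// The result is a matrix of time series (one per pod), ready for plotting.
	fmt.Println(result)
}
```

Memory could be pulled the same way with container_memory_working_set_bytes.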

Plots' appearance is still WIP. Here is how the plots from the local case currently look:

1 service, 1 client, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125081872-b9dd6800-e0f0-11eb-9c21-71d4d19f3428.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125081875-ba75fe80-e0f0-11eb-8375-a4ad068df4f1.png)
  • Test CPU usage: ![test-cpu](https://user-images.githubusercontent.com/36223296/125081878-bb0e9500-e0f0-11eb-85c1-b7187096f444.png)
  • Test memory usage: ![test-mem](https://user-images.githubusercontent.com/36223296/125081881-bba72b80-e0f0-11eb-965a-f221cecb5d47.png)

1 service, 5 clients, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125082034-e5f8e900-e0f0-11eb-83a1-c45aeb0dc758.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125082037-e6917f80-e0f0-11eb-8146-c2c85c662efd.png)
  • Test CPU usage: ![test-cpu](https://user-images.githubusercontent.com/36223296/125082041-e72a1600-e0f0-11eb-85a0-3d90f3f22403.png)
  • Test memory usage: ![test-mem](https://user-images.githubusercontent.com/36223296/125082043-e72a1600-e0f0-11eb-9276-06526644f5a1.png)

1 service, 15 clients, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125082205-235d7680-e0f1-11eb-8877-bec8e37cc95a.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125082208-23f60d00-e0f1-11eb-984a-59c550d7c173.png)
  • Test CPU usage: ![test-cpu](https://user-images.githubusercontent.com/36223296/125082211-248ea380-e0f1-11eb-930f-f1c66ff7673b.png)
  • Test memory usage: ![test-mem](https://user-images.githubusercontent.com/36223296/125082213-248ea380-e0f1-11eb-904c-fbe68608ebf6.png)

1 service, 30 clients, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125082499-7c2d0f00-e0f1-11eb-8090-ab639d28a26d.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125082500-7cc5a580-e0f1-11eb-9ea2-8d673794c133.png)
  • Test CPU usage: ![test-cpu](https://user-images.githubusercontent.com/36223296/125082496-7b947880-e0f1-11eb-9e6b-b009c1d60959.png)
  • Test memory usage: ![test-mem](https://user-images.githubusercontent.com/36223296/125082501-7cc5a580-e0f1-11eb-9f1b-1208c3999832.png)

Plots certainly need more work. The remote case is basically ready too (well, as much as the local case is ready, with all the remaining work being on the plots). We can also run comparisons with more configurations, like "1 NS, 5 NSC vs 5 NS, 1 NSC", but I didn't have time to run them today.

denis-tingaikin commented 3 years ago

/cc @edwarnicke

denis-tingaikin commented 3 years ago

It seems to me this is a good first step. I think we can also measure Request latency and ping latency under load. @edwarnicke, @d-uzlov What do you think?

d-uzlov commented 3 years ago

An update on current status:

I have improved the plot time resolution. Here are a few examples:

Local case, 1 service, 1 client, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125485075-8982e86f-526c-4944-8fec-cad81f571d8e.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125485143-b4607738-1329-44a7-8445-1e8b7987a654.png)

Local case, 1 service, 15 clients, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125485304-dec8e547-1223-4815-be2b-15b994fb4f8a.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125485336-b02a5ed1-d0ae-464e-ad69-03daad6a9286.png)
  • Test CPU usage: ![test-cpu](https://user-images.githubusercontent.com/36223296/125485480-cc2e9717-2e69-4e6a-b7da-b8dca2363c53.png)
  • Test memory usage: ![test-mem](https://user-images.githubusercontent.com/36223296/125485495-de2fef86-1c26-458c-848f-8f568b0c15d7.png)

Local case, 5 services, 3 clients, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125485565-d5c56a79-3465-4a60-87b8-0c998fad1683.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125485571-228c71d7-867c-4b8a-ba9f-92a6058a56fa.png)
  • Test CPU usage: ![test-cpu](https://user-images.githubusercontent.com/36223296/125485574-796fbe7d-f0ed-48db-a2c5-2326311296c3.png)
  • Test memory usage: ![test-mem](https://user-images.githubusercontent.com/36223296/125485595-85d3f631-b00c-4b8d-8287-eb9abcd4a6ad.png)

Remote case, 1 service, 15 clients, 1 endpoint:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125485766-7b719ac3-0049-48b7-bcb1-d93ecbf4a08b.png)
  • NSM memory usage: ![nsm-mem](https://user-images.githubusercontent.com/36223296/125485783-db9f8534-05bf-4424-9f98-dcfad5879eb1.png)
  • Test CPU usage: ![test-cpu](https://user-images.githubusercontent.com/36223296/125485793-d4868634-9070-49a3-8923-6658749ec34a.png)
  • Test memory usage: ![test-mem](https://user-images.githubusercontent.com/36223296/125485798-9f56ea11-3eb6-4d5e-850b-83a782dfa506.png)

d-uzlov commented 3 years ago

Also, here is a list of issues that I found while I was running scalability tests:

This issue is also related to scalability testing because it's basically impossible to read our current logs after scalability tests:

d-uzlov commented 3 years ago

Another update:

  1. There are now 2 separate cases, orthogonal to local/remote cases:
    • First delete clients and then endpoints. This way heal doesn't happen.
    • First delete endpoints and then clients. This way heal always starts for all connections.
  2. I think I fixed all issues with the plot appearance.

Here are examples:

Local case, 1 service, 15 clients, 1 endpoint, without healing:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125634649-6a2bb0e7-ba19-448d-abaf-e298ca833359.png)
  • Test CPU usage per pod: ![test-cpu](https://user-images.githubusercontent.com/36223296/125634717-cfc7f78b-6f68-484d-b075-b8900b786db6.png)
  • Average test CPU usage: ![test-cpu-average](https://user-images.githubusercontent.com/36223296/125634784-634ff768-1195-4aa0-9411-d0fe366c18af.png)

Local case, 1 service, 15 clients, 1 endpoint, with healing:
  • NSM CPU usage: ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125634838-8587c18a-102f-4177-b1ce-c8aef1a3f0e8.png)
  • Average test CPU usage: ![test-cpu-average](https://user-images.githubusercontent.com/36223296/125634872-93558edf-ae45-4f87-8f64-942f803493ef.png)

d-uzlov commented 3 years ago

Another update:

A few notes from these results:

1. I find it very interesting that the "15 clients, 5 endpoints" case is much more CPU intensive during healing than the "15 clients, 1 endpoint" case:
   ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125782111-2db0c02b-2d4b-43a6-b3af-7ea1c97902d9.png)
   ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125782186-58f47a88-8564-4bcb-8d11-8b5df4cc8cd6.png)
2. In cases where I'm maxing out my PC, spire actually consumes a very substantial amount of CPU:
   ![nsm-cpu](https://user-images.githubusercontent.com/36223296/125782332-ed693cce-944e-45c3-8b8a-f0913d884055.png)

d-uzlov commented 3 years ago

I have a few more improvements to the test scenarios in mind:

  1. Currently I check that a client successfully connected by monitoring its ip interface list. It might be more accurate to check the logs for a request success message, because there could be some delay between interface creation and request success (see the sketch after this list).
  2. Currently I create and delete all clients/endpoints at once. Alternatively we could create them in batches, and maybe do several iterations of create/delete/create. This would give better coverage of healing.
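A minimal sketch of the two checks mentioned in item 1 (Go shelling out to kubectl; the namespace, pod name, interface-name prefix and log message are illustrative assumptions):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hasNSMInterface checks for an NSM-created interface in the client pod.
// The "nsm" interface-name prefix is an assumption for illustration.
func hasNSMInterface(namespace, pod string) (bool, error) {
	out, err := exec.Command("kubectl", "exec", "-n", namespace, pod, "--", "ip", "addr").CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("kubectl exec failed: %w: %s", err, out)
	}
	return strings.Contains(string(out), "nsm"), nil
}

// requestSucceeded greps the client logs for a success marker.
// The marker string is hypothetical; the real one depends on the client's logging.
func requestSucceeded(namespace, pod string) (bool, error) {
	out, err := exec.Command("kubectl", "logs", "-n", namespace, pod).CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("kubectl logs failed: %w: %s", err, out)
	}
	return strings.Contains(string(out), "request done"), nil
}

func main() {
	ok, err := hasNSMInterface("ns-scale", "nsc-0")
	fmt.Println("interface check:", ok, err)
	ok, err = requestSucceeded("ns-scale", "nsc-0")
	fmt.Println("log check:", ok, err)
}
```
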
d-uzlov commented 3 years ago

I added a check for the end of the request, and it seems like requests actually end right after the interface is created. I think the check is still worth keeping, to make sure we don't add elements that delay the request on its way back.

d-uzlov commented 3 years ago

I have added "clients restart" and "endpoints restart" cases. See linked pull request for details.

d-uzlov commented 3 years ago

Examples of plots after the scalability heal test:
![nsm-cpu](https://user-images.githubusercontent.com/36223296/126157341-d533c92f-300d-48ae-8667-fb577adfb97b.png)
![nsm-mem](https://user-images.githubusercontent.com/36223296/126157343-fd382a11-8b94-4622-81ec-0ee63c1b5333.png)
![test-cpu](https://user-images.githubusercontent.com/36223296/126157346-3f0c206c-2d89-46ee-8031-0e82969331c0.png)
![test-cpu-sum](https://user-images.githubusercontent.com/36223296/126157350-205abc1f-1236-43d3-ad33-774a230404c5.png)
![test-mem](https://user-images.githubusercontent.com/36223296/126157356-9160ff1b-4939-4ee2-8977-0ad3ca2511d8.png)
![test-mem-sum](https://user-images.githubusercontent.com/36223296/126157359-e9a114a4-fe7b-4ff7-8a98-451338ae5983.png)

edwarnicke commented 3 years ago

@d-uzlov Why is it taking 15 seconds to establish or heal a connection? Where are we leaking all that time? I would expect 100s of milliseconds... not 15 seconds...

d-uzlov commented 3 years ago

> Why is it taking us 15s to get a local connection going? That feels very long in terms of latency given that the latency through the cmd-forwarder-vpp is order of 100s of ms… where are we leaking all that time?

I checked it, and I think that the actual reason is that NSEs take some time to start. It takes ~4 seconds for an NSE to get its SVID. If we have some load on the system, this time can probably be higher. Registration of an NSE happens almost instantly, but maybe in some runs of the test there could also be some delay there. In the current version of the test I basically wait only for the creation of the NSE container and then deploy clients, which adds load and can increase the time an NSE takes to obtain an SVID, and maybe also the delay before NSE registration.

I ran the test with a sleep after endpoints deployment, and with this change a request takes ~0.5 seconds (see the sketch below).

It may be caused by synchronization between the spire server and spire agents. It can be mitigated, but it's unclear whether we should do that during our tests.
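One way to avoid a hardcoded sleep would be to wait for the endpoints explicitly before deploying the clients. A minimal sketch (Go shelling out to kubectl; the namespace, label selector and timeout are assumptions, and since pod readiness doesn't prove the NSE already got its SVID and registered, a small settle delay is kept as a fallback):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func waitForEndpoints(namespace, selector string) error {
	// Wait until all NSE pods report Ready (label selector is an assumption).
	out, err := exec.Command("kubectl", "wait", "--for=condition=Ready",
		"pod", "-l", selector, "-n", namespace, "--timeout=120s").CombinedOutput()
	if err != nil {
		return fmt.Errorf("endpoints not ready: %w: %s", err, out)
	}
	// Readiness alone doesn't guarantee the NSE already obtained its SVID
	// and registered, so keep a short settle delay on top of it.
	time.Sleep(5 * time.Second)
	return nil
}

func main() {
	if err := waitForEndpoints("ns-scale", "app=nse-kernel"); err != nil {
		panic(err)
	}
	fmt.Println("endpoints are ready, deploying clients")
}
```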

d-uzlov commented 3 years ago

> @d-uzlov Why is it taking 15 seconds to establish or heal a connection? Where are we leaking all that time? I would expect 100s of milliseconds... not 15 seconds...

I modified the tests to include more info about how much time each step takes.

Here are results for a few runs:

1 service, 1 client, 1 endpoint:
  • ![nsm-cpu](https://user-images.githubusercontent.com/36223296/126472873-8effc6d5-9fca-4ffc-b5fb-c6cc7cd60e12.png)
  • ![test-cpu](https://user-images.githubusercontent.com/36223296/126472962-ff2f0323-7459-4eaa-8475-9f93e88426bf.png)

3 services, 5 clients, 1 endpoint:
  • ![nsm-cpu](https://user-images.githubusercontent.com/36223296/126473011-0b8b0c38-582f-44cd-80a6-316fa1935903.png)
  • ![test-cpu](https://user-images.githubusercontent.com/36223296/126473121-6784ff04-5d0b-44f9-87bd-7da3a276f89c.png)

Here we can see that healing actually takes 1-2 seconds to connect to a new running endpoint when there is no load. Realistically it could be on the order of 100s of milliseconds, because the precision of my measurements is ±1 second. However, when the system is under load, we have some issues. In the "5 clients, 3 requests per client" case, healing took ~10 seconds after the endpoint started. Even worse: I actually have trouble running tests with more clients/requests. The "10 clients, 3 requests per client" case just never succeeds. Granted, I run the tests in kind under WSL, so my results may not be representative of performance on a real system, but it still looks very strange. Even increasing the healing timeout to 5 minutes doesn't help: it seems like some of the connections just never heal.

I guess it would be interesting to deploy new endpoints before deleting old endpoints. It would probably be closer to real-life cases of healing.

d-uzlov commented 3 years ago

I don't think that any particular part of healing takes a lot of time. It's probably just that the system as a whole becomes slow, so healing takes much longer than it should.

edwarnicke commented 3 years ago

@d-uzlov What system are you running on? Have you tried running this on packet?

edwarnicke commented 3 years ago

@d-uzlov The thing is... this whole thing looks very very very much like we have some poorly thought-out global Mutex in the NSMgr somewhere...

edwarnicke commented 3 years ago

@d-uzlov Do you have a sense of how much time we are spending in each component during heal?

d-uzlov commented 3 years ago

My CPU is a Ryzen 1700 @ 3.6 GHz. I run the tests in a WSL2 instance which has 16 GB of RAM. RAM is never maxed out, but the CPU is maxed out in heavy tests. Though, I have a feeling that half of the CPU load could be going into VM overhead and kind management.

> Do you have a sense of how much time we are spending in each component during heal?

Not yet. We can try taking CPU profile snapshots, but they will likely be messy. Maybe it would help to run tests with modified logs. I haven't touched them recently, but I think they should provide enough per-connection info to tell which components take the most time.

I believe Vlad is already using modified logs for the NSM high load task, so maybe he can already provide more info on what's happening in the nsmgr when we have a lot of connections.

edwarnicke commented 3 years ago

@d-uzlov I'd suggest you try on packet first and soonest... I'm betting the issue we are chasing is an artifact of your local env (laptops are good, but running a whole cluster in one for perf/scaling is asking a lot).

edwarnicke commented 3 years ago

@d-uzlov All of that said... you are doing a great job on the graphs and chasing stuff down :)

d-uzlov commented 3 years ago

Ok, I'll try running the tests on Packet. I have just read that Vlad found that packet is working fine under load, so maybe it's really just a local env or kind issue.

d-uzlov commented 3 years ago

I tried running the tests on packet, and it seems to also have issues with healing.

Here's how a typical run of the scalability heal test looks:

Run 1:
  • ![nsm-cpu](https://user-images.githubusercontent.com/36223296/126643083-dee1e7d4-aa8b-44b9-86ed-e6fde2f18c51.png)

Run 2 (failed, heal never succeeded):
  • ![nsm-cpu](https://user-images.githubusercontent.com/36223296/126642816-7e76dfea-49d1-43a3-8c34-138aa6cb2abc.png)

As you can see, when heal succeeds, it can take a long time. And often it just never succeeds, while the CPU load stays very high. Actually, the CPU load when heal starts is usually close to (number of connections) × 1 core. Initial requests also take a substantial amount of time.

I tried to gather logs and CPU profiles, but I ran into a few issues integrating their collection into the NSM components and the tests, and I spent some time solving them, so I don't have them yet.
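For reference, a minimal sketch of how a 60-second CPU profile could be pulled from a component, assuming the component was rebuilt to expose net/http/pprof on some port (the port, address and file name here are assumptions; this is not something the stock images expose):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchCPUProfile grabs a CPU profile from a pprof endpoint and stores it
// so it can later be inspected with `go tool pprof`.
func fetchCPUProfile(addr, outFile string, seconds int) error {
	// Assumes the component serves net/http/pprof and the port is reachable,
	// e.g. via `kubectl port-forward`.
	url := fmt.Sprintf("http://%s/debug/pprof/profile?seconds=%d", addr, seconds)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}

	f, err := os.Create(outFile)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	if err := fetchCPUProfile("localhost:6060", "nsmgr-cpu.pprof", 60); err != nil {
		panic(err)
	}
	fmt.Println("profile saved to nsmgr-cpu.pprof")
}
```

The saved file can then be inspected with `go tool pprof nsmgr-cpu.pprof`.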

d-uzlov commented 3 years ago

Here are full logs and profiles from 2 failed runs with high load. These runs have identical setups but slightly different results. All profiles cover a period of 60 seconds, starting right after the Delete endpoints-0... mark.

Run 1 [run_1-netsvc=3-nse=1-nsc=10-remote.zip](https://github.com/networkservicemesh/deployments-k8s/files/6868607/run_1-netsvc.3-nse.1-nsc.10-remote.zip) ![nsm-cpu](https://user-images.githubusercontent.com/36223296/126779807-1d35727d-9c96-40ee-b45e-bccc52447200.png)
Run 2 [run_2-netsvc=3-nse=1-nsc=10-remote.zip](https://github.com/networkservicemesh/deployments-k8s/files/6868610/run_2-netsvc.3-nse.1-nsc.10-remote.zip) ![nsm-cpu](https://user-images.githubusercontent.com/36223296/126779831-22fdb214-e680-4178-82d7-59fc35013014.png)

As you can see, the second run has CPU usage spikes after the end of the test. This happens randomly: some runs have it, some don't. Another interesting thing to notice is that on the first run, when healing starts, there is only one spike, from the nsmgr. On the second run there are:

  1. One gigantic spike on forwarder
  2. Two middle-sized spikes on one nsmgr and on another forwarder
  3. One small spike on another nsmgr.

This also happens randomly, and it probably has the same cause as spikes after the end of the test, but I didn't really keep statistics of it, so I may be wrong here.

I tried to analyze the logs and profiles, but I didn't manage to find anything interesting quickly, and decided to gather more data. I didn't have time to read the logs that are linked to this message. I'll research it later.

d-uzlov commented 3 years ago

I found out that my modified logs actually broke healing, and that's the reason why it didn't work for me. However, the CPU spikes and the long delay for requests are not related to this issue.

Here are an example plot and full logs of a scalability heal test run after I fixed my issue: [run_4-netsvc=3-nse=1-nsc=10-remote.zip](https://github.com/networkservicemesh/deployments-k8s/files/6877724/run_4-netsvc.3-nse.1-nsc.10-remote.zip) ![nsm-cpu](https://user-images.githubusercontent.com/36223296/126981056-b8a54f48-5f57-47b2-82d5-a431545f8b30.png)
d-uzlov commented 3 years ago

Here is an example plot from the scalability heal test with my local fixes for both the CPU spikes and the request delay:
![nsm-cpu](https://user-images.githubusercontent.com/36223296/127123538-5cc0e459-396c-499a-988a-05d9e9a6f062.png)

In this image we can actually still see a relatively long delay before the initial requests finish and before heal finishes. This delay is actually an entirely different issue: the scalability tests parse the results of kubectl exec calls, one for each of the pods in the test, and the exec calls have a substantial delay, ~2 s per call, which adds up to ~20 s for this test with 10 clients. I'm not yet sure what we can do about this delay.

d-uzlov commented 3 years ago

The issue with kubectl exec delay is more or less fixed:

Run on local kind ![nsm-cpu](https://user-images.githubusercontent.com/36223296/127431584-6b68d983-398d-44b5-927f-17fd71f430f5.png)
Run on packet before fix ![nsm-cpu](https://user-images.githubusercontent.com/36223296/127431609-e03bbbed-469e-4513-9a40-3e68e53f253c.png)
Run on packet after fix ![nsm-cpu](https://user-images.githubusercontent.com/36223296/127431566-71eb8553-3f87-4394-9d80-58340fcfcb30.png)

We still have some delay, but it is constant; we no longer spend clients count * delay to check everything. I think we can tolerate 1-2 s of constant delay for measuring.
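The fix itself isn't shown above, so purely as an illustration of why the delay can become constant: if the per-pod kubectl exec checks run concurrently instead of one after another, the wall-clock cost is roughly one call rather than clients count * delay. A minimal sketch (Go; the namespace and pod names are placeholders):

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// checkPod runs the per-pod connectivity check; errors are collected, not fatal.
func checkPod(namespace, pod string) error {
	out, err := exec.Command("kubectl", "exec", "-n", namespace, pod, "--", "ip", "addr").CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s: %w: %s", pod, err, out)
	}
	return nil
}

func main() {
	pods := []string{"nsc-0", "nsc-1", "nsc-2"} // placeholder client pod names

	var wg sync.WaitGroup
	errs := make(chan error, len(pods))
	for _, pod := range pods {
		pod := pod
		wg.Add(1)
		go func() {
			defer wg.Done()
			// All exec calls run in parallel, so the ~2 s per-call latency
			// is paid once instead of once per client.
			errs <- checkPod("ns-scale", pod)
		}()
	}
	wg.Wait()
	close(errs)

	for err := range errs {
		if err != nil {
			fmt.Println("check failed:", err)
		}
	}
}
```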