Bolodya1997 opened 3 years ago
Here is my progress so far:
I have an automated test with the following scenario:
Plots' appearance is still WIP. Here is how plots from the local case currently look:
Plots certainly need more work.
The remote case is basically ready too (well, as much as the local case is ready, with all the work left to do with the plots).
We can also run comparisons with other configurations, like "1 NS, 5 NSC" vs "5 NS, 1 NSC", but I didn't have time to run them today.
/cc @edwarnicke
It seems to me this is a good first step. I think we can also measure `Request` latency and ping latency in Load.
@edwarnicke, @d-uzlov What do you think?
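A rough sketch of what measuring `Request` latency and ping latency could look like, assuming the gRPC client interface from github.com/networkservicemesh/api; the ping is done by shelling out, and everything here is illustrative rather than the actual test code:

```go
package main

import (
	"context"
	"os/exec"
	"time"

	"github.com/networkservicemesh/api/pkg/api/networkservice"
)

// measureRequest times a single NSM Request and returns the connection and the elapsed time.
func measureRequest(ctx context.Context, nsc networkservice.NetworkServiceClient,
	req *networkservice.NetworkServiceRequest) (*networkservice.Connection, time.Duration, error) {
	start := time.Now()
	conn, err := nsc.Request(ctx, req)
	return conn, time.Since(start), err
}

// measurePing times a single ICMP ping to the endpoint IP by shelling out to ping.
func measurePing(ctx context.Context, ip string) (time.Duration, error) {
	start := time.Now()
	err := exec.CommandContext(ctx, "ping", "-c", "1", "-W", "1", ip).Run()
	return time.Since(start), err
}

func main() {
	// Inside the load scenario this would be called per client, per request:
	//   conn, reqLatency, err := measureRequest(ctx, nsc, request)
	//   pingLatency, err := measurePing(ctx, dstIP)
	_ = measureRequest
	_ = measurePing
}
```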
An update on current status:
I have improved the plot time resolution. Here are a few examples:
Also, here is a list of issues that I found while I was running scalability tests:
This issue is also related to scalability testing because it's basically impossible to read our current logs after scalability tests:
Another update:
Here are examples:
Another update:
A few notes from these results:
I have a few more improvements to the test scenarios in mind:
I made a check for request end, and it seems like requests actually end right after the interface is created. I think it is still worth keeping, to check that we don't add elements that delay the request on its way back.
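A minimal sketch of such a check, assuming the kernel interface name can be taken from the returned connection's mechanism (the name used here is hypothetical):

```go
package main

import (
	"fmt"
	"net"
)

// interfaceExists reports whether the kernel interface created for the
// connection is already present at the time of the call.
func interfaceExists(name string) bool {
	_, err := net.InterfaceByName(name)
	return err == nil
}

func main() {
	// "nsm-1" is a hypothetical interface name taken from the connection mechanism.
	fmt.Println("interface present right after Request returned:", interfaceExists("nsm-1"))
}
```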
I have added "clients restart" and "endpoints restart" cases. See linked pull request for details.
@d-uzlov Why is it taking 15 seconds to establish or heal a connection? Where are we leaking all that time? I would expect 100s of milliseconds... not 15 seconds...
Why is it taking us 15s to get a local connection going? That feels very long in terms of latency, given that the latency through the cmd-forwarder-vpp is on the order of 100s of ms… where are we leaking all that time?
I checked it, and I think the actual reason is that NSEs take some time to start. It takes ~4 seconds for an NSE to get its SVID, and if we have some load on the system, this time can probably be higher. Registration of an NSE happens almost instantly, but maybe in some runs of the test there could also be some delay there. In the current version of the test I basically wait for the creation of the NSE container and then deploy clients, which adds load and can increase the time an NSE takes to obtain an SVID, and maybe it can also increase the delay for NSE registration.
I ran the test with a `sleep` after endpoints deployment, and with this change a request takes ~0.5 seconds.
It may be because of synchronization between spire server and spire agents. It can be mitigated, but it's unclear if we should do it during our tests.
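One way the fixed `sleep` could be replaced with a deterministic wait is to poll the SPIFFE Workload API until an SVID is actually issued. This is only a sketch of that idea, not the current test code; the agent socket path is an assumption and depends on how spire-agent is deployed:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

// waitForSVID polls the Workload API until an X.509 SVID is issued or ctx expires.
func waitForSVID(ctx context.Context, socket string) error {
	for {
		fetchCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		svid, err := workloadapi.FetchX509SVID(fetchCtx, workloadapi.WithAddr(socket))
		cancel()
		if err == nil {
			log.Printf("SVID issued: %s", svid.ID.String())
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(500 * time.Millisecond):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// Assumed agent socket path; adjust to the actual spire-agent deployment.
	if err := waitForSVID(ctx, "unix:///run/spire/sockets/agent.sock"); err != nil {
		log.Fatal(err)
	}
}
```

This would have to run inside (or via exec into) the endpoint pod, so whether it belongs in the tests or in the NSE startup path is exactly the open question above.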
@d-uzlov Why is it taking 15 seconds to establish or heal a connection? Where are we leaking all that time? I would expect 100s of milliseconds... not 15 seconds...
I modified the tests to include more info about how much time each step takes.
Here we can see that healing actually takes 1-2 seconds to connect to a new running endpoint when there is no load. Realistically it could be on the order of 100s of milliseconds, because the precision of my measurements is ±1 second.
However, when the system is under load, we have some issues. In the `5 clients, 3 requests per client` case, healing took ~10 seconds after the endpoint started.
Even worse: I actually have trouble running tests with more clients/requests. The `10 clients, 3 requests per client` case just never succeeds. Granted, I run the tests in kind on WSL, so my results may not be representative of performance on a real system, but it still looks very strange. Even increasing the healing timeout to 5 minutes doesn't help: it seems like some of the connections just never heal.
I guess it would be interesting to deploy new endpoints before deleting old endpoints. It would probably be closer to real-life cases of healing.
I don't think that anything in particular in healing takes a lot of time. It's probably just that the system as a whole becomes slow, so healing takes much more time than it should.
@d-uzlov What system are you running on? Have you tried running this on packet?
@d-uzlov The thing is... this whole thing looks very very very much like we have some poorly thought-out global Mutex in the NSMgr somewhere...
@d-uzlov Do you have a sense of how much time we are spending in each component during heal?
My CPU is a Ryzen 1700 @ 3.6 GHz. I run the tests in a WSL2 instance which has 16 GB of RAM. RAM is never maxed out, but the CPU is maxed out in heavy tests. Though, I have a feeling that half of the CPU load could be going into VM overhead and kind management.
Do you have a sense of how much time we are spending in each component during heal?
Not yet. We can try taking CPU profile shots, but they will likely be messy. Maybe it would help to run tests with modified logs. I haven't touched them recently, but I think they should provide enough per-connection info to tell which components take the most time.
I believe Vlad is already using modified logs for NSM high load task, maybe he can already provide more info on what's happening in the nsmgr when we have a lot of connections.
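For the CPU profile shots, one low-effort option (a sketch, assuming we are free to add a debug HTTP listener to the nsmgr/forwarder builds used in the tests; the port is arbitrary) is the standard `net/http/pprof` endpoint:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on http.DefaultServeMux
)

func main() {
	go func() {
		// 6060 is an arbitrary choice; any free port inside the pod works.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the component's normal main() would continue here ...
	select {}
}
```

With that in place, a profile covering a chosen window can be pulled during a test run with something like `go tool pprof -seconds 60 http://localhost:6060/debug/pprof/profile` after port-forwarding to the pod.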
@d-uzlov I'd suggest you try on packet first and soonest... I'm betting the issue we are chasing is an artifact of your local env (laptops are good, but running a whole cluster in one for perf/scaling is asking a lot).
@d-uzlov All of that said... you are doing a great job on the graphs and chasing stuff down :)
Ok, I'll try running the tests on Packet. I have just read that Vlad found that packet is working fine under load, so maybe it's really just a local env or kind issue.
I tried running the tests on packet, and it seems to also have issues with healing.
As you can see, when heal succeeds, it can take a long time.
And often it just never succeeds, and the CPU load is very high. Actually, the CPU load when heal starts is usually near `number of connections * 1 core`.
Initial requests also take a substantial amount of time.
I tried to gather logs and CPU profiles, but I ran into a few issues integrating their gathering into the NSM components and the tests, and I spent some time solving them, so I don't have them yet.
Here are full logs and profiles from 2 failed runs with high load.
These runs have an identical setup but slightly different results.
All profiles cover a period of 60 seconds, starting right after the `Delete endpoints-0...` mark.
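For reference, capturing a fixed window like this in-process could look roughly like the sketch below (the output path and the trigger point are assumptions; the actual gathering in the tests may be done differently):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

// captureCPUProfile writes a CPU profile covering exactly the given duration.
func captureCPUProfile(path string, d time.Duration) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	defer pprof.StopCPUProfile()

	time.Sleep(d) // profile exactly this window
	return nil
}

func main() {
	// Would be triggered right after the "Delete endpoints-0..." mark in the test.
	if err := captureCPUProfile("nsmgr-heal.pprof", 60*time.Second); err != nil {
		log.Fatal(err)
	}
}
```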
As you can see, the second run has CPU usage spikes after the end of the test. This happens randomly: some runs have it, some don't. Another interesting thing to notice is that on the first run, when healing starts, there is only 1 spike from the nsmgr. On the second run there are:
This also happens randomly, and it probably has the same cause as the spikes after the end of the test, but I didn't really keep statistics on it, so I may be wrong here.
I tried to analyze the logs and profiles, but I didn't manage to find anything interesting quickly, and decided to gather more data. I didn't have time to read the logs that are linked to this message. I'll research it later.
I found out that my modified logs actually broke healing, and that's the reason why it didn't work for me. However, the CPU spikes and the long delay for requests are not related to this issue.
In this image we can actually still see a relatively long delay before the initial requests finish and before heal finishes. This delay is actually an entirely different issue: the scalability tests parse the results of a `kubectl exec` call, one for each of the pods in the test, and exec calls have a substantial delay, ~2s per call, with the total delay being ~20s for this test with 10 clients.
I'm not yet sure what we can do about this delay.
The issue with the `kubectl exec` delay is more or less fixed:
We still have some delay, but it is constant; we no longer spend `clients count * delay` to check everything. I think we can tolerate 1-2s of constant delay for measuring.
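The idea behind keeping the delay constant (a sketch only; the actual change may differ, and the pod names and the checked command here are placeholders) is to run the per-client `kubectl exec` checks concurrently, so the total wait is roughly one exec round-trip instead of one per client:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"

	"golang.org/x/sync/errgroup"
)

// checkClients runs one kubectl exec per client pod, all in parallel.
func checkClients(ctx context.Context, pods []string) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, pod := range pods {
		pod := pod // capture loop variable for the goroutine
		g.Go(func() error {
			// The command actually run inside the pod is test-specific; "true" is a stand-in.
			out, err := exec.CommandContext(ctx,
				"kubectl", "exec", pod, "--", "true").CombinedOutput()
			if err != nil {
				return fmt.Errorf("%s: %v: %s", pod, err, out)
			}
			return nil
		})
	}
	return g.Wait()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	pods := []string{"nsc-0", "nsc-1", "nsc-2"} // hypothetical client pod names
	if err := checkClients(ctx, pods); err != nil {
		fmt.Println("check failed:", err)
	}
}
```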
Test plan
Use cases:
Test scenarios:
Tasks
Estimation
6d