jsdelivr / globalping

A global network of probes to run network tests like ping, traceroute and DNS resolve
https://globalping.io

Load testing #132

Closed jimaek closed 2 years ago

jimaek commented 2 years ago

We need to do some load testing on the API itself; it feels like there is a bottleneck somewhere. Someone was doing multiple POST and GET requests and got 500 errors, but that shouldn't have broken anything.

POST is supposed to be limited per IP to prevent excessive requests, but it should be able to handle hundreds of requests per minute without errors. GET should also be very lightweight and work in all cases.

The current server resources should have been enough to process all requests normally. We need to better understand what is wrong. Maybe #131 would be helpful too.

patrykcieszkowski commented 2 years ago

Artillery is handy for stress testing, but the IP rate limit might make it trickier. https://www.npmjs.com/package/artillery

edit: GET /v1/measurements/afiFkDrPa6HWJBJc https://i.imgur.com/XC1Gmm0.png
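
If the rate limit gets in the way, a plain Node script can also drive the load directly. A rough sketch (TypeScript, assuming the API runs locally on port 3000 and a minimal ping body; not the project's actual test code, requires Node 18+ for the global fetch):

// load-test.ts — naive load driver, run with e.g. `npx tsx load-test.ts`
const API = 'http://localhost:3000'; // assumed local API address
const RPS = 10;                      // target POST requests per second
const DURATION_S = 60;               // how long to keep sending

const body = JSON.stringify({
  target: 'google.com',
  type: 'ping',
  limit: 10,
  locations: [],
});

let sent = 0;
let failed = 0;

const timer = setInterval(() => {
  for (let i = 0; i < RPS; i++) {
    sent++;
    fetch(`${API}/v1/measurements`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body,
    })
      .then((res) => { if (!res.ok) failed++; })
      .catch(() => { failed++; });
  }
}, 1000);

setTimeout(() => {
  clearInterval(timer);
  console.log(`sent=${sent} failed=${failed}`);
}, DURATION_S * 1000);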

jimaek commented 2 years ago

Based on the logs, I am not sure whether the issue was related to GET or POST, so it's best to test everything to better understand what is happening.

patrykcieszkowski commented 2 years ago

but it's going to limit me to 100 queries

jimaek commented 2 years ago

We can deploy a version without any limits so you could test everything more easily from a single IP. But it's probably easier to debug if you do it on your localhost.

alexey-yarmosh commented 2 years ago

In the current configuration (Heroku Standard-2X, 1 GB RAM) the API is able to handle 1 measurement RPS with the following body:

{
    "target": "google.com",
    "type": "ping",
    "measurementOptions": {
        "packets": 16
    },
    "limit": 100,
    "locations": []
}

If we increase either the RPS or the limit field value, the API goes down within a few minutes due to lack of memory:

2022-10-06T09:59:02.367753+00:00 app[web.1]: 
2022-10-06T09:59:02.367764+00:00 app[web.1]: <--- Last few GCs --->
2022-10-06T09:59:02.367764+00:00 app[web.1]: 
2022-10-06T09:59:02.367767+00:00 app[web.1]: [62:0x65eaf40]  4871124 ms: Scavenge 501.1 (517.6) -> 500.6 (518.4) MB, 5.5 / 0.0 ms  (average mu = 0.316, current mu = 0.250) allocation failure 
2022-10-06T09:59:02.367767+00:00 app[web.1]: [62:0x65eaf40]  4871136 ms: Scavenge 502.0 (518.4) -> 501.4 (522.6) MB, 6.9 / 0.0 ms  (average mu = 0.316, current mu = 0.250) allocation failure 
2022-10-06T09:59:02.367769+00:00 app[web.1]: [62:0x65eaf40]  4872331 ms: Mark-sweep 504.5 (522.6) -> 502.8 (525.4) MB, 1180.7 / 6.3 ms  (average mu = 0.196, current mu = 0.047) allocation failure scavenge might not succeed
2022-10-06T09:59:02.367769+00:00 app[web.1]: 
2022-10-06T09:59:02.367811+00:00 app[web.1]: 
2022-10-06T09:59:02.367812+00:00 app[web.1]: <--- JS stacktrace --->
2022-10-06T09:59:02.367812+00:00 app[web.1]: 
2022-10-06T09:59:02.367838+00:00 app[web.1]: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
2022-10-06T09:59:02.369716+00:00 app[web.1]:  1: 0xb02930 node::Abort() [node]
2022-10-06T09:59:02.370965+00:00 app[web.1]:  2: 0xa18149 node::FatalError(char const*, char const*) [node]
2022-10-06T09:59:02.372231+00:00 app[web.1]:  3: 0xcdd16e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
2022-10-06T09:59:02.374806+00:00 app[web.1]:  4: 0xcdd4e7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
2022-10-06T09:59:02.375611+00:00 app[web.1]:  5: 0xe94b55  [node]
2022-10-06T09:59:02.377044+00:00 app[web.1]:  6: 0xea481d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
2022-10-06T09:59:02.379010+00:00 app[web.1]:  7: 0xea751e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
2022-10-06T09:59:02.380798+00:00 app[web.1]:  8: 0xe68792 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [node]
2022-10-06T09:59:02.382412+00:00 app[web.1]:  9: 0xe60da4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [node]
2022-10-06T09:59:02.383365+00:00 app[web.1]: 10: 0xe62ab0 v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString(int, v8::internal::AllocationType) [node]
2022-10-06T09:59:02.386010+00:00 app[web.1]: 11: 0x123e455 v8::internal::IncrementalStringBuilder::Extend() [node]
2022-10-06T09:59:02.386846+00:00 app[web.1]: 12: 0xf8d070 v8::internal::JsonStringifier::SerializeString(v8::internal::Handle<v8::internal::String>) [node]
2022-10-06T09:59:02.387674+00:00 app[web.1]: 13: 0xf9262d v8::internal::JsonStringifier::Result v8::internal::JsonStringifier::Serialize_<false>(v8::internal::Handle<v8::internal::Object>, bool, v8::internal::Handle<v8::internal::Object>) [node]
2022-10-06T09:59:02.390362+00:00 app[web.1]: 14: 0xf9447f v8::internal::JsonStringify(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [node]
2022-10-06T09:59:02.392596+00:00 app[web.1]: 15: 0xd5f3a7 v8::internal::Builtin_JsonStringify(int, unsigned long*, v8::internal::Isolate*) [node]
2022-10-06T09:59:02.393795+00:00 app[web.1]: 16: 0x15d5519  [node]
2022-10-06T09:59:02.471954+00:00 app[web.1]: Aborted

We should take into account that 1 measurement request starts a limit-sized batch of requests to/from probes, and each probe itself generates messages for ack, progress and result. So 1 RPS = 1 measurement * 100 probes * (1 ack msg + ~3 progress msgs + 1 result msg) ≈ 500 messages per second.

I'll profile the API locally to understand whether there are memory issues. By the way, we should consider decreasing the number of messages from probes, for example by ignoring progress messages. From my perspective there isn't much value in unfinished results, and at the same time we could reduce the load significantly.

jimaek commented 2 years ago

Real-time output is a critical feature and we can't remove it or make it look delayed.

MartinKolarik commented 2 years ago

At the very least, we'll need to consider throttling it as previously discussed via Skype (summary in https://github.com/jsdelivr/globalping/issues/176#issuecomment-1264737350).

jimaek commented 2 years ago

Lowering the frequency a bit is possible, but I want it to be clear that real-time must stay and look like real-time :)

alexey-yarmosh commented 2 years ago

Throttling may help a bit. At the same time, for 500 probes it is hard for me to imagine a use case where someone needs so many live updates simultaneously. Probably only for 1 probe, if I am building some kind of user interface (chat bot, etc.). Maybe we can find a balance depending on the requested number of probes, and either update them live or only at the end?

MartinKolarik commented 2 years ago

That's definitely an interesting POV. Sending realtime updates from 10 probes and only the final result from the remaining 490 would still make it look realtime and reduce the load considerably.
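
Something like this on the API side, roughly (just a sketch; the names are made up, not the real code):

// Sketch: stream live progress only from the first LIVE_PROBES probes;
// the rest report only their final result. Names are illustrative.
const LIVE_PROBES = 10;

type ProgressEvent = {
  measurementId: string;
  probeIndex: number; // position of the probe within the measurement
  rawOutput: string;
};

function shouldForwardProgress(event: ProgressEvent): boolean {
  return event.probeIndex < LIVE_PROBES;
}

function onProbeProgress(event: ProgressEvent, publish: (e: ProgressEvent) => void): void {
  if (shouldForwardProgress(event)) {
    publish(event); // forward to clients / store as today
  }
  // otherwise drop the intermediate update; the final result is still always published
}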

jimaek commented 2 years ago

It doesn't matter if it's 500 probes per request or 100 users running 5 tests at the same time. The system should be able to handle thousands of users running thousands of tests. Some will be small ones from the CLI, others will be script-based batch jobs...

MartinKolarik commented 2 years ago

We could also leave it up to the user to enable/disable real-time updates and set higher credit costs if enabled. As @alexey-yarmosh points out, in many use cases, they won't be needed/used anyway.

jimaek commented 2 years ago

No, that's just bad UX. The system must be real-time regardless of who is asking for what. And your solution would not be helpful at all in the most common cases, where we have 1000 users asking for 1-10 tests and real-time is critical.

alexey-yarmosh commented 2 years ago

Based on profiling and observing the app locally, I can say that the API's event loop is not very busy during high load. Most of the time it just waits for the delegated activities to finish. I am pretty sure that the main bottleneck of the API is not socket.io messages but the thousands of resulting redis writes. I think redis pipelining should help us a lot, so I'll try to implement a POC and analyze the changes. https://redis.io/docs/manual/pipelining/
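
Roughly what I mean, as a sketch with ioredis (the real store code is structured differently, and the key names here are made up):

import Redis from 'ioredis';

const redis = new Redis(); // assumes a reachable Redis instance

// Today: one round trip per probe result.
async function storeResultsOneByOne(measurementId: string, results: string[]): Promise<void> {
  for (const [i, result] of results.entries()) {
    await redis.hset(`measurement:${measurementId}`, `probe:${i}`, result);
  }
}

// With pipelining: the same commands, but batched into a single round trip.
async function storeResultsPipelined(measurementId: string, results: string[]): Promise<void> {
  const pipeline = redis.pipeline();

  for (const [i, result] of results.entries()) {
    pipeline.hset(`measurement:${measurementId}`, `probe:${i}`, result);
  }

  await pipeline.exec(); // all HSETs are sent and acknowledged together
}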

alexey-yarmosh commented 2 years ago
Results of perf measurements with redis cloud and redis hosted on vps (4 cores, 16GB RAM, 4GB redis): Measurement median T cloud redis (ms) max T cloud redis (ms) median T vps redis (ms) max T vps redis (ms) median T after migration (ms) max T after migration (ms) median T after clustering+fS (ms) max T after clustering+fS (ms)
10-probes-20-rps-240-duration 66 105 8.9 40
10-probes-60-rps-240-duration 74.4 173 7.9 64
10-probes-70-rps-240-duration 76 200 7 76
10-probes-80-rps-240-duration 407.5 3888 6 67
50-probes-2-rps-240-duration 74.4 109 18 58
50-probes-4-rps-240-duration 77.5 451 16.9 103 40.9 92
50-probes-6-rps-240-duration 837.3 4490 16.9 101 46.1 245
100-probes-1-rps-240-duration 82.3 97 26.8 90 45.2 124
100-probes-2-rps-240-duration 4583.6 22889 36.2 166 43.4 101
100-probes-5-rps-240-duration 23630.3 92401 37 489 117.9 793 172.5 519
100-probes-6-rps-240-duration 68.7 2291 267.8 2094 183.1 504
100-probes-7-rps-240-duration 6702.6 27297 497.8 46646 190.6 877
100-probes-8-rps-240-duration 9416.8 29018 2186.8 35850 194.4 1012
100-probes-10-rps-240-duration 7117 28483 202.4 1396
100-probes-20-rps-240-duration 20958.1 68083 7407.5 32402

Also the newrelic charts confirm that redis was the bottleneck. So our current effective throughput is ~6 RPS * 100 probes. Beyond that, delays from redis appear again, so I think it's worth experimenting with different redis hardware/configuration.

jimaek commented 2 years ago

Redis is single-threaded anyway. I have enabled some fake multi-threading options in my config, but I don't expect a lot of difference; you can try using it. Also, when the delays happen, is the RAM maxed out? Is it delayed because Redis tries to expire data first?
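
For reference, this is easy to check on the server with standard redis-cli commands while the load test is running:

redis-cli info memory | grep -E 'used_memory_human|maxmemory_human|maxmemory_policy'
redis-cli info stats | grep -E 'evicted_keys|expired_keys'
redis-cli --latency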

alexey-yarmosh commented 2 years ago

Also when delays happen, is the RAM maxed out?

I don't have that info. We can run the New Relic agent for redis to get it, or try with another redis memory size.

Redis is single threaded anyway

And do we have plans or thoughts about having several redis instances? E.g. we could have 1 instance for the APIs' pub/sub (common for all APIs) and then 3 others to store the measurements (divided between the APIs).

jimaek commented 2 years ago

Note for myself: the above tests are from a Hetzner Cloud shared-CPU VPS (CX41). Here is a single-threaded benchmark on the same server.

apt-get install sysbench
sysbench cpu run

CPU speed:
    events per second:   956.94

General statistics:
    total time:                          10.0008s
    total number of events:              9575

Latency (ms):
         min:                                    0.90
         avg:                                    1.04
         max:                                    6.31
         95th percentile:                        1.16
         sum:                                 9994.40

Threads fairness:
    events (avg/stddev):           9575.0000/0.00
    execution time (avg/stddev):   9.9944/0.00

Ideally, the prod Redis server needs to have better single-threaded performance.

MartinKolarik commented 2 years ago

And do we have plans or thoughts of having several redis instances? E.g. we can have 1 instance for APIs pub/sub (common for all APIs) and then 3 others to store the measurements (divided by APIs).

Definitely an option; I already mentioned it somewhere. If the redis CPU is the bottleneck, we can use various sharding strategies, even multiple redis instances for a single app instance.
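
A naive version of app-level sharding, just to illustrate (sketch with ioredis; the hosts and key scheme are made up):

import Redis from 'ioredis';

// One client per Redis shard; a dedicated pub/sub instance could live alongside these.
const shards = [
  new Redis({ host: 'redis-1' }),
  new Redis({ host: 'redis-2' }),
  new Redis({ host: 'redis-3' }),
];

// Deterministically map a measurement id to a shard.
function shardFor(measurementId: string): Redis {
  let hash = 0;
  for (const char of measurementId) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0;
  }
  return shards[hash % shards.length];
}

async function storeResult(measurementId: string, probeId: string, result: string): Promise<void> {
  await shardFor(measurementId).hset(`measurement:${measurementId}`, probeId, result);
}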

jimaek commented 2 years ago

I believe we can close this. The current bottleneck is Redis, and after the migration to new hardware it should be more stable. Later, after higher-priority tasks, we will come back to load testing again to scale the system further, either via Redis clustering or code changes.

jimaek commented 2 years ago

@alexey-yarmosh please do another benchmark of api.globalping.io similar to https://github.com/jsdelivr/globalping/issues/132#issuecomment-1305728479 to have the data of the new prod vs old in case we need it in the future

alexey-yarmosh commented 2 years ago

Updated the table with the values after migration: https://github.com/jsdelivr/globalping/issues/132#issuecomment-1305728479