Artillery is handy for stress testing, but the IP rate-limit might make it more tricky. https://www.npmjs.com/package/artillery
edit:
GET /v1/measurements/afiFkDrPa6HWJBJc
https://i.imgur.com/XC1Gmm0.png
Based on the logs I am not sure if the issue was related to GET or POST, so it's best to test everything to better understand what is happening.
but it's going to limit me to 100 queries
We can deploy a version without any limits so you could test everything more easily from a single IP. But it's probably easier to debug if you do it on your localhost.
In the current configuration (Heroku Standard-2X, 1 GB RAM), the API is able to handle 1 measurement RPS with a body like this:
{
    "target": "google.com",
    "type": "ping",
    "measurementOptions": {
        "packets": 16
    },
    "limit": 100,
    "locations": []
}
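For anyone reproducing this, a minimal local load driver could look like the sketch below. It assumes Node 18+ (global `fetch`), the API listening on `http://localhost:3000`, and no IP rate limit in the way; adjust `TARGET_RPS` and the body to match the scenario being tested.

```ts
// Hypothetical local load driver: POSTs the measurement body above at a fixed rate.
// Assumes Node 18+ (global fetch) and the API running on localhost:3000 without rate limits.
const TARGET_RPS = 1;

const body = JSON.stringify({
  target: 'google.com',
  type: 'ping',
  measurementOptions: { packets: 16 },
  limit: 100,
  locations: [],
});

setInterval(async () => {
  try {
    const res = await fetch('http://localhost:3000/v1/measurements', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body,
    });
    console.log(new Date().toISOString(), res.status);
  } catch (err) {
    console.error('request failed:', err);
  }
}, 1000 / TARGET_RPS);
```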
If we increase either the RPS or the `limit` field value, the API goes down within a few minutes due to lack of memory:
2022-10-06T09:59:02.367753+00:00 app[web.1]:
2022-10-06T09:59:02.367764+00:00 app[web.1]: <--- Last few GCs --->
2022-10-06T09:59:02.367764+00:00 app[web.1]:
2022-10-06T09:59:02.367767+00:00 app[web.1]: [62:0x65eaf40] 4871124 ms: Scavenge 501.1 (517.6) -> 500.6 (518.4) MB, 5.5 / 0.0 ms (average mu = 0.316, current mu = 0.250) allocation failure
2022-10-06T09:59:02.367767+00:00 app[web.1]: [62:0x65eaf40] 4871136 ms: Scavenge 502.0 (518.4) -> 501.4 (522.6) MB, 6.9 / 0.0 ms (average mu = 0.316, current mu = 0.250) allocation failure
2022-10-06T09:59:02.367769+00:00 app[web.1]: [62:0x65eaf40] 4872331 ms: Mark-sweep 504.5 (522.6) -> 502.8 (525.4) MB, 1180.7 / 6.3 ms (average mu = 0.196, current mu = 0.047) allocation failure scavenge might not succeed
2022-10-06T09:59:02.367769+00:00 app[web.1]:
2022-10-06T09:59:02.367811+00:00 app[web.1]:
2022-10-06T09:59:02.367812+00:00 app[web.1]: <--- JS stacktrace --->
2022-10-06T09:59:02.367812+00:00 app[web.1]:
2022-10-06T09:59:02.367838+00:00 app[web.1]: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
2022-10-06T09:59:02.369716+00:00 app[web.1]: 1: 0xb02930 node::Abort() [node]
2022-10-06T09:59:02.370965+00:00 app[web.1]: 2: 0xa18149 node::FatalError(char const*, char const*) [node]
2022-10-06T09:59:02.372231+00:00 app[web.1]: 3: 0xcdd16e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
2022-10-06T09:59:02.374806+00:00 app[web.1]: 4: 0xcdd4e7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
2022-10-06T09:59:02.375611+00:00 app[web.1]: 5: 0xe94b55 [node]
2022-10-06T09:59:02.377044+00:00 app[web.1]: 6: 0xea481d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
2022-10-06T09:59:02.379010+00:00 app[web.1]: 7: 0xea751e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
2022-10-06T09:59:02.380798+00:00 app[web.1]: 8: 0xe68792 v8::internal::Factory::AllocateRaw(int, v8::internal::AllocationType, v8::internal::AllocationAlignment) [node]
2022-10-06T09:59:02.382412+00:00 app[web.1]: 9: 0xe60da4 v8::internal::FactoryBase<v8::internal::Factory>::AllocateRawWithImmortalMap(int, v8::internal::AllocationType, v8::internal::Map, v8::internal::AllocationAlignment) [node]
2022-10-06T09:59:02.383365+00:00 app[web.1]: 10: 0xe62ab0 v8::internal::FactoryBase<v8::internal::Factory>::NewRawOneByteString(int, v8::internal::AllocationType) [node]
2022-10-06T09:59:02.386010+00:00 app[web.1]: 11: 0x123e455 v8::internal::IncrementalStringBuilder::Extend() [node]
2022-10-06T09:59:02.386846+00:00 app[web.1]: 12: 0xf8d070 v8::internal::JsonStringifier::SerializeString(v8::internal::Handle<v8::internal::String>) [node]
2022-10-06T09:59:02.387674+00:00 app[web.1]: 13: 0xf9262d v8::internal::JsonStringifier::Result v8::internal::JsonStringifier::Serialize_<false>(v8::internal::Handle<v8::internal::Object>, bool, v8::internal::Handle<v8::internal::Object>) [node]
2022-10-06T09:59:02.390362+00:00 app[web.1]: 14: 0xf9447f v8::internal::JsonStringify(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>) [node]
2022-10-06T09:59:02.392596+00:00 app[web.1]: 15: 0xd5f3a7 v8::internal::Builtin_JsonStringify(int, unsigned long*, v8::internal::Isolate*) [node]
2022-10-06T09:59:02.393795+00:00 app[web.1]: 16: 0x15d5519 [node]
2022-10-06T09:59:02.471954+00:00 app[web.1]: Aborted
We should take into account that 1 measurement request starts a `limit` number of requests to/from probes, and each of those generates `ack`, `progress`, and `result` messages. So 1 RPS = 1 measurement * 100 probes * (1 `ack` msg + ~3 `progress` msgs + 1 `result` msg), i.e. roughly 500 probe messages per second for a single request with `limit: 100`.
I'll profile the API locally to understand if there are memory issues.
By the way, we should consider decreasing the number of messages from probes. For example, we could ignore `progress` messages. From my perspective there isn't much value in unfinished results, and at the same time we could reduce the load significantly.
Real-time output is a critical feature and we can't remove it or make it look delayed.
At the very least, we'll need to consider throttling it as previously discussed via Skype (summary in https://github.com/jsdelivr/globalping/issues/176#issuecomment-1264737350).
Lowering the frequency a bit is possible, but I want it to be clear that real-time must stay and look like real-time :)
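To make "lowering the frequency a bit" concrete, the throttling discussed above could be a per-probe cap on `progress` events while final results always pass through immediately. A rough sketch only; the event names, payload shapes, and 500 ms window are assumptions, not the API's actual implementation:

```ts
// Illustrative throttle: forward at most one progress update per probe per window,
// but never delay the final result. Names and shapes here are hypothetical.
const WINDOW_MS = 500;
const lastSent = new Map<string, number>(); // key: `${measurementId}:${probeId}`

export function onProgress(measurementId: string, probeId: string, payload: unknown, emit: (p: unknown) => void): void {
  const key = `${measurementId}:${probeId}`;
  const now = Date.now();

  if (now - (lastSent.get(key) ?? 0) >= WINDOW_MS) {
    lastSent.set(key, now);
    emit(payload); // still looks real-time, just capped at ~2 updates/s per probe
  }
}

export function onResult(measurementId: string, probeId: string, payload: unknown, emit: (p: unknown) => void): void {
  lastSent.delete(`${measurementId}:${probeId}`);
  emit(payload); // final results always go through immediately
}
```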
Throttling may help a bit. At the same time, for 500 probes it is hard for me to imagine a use case where someone needs so many live updates simultaneously. Probably only for 1 probe, if I am building some kind of user interface (a chat bot, etc.). Maybe we can find a balance based on the requested number of probes, and either update them live or only at the end?
That's definitely an interesting POV. Sending real-time updates from 10 probes and only the final result from the remaining 490 would still make it look real-time and reduce the load considerably.
It doesn't matter if it's 500 probes per request or 100 users running 5 tests at the same time. The system should be able to handle thousands of users running thousands of tests. Some will be small ones from the CLI, others will be script-based batch jobs...
We could also leave it up to the user to enable/disable real-time updates and set higher credit costs if enabled. As @alexey-yarmosh points out, in many use cases, they won't be needed/used anyway.
No, that's just bad UX. The system must be real-time regardless of who is asking for what. And your solution would not be helpful at all in the most common cases, where we have 1000 users asking for 1-10 tests each and real-time is critical.
Based on profiling and observing the app locally, I can say that the API's event loop is not very busy during high load. Most of the time it just waits for the work it has delegated to finish. I am pretty sure that the main bottleneck of the API is not socket.io messages but the thousands of subsequent Redis writes. I think Redis pipelining should help us a lot, so I'll try to implement a POC and analyze the changes. https://redis.io/docs/manual/pipelining/
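To illustrate the pipelining idea: the per-probe writes for a single measurement could be batched into one round trip instead of one write per probe. A minimal sketch using ioredis; the actual client, key names, and TTL used by the API may differ:

```ts
import Redis from 'ioredis';

const redis = new Redis(); // defaults to 127.0.0.1:6379

// Hypothetical example: store all per-probe results of one measurement
// with a single network round trip instead of one HSET per probe.
async function storeResults(measurementId: string, resultsByProbe: Record<string, string>): Promise<void> {
  const pipeline = redis.pipeline();

  for (const [probeId, result] of Object.entries(resultsByProbe)) {
    pipeline.hset(`measurement:${measurementId}`, probeId, result);
  }

  pipeline.expire(`measurement:${measurementId}`, 24 * 60 * 60); // TTL is an arbitrary example value

  await pipeline.exec(); // all queued commands are sent in one batch
}
```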
Results of perf measurements with Redis Cloud and Redis hosted on a VPS (4 cores, 16 GB RAM, 4 GB Redis):

| Measurement | median T cloud redis (ms) | max T cloud redis (ms) | median T vps redis (ms) | max T vps redis (ms) | median T after migration (ms) | max T after migration (ms) | median T after clustering+fS (ms) | max T after clustering+fS (ms) |
|---|---|---|---|---|---|---|---|---|
| 10-probes-20-rps-240-duration | 66 | 105 | 8.9 | 40 | | | | |
| 10-probes-60-rps-240-duration | 74.4 | 173 | 7.9 | 64 | | | | |
| 10-probes-70-rps-240-duration | 76 | 200 | 7 | 76 | | | | |
| 10-probes-80-rps-240-duration | 407.5 | 3888 | 6 | 67 | | | | |
| 50-probes-2-rps-240-duration | 74.4 | 109 | 18 | 58 | | | | |
| 50-probes-4-rps-240-duration | 77.5 | 451 | 16.9 | 103 | 40.9 | 92 | | |
| 50-probes-6-rps-240-duration | 837.3 | 4490 | 16.9 | 101 | 46.1 | 245 | | |
| 100-probes-1-rps-240-duration | 82.3 | 97 | 26.8 | 90 | 45.2 | 124 | | |
| 100-probes-2-rps-240-duration | 4583.6 | 22889 | 36.2 | 166 | 43.4 | 101 | | |
| 100-probes-5-rps-240-duration | 23630.3 | 92401 | 37 | 489 | 117.9 | 793 | 172.5 | 519 |
| 100-probes-6-rps-240-duration | 68.7 | 2291 | 267.8 | 2094 | 183.1 | 504 | | |
| 100-probes-7-rps-240-duration | 6702.6 | 27297 | 497.8 | 46646 | 190.6 | 877 | | |
| 100-probes-8-rps-240-duration | 9416.8 | 29018 | 2186.8 | 35850 | 194.4 | 1012 | | |
| 100-probes-10-rps-240-duration | 7117 | 28483 | 202.4 | 1396 | | | | |
| 100-probes-20-rps-240-duration | 20958.1 | 68083 | 7407.5 | 32402 | | | | |
Also, the New Relic charts confirm that Redis was the bottleneck. So our current effective throughput is ~6 RPS * 100 probes. Beyond that, delays from Redis start happening again, so I think it's worth experimenting with different Redis hardware/configuration.
Redis is single-threaded anyway. I have enabled some fake multi-threading options in my config, but I don't expect a lot of difference; you can try using it. Also, when the delays happen, is the RAM maxed out? Is it delayed because Redis tries to expire data first?
> Also, when the delays happen, is the RAM maxed out?
I don't have that info. We can run the New Relic agent for Redis to get it, or try a different Redis memory size.
> Redis is single-threaded anyway
And do we have any plans or thoughts about running several Redis instances? E.g. we could have 1 instance for the APIs' pub/sub (common to all APIs) and then 3 others to store the measurements (divided between the APIs).
Note to self: the above tests were run on a Hetzner Cloud shared-CPU VPS (CX41). Here is a single-threaded benchmark on the same server.
apt-get install sysbench
sysbench cpu run
CPU speed:
    events per second: 956.94

General statistics:
    total time: 10.0008s
    total number of events: 9575

Latency (ms):
    min: 0.90
    avg: 1.04
    max: 6.31
    95th percentile: 1.16
    sum: 9994.40

Threads fairness:
    events (avg/stddev): 9575.0000/0.00
    execution time (avg/stddev): 9.9944/0.00
Ideally, the prod Redis server needs to have better single-threaded performance.
> And do we have any plans or thoughts about running several Redis instances? E.g. we could have 1 instance for the APIs' pub/sub (common to all APIs) and then 3 others to store the measurements (divided between the APIs).
Definitely an option, I already mentioned it somewhere. If the Redis CPU is the bottleneck, we can use various sharding strategies, even multiple Redis instances for a single app instance.
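As one example of such a strategy, measurement storage could be split across several instances by hashing the measurement id, while pub/sub stays on its own dedicated instance. A sketch only; the host names and key scheme are made up:

```ts
import { createHash } from 'node:crypto';
import Redis from 'ioredis';

// Hypothetical shard list; pub/sub could live on a separate dedicated instance.
const shards = [
  new Redis({ host: 'redis-1', port: 6379 }),
  new Redis({ host: 'redis-2', port: 6379 }),
  new Redis({ host: 'redis-3', port: 6379 }),
];

// Pick a shard deterministically from the measurement id,
// so all reads and writes for one measurement hit the same instance.
function shardFor(measurementId: string): Redis {
  const hash = createHash('sha1').update(measurementId).digest();
  return shards[hash.readUInt32BE(0) % shards.length];
}

async function getMeasurement(measurementId: string): Promise<Record<string, string>> {
  return shardFor(measurementId).hgetall(`measurement:${measurementId}`);
}
```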
I believe we can close this. The current bottleneck is Redis, and after the migration to new hardware it should be more stable. Later, after higher-priority tasks, we will come back to load testing again to scale the system further, either via Redis clustering or code changes.
@alexey-yarmosh please do another benchmark of api.globalping.io similar to https://github.com/jsdelivr/globalping/issues/132#issuecomment-1305728479, so we have data for the new prod vs. the old one in case we need it in the future.
Updated the table with the values after migration: https://github.com/jsdelivr/globalping/issues/132#issuecomment-1305728479
We need to do some load testing on the API itself; it feels like there is a bottleneck somewhere. Someone was doing multiple POST and GET requests and got 500 errors, but that shouldn't have broken anything.
POST is supposed to be limited per IP to prevent excessive requests, but it should be able to handle hundreds of requests per minute without errors. GET should also be very lightweight and work in all cases.
The current server resources should have been enough to process all requests normally. We need to better understand what is wrong. Maybe #131 would be helpful too.