Closed: tzssangglass closed this issue 2 years ago.
Thanks for your investigation and for a detailed description of the issue! Looking at the flamegraph, it seems that most CPU time is spent in `key:match`, which is something we might be able to improve by replacing it with `ngx.re.match`, as was discussed in #124 (I think @unbeatablekb intended to send a PR for that, but never did).
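For illustration, a rough sketch of what such a replacement could look like, with a made-up key and pattern (the real code in prometheus.lua differs); the `jo` options enable PCRE JIT and the compiled-regex cache:

```lua
-- Rough sketch with a made-up key and pattern; not the library's actual code.
local key = 'request_latency_bucket{host="example.com",le="00.100"}'

-- Lua pattern version (what a key:match call looks like):
local le = key:match('le="([^"]*)"')

-- ngx.re.match version (only available inside OpenResty contexts); "jo"
-- enables PCRE JIT compilation and caches the compiled regex.
local m, err = ngx.re.match(key, [[le="([^"]*)"]], "jo")
local le_re = m and m[1]
```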
I am not super familiar with yielding points. Is this a way to allow the nginx worker to make progress on other in-flight requests while the CPU-intensive code is running? If so, I think adding one as part of the `metric_data` loop makes a lot of sense and seems like a great way to minimize the latency impact on other requests "assigned" to the same worker.
I will be happy to review your PR for this.
This sounds like a very good idea. If I understand the docs correctly, the call to `ngx.sleep(0)` should allow other requests to be processed while the metrics are being collected. This might make the collecting request slightly slower, but should greatly reduce the impact on the other requests handled by the same worker.
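For illustration, a minimal sketch of such a yielding point, with hypothetical names and an assumed batch size rather than the actual `metric_data()` code:

```lua
-- Minimal sketch with hypothetical names (render_one is made up); not the
-- actual metric_data() implementation. ngx.sleep(0) yields back to the nginx
-- event loop, so the worker can serve other in-flight requests before this
-- CPU-heavy loop continues.
local YIELD_EVERY = 200  -- assumed batch size; would need to be tuned

local function render_metrics(keys, render_one)
  local output = {}
  for i, key in ipairs(keys) do
    output[#output + 1] = render_one(key)
    if i % YIELD_EVERY == 0 then
      ngx.sleep(0)  -- periodic yielding point
    end
  end
  return table.concat(output, "\n")
end
```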
> which is something that we might be able to improve by replacing it with `ngx.re.match`, as was discussed in #124 (I think @unbeatablekb intended to send a PR for that, but never did).

You can assign it to me, I have time for that.
> I am not super familiar with yielding points. Is this a way to allow the nginx worker to make progress on other in-flight requests while the CPU-intensive code is running?

Here are some references (in Chinese, from the author of this patch): https://groups.google.com/g/openresty/c/2UXtJHvSpXM/m/ZyLdtKwBAAAJ

And here is the patch itself: https://github.com/openresty/openresty/blob/master/patches/nginx-1.11.2-delayed_posted_events.patch
ok, I will submit PRs.
I tried to optimize it with `ngx.sleep(0)`, but it didn't work well: the long-tail requests were still there after the optimization and the latency was even higher. I also optimized the `string.format` calls, with very little improvement.
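For reference, this is roughly the kind of change I mean by optimizing `string.format` (a sketch with made-up names, not the actual patch I tested):

```lua
-- Hedged sketch with made-up names; not the actual change that was tested.
-- Plain concatenation plus a single table.concat at the end is usually cheaper
-- than calling string.format once per sample inside a hot loop.
local function render_samples(samples)
  local lines = {}
  for i, s in ipairs(samples) do
    -- equivalent to string.format("%s %s", s.name, s.value)
    lines[i] = s.name .. " " .. s.value
  end
  return table.concat(lines, "\n") .. "\n"
end
```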
hi @knyar, do you have any plans to release a new version?
I've just shipped a new release 0.20220127
> I've just shipped a new release 0.20220127

Many thanks!
Hi, thank you for this great library. We have created a prometheus plugin in Apache APISIX based on this library.
We found that long-tail requests appear when the plugin is in use, and after verification we found that this phenomenon is related to the `metric_data` function of this library. Original analysis reference: https://github.com/apache/apisix/issues/5755

Here are some tests I did on `metric_data`. The load was generated with wrk2:

wrk -t2 -c100 -d360s -R1000 --u_latency http://127.0.0.1:9080/hello
Test 1: do nothing (baseline).
flamegraph: (image)
the result for wrk2: (image)
The `fix_histogram_bucket_labels` function takes up too much CPU time.

Test 2: comment out `fix_histogram_bucket_labels`.
flamegraph: (image)
the result for wrk2: (image)
There are some changes, but not as much as I would like.
Test 3: comment out `fix_histogram_bucket_labels` and add `ngx.sleep(0)`.
I added `ngx.sleep(0)` at the beginning of the loop in `metric_data` (here: https://github.com/knyar/nginx-lua-prometheus/blob/master/prometheus.lua#L821) to introduce periodic yielding points, and also added `ngx.sleep(0)` in a few places in the prometheus plugin for APISIX.
flamegraph: (image)
the result for wrk2: (image)
It now appears that the maximum request latency has been effectively improved.
I would like to discuss optimizing the `fix_histogram_bucket_labels` function and introducing periodic yielding points, and after that discussion I would like to submit a PR with the optimizations.