knyar / nginx-lua-prometheus

Prometheus metric library for Nginx written in Lua

optimize the long-tail request phenomenon #130

Closed tzssangglass closed 2 years ago

tzssangglass commented 2 years ago

Hi, thank you for this great library. We have built the prometheus plugin in Apache APISIX on top of this library.

In production use we observed long-tail request latency, and after investigation we found that this phenomenon is related to the metric_data function of this library. Original analysis: https://github.com/apache/apisix/issues/5755

Here are some tests I did

  1. do nothing (baseline)

    flamegraph: original_prometheus

    the result for wrk2 is

      Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
       50.000%    1.13ms
       75.000%    2.56ms
       90.000%    4.82ms
       99.000%   14.70ms
       99.900%   27.95ms
       99.990%   74.75ms
       99.999%  102.78ms
      100.000%  102.78ms

    in the flamegraph, the fix_histogram_bucket_labels function accounts for a large share of the CPU time

  2. comment out fix_histogram_bucket_labels

    flamegraph: rm_fix_histogram_bucket_labels

    the result for wrk2 is

      Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
       50.000%  420.00us
       75.000%  782.00us
       90.000%    1.80ms
       99.000%    8.57ms
       99.900%   27.61ms
       99.990%   57.34ms
       99.999%   85.57ms
      100.000%   95.68ms

    there is some improvement, but not as much as I would like.

  3. comment out fix_histogram_bucket_labels and add ngx.sleep(0)

    I added ngx.sleep(0) at the beginning of the loop in metric_data (here: https://github.com/knyar/nginx-lua-prometheus/blob/master/prometheus.lua#L821) to introduce periodic yielding points, and also added ngx.sleep(0) in a few places in the prometheus plugin for APISIX (a rough sketch of the idea is at the end of this comment).

    flamegraph: flame

    the result for wrk2 is

      Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
       50.000%  458.00us
       75.000%    0.93ms
       90.000%    3.25ms
       99.000%   20.53ms
       99.900%   38.05ms
       99.990%   51.97ms
       99.999%   59.39ms
      100.000%   66.24ms

    it now appears that the maximum request latency has been effectively reduced.


I would like to discuss optimizing the fix_histogram_bucket_labels function and introducing periodic yielding points. Depending on the outcome of the discussion, I would like to submit a PR with the optimization.
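For concreteness, here is a rough sketch of what I mean by a periodic yielding point. It is a simplified illustration of a loop over the shared-dictionary keys, not the library's actual metric_data code; dict stands for the lua_shared_dict holding the metrics, and yield_every is a made-up tuning constant.

    -- Simplified illustration of a periodic yielding point; not the library's code.
    local yield_every = 200  -- hypothetical tuning constant

    local function collect_with_yields(dict)
      local output = {}
      local keys = dict:get_keys(0)  -- 0 means "return all keys"
      for i, key in ipairs(keys) do
        if i % yield_every == 0 then
          -- ngx.sleep(0) yields back to the nginx event loop, so the same
          -- worker can serve other in-flight requests before we continue.
          ngx.sleep(0)
        end
        local value = dict:get(key)
        if value then
          output[#output + 1] = key .. " " .. tostring(value)
        end
      end
      return output
    end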

knyar commented 2 years ago

Thanks for your investigation and for a detailed description of the issue! Looking at the flamegraph it seems that most CPU time is getting spent in key:match, which is something that we might be able to improve by replacing it with ngx.re.match as was discussed in #124 (I think @unbeatablekb intended to send a PR for that, but never did).
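To illustrate the idea from #124: the Lua pattern match would be replaced with PCRE via ngx.re.match, which can be JIT-compiled and cached. The key shape and pattern below are simplified for illustration, not the library's exact code.

    -- Rough illustration of swapping a Lua pattern match for ngx.re.match;
    -- the key shape and the pattern are simplified, not the exact code.
    local key = 'request_latency_bucket{host="example.com",le="00.100"}'

    -- Lua pattern matching, re-interpreted on every call:
    local prefix, le, suffix = key:match('(.*[,{]le=")(.*)(".*)')

    -- PCRE via ngx.re.match; "j" enables PCRE JIT and "o" caches the
    -- compiled regex, which is what should make a hot loop cheaper:
    local m = ngx.re.match(key, [[(.*[,{]le=")(.*)(".*)]], "jo")
    if m then
      prefix, le, suffix = m[1], m[2], m[3]
    end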

I am not super familiar with yielding points. Is this a way to allow the nginx worker to make progress with other in-flight requests while the CPU-intensive code is running? If so, I think adding one as part of the metric_data loop will make a lot of sense and seems like a great way to minimize latency impact on other requests "assigned" to the same worker.

I will be happy to review your PR for this.

dolik-rce commented 2 years ago

This sounds like a very good idea. If I understand the docs correctly, the call to ngx.sleep(0) should allow other requests to be processed while the metrics are collected. This might make the collecting request slightly slower, but it would greatly reduce the impact on the other requests handled by the same worker.

tzssangglass commented 2 years ago

which is something that we might be able to improve by replacing it with ngx.re.match as was discussed in #124 (I think @unbeatablekb intended to send a PR for that, but never did).

You can assign it to me, I have time for that.

I am not super familiar with yielding points. Is this a way to allow the nginx worker to make progress with other in-flight requests while the CPU-intensive code is running

Here are some references (but in Chinese, from the author of this patch): https://groups.google.com/g/openresty/c/2UXtJHvSpXM/m/ZyLdtKwBAAAJ

Here is the patch: https://github.com/openresty/openresty/blob/master/patches/nginx-1.11.2-delayed_posted_events.patch

ok, I will submit PRs.

tzssangglass commented 2 years ago

I tried to optimize it with ngx.sleep(0), but it didn't work well: the long-tail requests still existed after the optimization, and latency was higher. I also optimized string.format, with very little improvement (a hypothetical sketch of that kind of change is below).
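For reference, the string.format change was this kind of micro-optimization (a hypothetical illustration, not the actual patch): building each sample with plain concatenation and assembling the payload once with table.concat, to avoid parsing a format string per sample.

    -- Hypothetical sketch of the string.format micro-optimization; not the actual patch.
    local samples = { { "metric_a", 1 }, { "metric_b", 2.5 } }

    -- Before: one string.format call per sample.
    local before = {}
    for i, s in ipairs(samples) do
      before[i] = string.format("%s %s\n", s[1], tostring(s[2]))
    end

    -- After: plain concatenation, assembled once with table.concat.
    local after = {}
    for i, s in ipairs(samples) do
      after[i] = s[1] .. " " .. tostring(s[2]) .. "\n"
    end
    local payload = table.concat(after)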

tzssangglass commented 2 years ago

hi @knyar, do you have any plans to release a new version?

knyar commented 2 years ago

I've just shipped a new release 0.20220127

tzssangglass commented 2 years ago

I've just shipped a new release 0.20220127

many thanks!