knyar / nginx-lua-prometheus

Prometheus metric library for Nginx written in Lua

optimize the long-tail request phenomenon #130

Closed tzssangglass closed 2 years ago

tzssangglass commented 2 years ago

Hi, thank you for this great library. We have built the prometheus plugin in Apache APISIX on top of this library.

In production use we observed long-tail request latency, and after investigation we found that this phenomenon is related to the metric_data function of this library. Original analysis: https://github.com/apache/apisix/issues/5755

Here are some tests I did

  1. do nothing (baseline)

    flamegraph: original_prometheus

    the result for wrk2 is

      Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
       50.000%    1.13ms
       75.000%    2.56ms
       90.000%    4.82ms
       99.000%   14.70ms
       99.900%   27.95ms
       99.990%   74.75ms
       99.999%  102.78ms
      100.000%  102.78ms

    in the flamegraph, the fix_histogram_bucket_labels function accounts for a large share of the CPU time

  2. comment out fix_histogram_bucket_labels

    flamegraph: rm_fix_histogram_bucket_labels

    the result for wrk2 is

      Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
       50.000%  420.00us
       75.000%  782.00us
       90.000%    1.80ms
       99.000%    8.57ms
       99.900%   27.61ms
       99.990%   57.34ms
       99.999%   85.57ms
      100.000%   95.68ms

    there is some improvement, but not as much as I would like.

  3. comment out fix_histogram_bucket_labels and add ngx.sleep(0)

    I added ngx.sleep(0) at the beginning of the loop in metric_data (here: https://github.com/knyar/nginx-lua-prometheus/blob/master/prometheus.lua#L821) to introduce periodic yielding points, and also added ngx.sleep(0) in a few places in the prometheus plugin for APISIX (a rough sketch of the idea is at the end of this comment).

    flamegraph: flame

    the result for wrk2 is

      Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
       50.000%  458.00us
       75.000%    0.93ms
       90.000%    3.25ms
       99.000%   20.53ms
       99.900%   38.05ms
       99.990%   51.97ms
       99.999%   59.39ms
      100.000%   66.24ms

    it now appears that the maximum request latency has been effectively reduced.


I would like to discuss optimizing the fix_histogram_bucket_labels function and introducing periodic yielding points. Depending on the outcome of the discussion, I would like to submit a PR with the optimization.
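For concreteness, here is a rough sketch of what I mean by a periodic yielding point. It is a simplified illustration of a loop over the shared-dictionary keys, not the library's actual metric_data code; dict stands for the lua_shared_dict holding the metrics, and yield_every is a made-up tuning constant.

    -- Simplified illustration of a periodic yielding point; not the library's code.
    local yield_every = 200  -- hypothetical tuning constant

    local function collect_with_yields(dict)
      local output = {}
      local keys = dict:get_keys(0)  -- 0 means "return all keys"
      for i, key in ipairs(keys) do
        if i % yield_every == 0 then
          -- ngx.sleep(0) yields back to the nginx event loop, so the same
          -- worker can serve other in-flight requests before we continue.
          ngx.sleep(0)
        end
        local value = dict:get(key)
        if value then
          output[#output + 1] = key .. " " .. tostring(value)
        end
      end
      return output
    end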

knyar commented 2 years ago

Thanks for your investigation and for a detailed description of the issue! Looking at the flamegraph it seems that most CPU time is getting spent in key:match, which is something that we might be able to improve by replacing it with ngx.re.match as was discussed in #124 (I think @unbeatablekb intended to send a PR for that, but never did).
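To illustrate the idea from #124: the Lua pattern match would be replaced with PCRE via ngx.re.match, which can be JIT-compiled and cached. The key shape and pattern below are simplified for illustration, not the library's exact code.

    -- Rough illustration of swapping a Lua pattern match for ngx.re.match;
    -- the key shape and the pattern are simplified, not the exact code.
    local key = 'request_latency_bucket{host="example.com",le="00.100"}'

    -- Lua pattern matching, re-interpreted on every call:
    local prefix, le, suffix = key:match('(.*[,{]le=")(.*)(".*)')

    -- PCRE via ngx.re.match; "j" enables PCRE JIT and "o" caches the
    -- compiled regex, which is what should make a hot loop cheaper:
    local m = ngx.re.match(key, [[(.*[,{]le=")(.*)(".*)]], "jo")
    if m then
      prefix, le, suffix = m[1], m[2], m[3]
    end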

I am not super familiar with yielding points. Is this a way to allow the nginx worker to make progress with other in-flight requests while the CPU-intensive code is running? If so, I think adding one as part of the metric_data loop will make a lot of sense and seems like a great way to minimize latency impact on other requests "assigned" to the same worker.

I will be happy to review your PR for this.

dolik-rce commented 2 years ago

This sounds like a very good idea. If I understand the docs correctly, the call to ngx.sleep(0) should allow other requests to be processed while the metrics are collected. This might make the collecting request slightly slower, but it would greatly reduce the impact on the other requests handled by the same worker.

tzssangglass commented 2 years ago

which is something that we might be able to improve by replacing it with ngx.re.match as was discussed in #124 (I think @unbeatablekb intended to send a PR for that, but never did).

You can assign it to me, I have time for that.

I am not super familiar with yielding points. Is this a way to allow the nginx worker to make progress with other in-flight requests while the CPU-intensive code is running

Here are some references (but in Chinese, from the author of this patch): https://groups.google.com/g/openresty/c/2UXtJHvSpXM/m/ZyLdtKwBAAAJ

Here is the patch: https://github.com/openresty/openresty/blob/master/patches/nginx-1.11.2-delayed_posted_events.patch

ok, I will submit PRs.

tzssangglass commented 2 years ago

I tried to optimize it with ngx.sleep(0), but it didn't work well: the long-tail requests still existed after the optimization, and latency was higher. I also optimized string.format, with very little improvement (a hypothetical sketch of that kind of change is below).
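For reference, the string.format change was this kind of micro-optimization (a hypothetical illustration, not the actual patch): building each sample with plain concatenation and assembling the payload once with table.concat, to avoid parsing a format string per sample.

    -- Hypothetical sketch of the string.format micro-optimization; not the actual patch.
    local samples = { { "metric_a", 1 }, { "metric_b", 2.5 } }

    -- Before: one string.format call per sample.
    local before = {}
    for i, s in ipairs(samples) do
      before[i] = string.format("%s %s\n", s[1], tostring(s[2]))
    end

    -- After: plain concatenation, assembled once with table.concat.
    local after = {}
    for i, s in ipairs(samples) do
      after[i] = s[1] .. " " .. tostring(s[2]) .. "\n"
    end
    local payload = table.concat(after)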

tzssangglass commented 2 years ago

hi @knyar, do you have any plans to release a new version?

knyar commented 2 years ago

I've just shipped a new release 0.20220127

tzssangglass commented 2 years ago

I've just shipped a new release 0.20220127

many thanks!