rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/
Other
12.33k stars 3.92k forks source link

`badarith` errors when collecting metrics #12815

Open mkuratczyk opened 4 days ago

mkuratczyk commented 4 days ago

Describe the bug

I don't have the exact repro steps. I think this is a race condition when a queue is deleted while the metrics are collected and the queue is found by the collector but then returns empty values ('') instead of the expected numbers. I've seen this occur in 3 different places:

crasher:
  initial call: rabbit_mgmt_metrics_collector:init/1
  pid: <0.699.0>
  registered_name: connection_coarse_metrics_metrics_collector
  exception error: an error occurred when evaluating an arithmetic expression
    in operator  -/2
       called as '' - 19495760921
    in call from rabbit_mgmt_metrics_collector:difference/2 (rabbit_mgmt_metrics_collector.erl, line 649)
    in call from rabbit_mgmt_metrics_collector:aggregate_entry/4 (rabbit_mgmt_metrics_collector.erl, line 234)
    in call from lists:foldl/3 (lists.erl, line 1594)
    in call from ets:do_foldl/4 (ets.erl, line 656)
    in call from ets:foldl/3 (ets.erl, line 645)
    in call from rabbit_mgmt_metrics_collector:aggregate_metrics/2 (rabbit_mgmt_metrics_collector.erl, line 172)
    in call from rabbit_mgmt_metrics_collector:handle_info/2 (rabbit_mgmt_metrics_collector.erl, line 106)
crasher:
  initial call: rabbit_mgmt_metrics_collector:init/1
  pid: <0.7180.0>
  registered_name: connection_coarse_metrics_metrics_collector
  exception error: an error occurred when evaluating an arithmetic expression
    in operator  +/2
       called as 19340950184 + ''
    in call from rabbit_mgmt_metrics_collector:sum_entry/2 (rabbit_mgmt_metrics_collector.erl, line 631)
    in call from rabbit_mgmt_metrics_collector:'-insert_entry_op/4-fun-0-'/2 (rabbit_mgmt_metrics_collector.erl, line 598)
    in call from maps:update_with/4 (maps.erl, line 296)
    in call from rabbit_mgmt_metrics_collector:insert_entry_op/4 (rabbit_mgmt_metrics_collector.erl, line 597)
    in call from lists:foldl/3 (lists.erl, line 1594)
    in call from rabbit_mgmt_metrics_collector:aggregate_entry/4 (rabbit_mgmt_metrics_collector.erl, line 235)
    in call from lists:foldl/3 (lists.erl, line 1594)
crasher:
  initial call: cowboy_stream_h:request_process/3
  pid: <0.95144.0>
  registered_name: []
  exception error: an error occurred when evaluating an arithmetic expression
    in operator  +/2
       called as '' + 4193012
    in call from prometheus_rabbitmq_core_metrics_collector:'-get_data/3-fun-5-'/3 (prometheus_rabbitmq_core_metrics_collector.erl, line 648)
    in call from lists:foldl/3 (lists.erl, line 1594)
    in call from ets:do_foldl/4 (ets.erl, line 656)
    in call from ets:foldl/3 (ets.erl, line 645)
    in call from prometheus_rabbitmq_core_metrics_collector:get_data/3 (prometheus_rabbitmq_core_metrics_collector.erl, line 633)
    in call from prometheus_rabbitmq_core_metrics_collector:'-collect/5-lc$^0/1-0-'/6 (prometheus_rabbitmq_core_metrics_collector.erl, line 327)
    in call from prometheus_rabbitmq_core_metrics_collector:collect/5 (prometheus_rabbitmq_core_metrics_collector.erl, line 329)

Observed on the main branch on November 26

Reproduction steps

Not clear

Expected behavior

No crashes

Additional context

No response

mkuratczyk commented 3 days ago

I'm not sure if that's because I'm testing a slightly different scenario perhaps these errors are caused by some recent changes but I haven't seen them in the past at all and now I see them very often.

michaelklishin commented 3 days ago

We have seen these before, usually on a freshly booted node when some values are not yet available. '' is used as undefined historically by a few metrics that come from the environment, e.g. free disk space, which is not computed instantly the node starts booting.

The Prometheus plugin should filter such values out because if a data point isn't available yet, what else can it do?

mkuratczyk commented 1 day ago

I've seen this enough times now to say that it's certainly not about a freshly booted node, but about a queue getting deleted.

michaelklishin commented 22 hours ago

@mkuratczyk yup, that can be another case where some metrics no longer exist. During a boot, they do not yet exist, and after queue deletion, they no longer exist.

I am all for making the code more defensive but if some samples are missing… what other than an error can the Prometheus scraping API endpoint return?