rabbitmq / rabbitmq-prometheus

A minimalistic Prometheus exporter of core RabbitMQ metrics
Other
145 stars 109 forks source link

Consumer utilisation is reported as NaN by this plugin but not CLI tools #32

Closed CrazyMushu closed 4 years ago

CrazyMushu commented 4 years ago

I have 3 rabbitmq hosts which are united to the cluster. НА has following settings:

ha-mode:    all
ha-sync-mode:   automatic

When I turn on the setting of _rabbitmqctl eval 'application:set_env(rabbitmq_prometheus, return_per_objectmetrics, true).' I get the result by each queue, but the r_abbitmq_queue_consumerutilisation parameter in all queues is visible in NaN. There are near 800 queues in the cluster and the _collect_statisticsinterval parameter is used in 15000.

Upon the request to queues via rabbitmqctl, I get the correct data:

$ rabbitmqctl --vhost=test list_queues name messages messages_ready state consumer_utilisation --formatter json | column -tx -c "name, messages, messages_ready, state,consumer_utilisation"| grep -e TestRequestQueue

{"name":"TestRequestQueue","messages":0,"messages_ready":0,"state":"running","consumer_utilisation":1.0}

But via a _rabbitmqctl --vhost test eval 'io:format("QUEUE STATS:~n~p~n~nQUEUE METRICS:~n~p~n~n", [ets:tab2list(queue_stats), ets:tab2list(queuemetrics)]).' request I get the following:

 {{resource,<<"test">>,queue,<<"TestRequestQueue">>},
  [{idle_since,<<"2020-04-09 11:10:38">>},
   {consumer_utilisation,''},
   {policy,<<"ha-all">>},
   {operator_policy,''},
   {effective_policy_definition,[{<<"ha-mode">>,<<"all">>},
                                 {<<"ha-sync-mode">>,<<"automatic">>}]},
   {exclusive_consumer_tag,''},
   {single_active_consumer_tag,''},
   {consumers,6},
   {memory,58020},
   {slave_nodes,['rabbit@rabbit-02','rabbit@rabbit-03']},
   {synchronised_slave_nodes,['rabbit@rabbit-03',
                              'rabbit@rabbit-02']},
   {recoverable_slaves,''},
   {state,running},
   {garbage_collection,[{max_heap_size,0},
                        {min_bin_vheap_size,46422},
                        {min_heap_size,233},
                        {fullsweep_after,65535},
                        {minor_gcs,226}]},
   {messages_ram,0},
   {messages_ready_ram,0},
   {messages_unacknowledged_ram,0},
   {messages_persistent,0},
   {message_bytes,0},
   {message_bytes_ready,0},
   {message_bytes_unacknowledged,0},
   {message_bytes_ram,0},
   {message_bytes_persistent,0},
   {head_message_timestamp,''},
   {backing_queue_status,#{avg_ack_egress_rate => 0.08002020176251025,
                           avg_ack_ingress_rate => 0.08002020176251025,
                           avg_egress_rate => 0.08002020176251025,
                           avg_ingress_rate => 0.08002020176251025,
                           delta => [delta,undefined,0,0,undefined],
                           len => 0,mirror_seen => 0,mirror_senders => 6,
                           mode => default,next_seq_id => 70,q1 => 0,q2 => 0,
                           q3 => 0,q4 => 0,target_ram_count => infinity}},
   {messages_paged_out,0},
   {message_bytes_paged_out,0}]},

Upon seen via a rabbitmq_prometheus plugin I get the following:

$ curl -s localhost:15692/metrics | grep TestRequestQueue
rabbitmq_queue_messages_published_total{channel="<0.13860.0>",queue_vhost="test",queue="TestRequestQueue",exchange_vhost="test",exchange="TestExchange"} 11
rabbitmq_queue_messages_published_total{channel="<0.13864.0>",queue_vhost="test",queue="TestRequestQueue",exchange_vhost="test",exchange="TestExchange"} 10
rabbitmq_queue_messages_ready{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_process_reductions_total{vhost="test",queue="TestRequestQueue"} 287651
rabbitmq_queue_consumers{vhost="test",queue="TestRequestQueue"} 6
rabbitmq_queue_consumer_utilisation{vhost="test",queue="TestRequestQueue"} NaN
rabbitmq_queue_process_memory_bytes{vhost="test",queue="TestRequestQueue"} 57956
rabbitmq_queue_messages_ram{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ram_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ready_ram{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked_ram{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_persistent{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_persistent_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ready_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_paged_out{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_paged_out_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_disk_reads_total{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_disk_writes_total{vhost="test",queue="TestRequestQueue"} 0

NaN value is displayed on each node of our cluster, accordingly - this value is transferred to the prometheus too. Could you, please, advise what could be the problem in?

michaelklishin commented 4 years ago

The problem is likely in the plugin since both CLI tools and metric tables suggest you do have consumers online.

On an unrelated note, "ha-mode": "all" is highly discouraged and completely unnecessary most of the time.

CrazyMushu commented 4 years ago

6 consumers who are online are connected to this queue.

$ rabbitmqctl --vhost=test list_queues name messages messages_ready state consumers consumer_utilisation --formatter json | grep -e TestRequestQueue

,{"name":"TestRequestQueue","messages":0,"messages_ready":0,"state":"running","consumers":6,"consumer_utilisation":1.0}
michaelklishin commented 4 years ago

The number of messages ready is not very interesting when it comes to reasoning about consumer utilisation. Delivery rate > 0 is.

CrazyMushu commented 4 years ago

I managed to conduct one more series of experiments.

When messages appear in the queue and consumers start working, the consumer_utilisation parameter in the rabbitmq_prometheus plugin goes from the state NaN to the state 1.0. When consumers stop coping, they start to aim to 0.

$ rabbitmqctl --vhost=test list_queues name messages messages_ready state consumers consumer_utilisation --formatter json | grep -e TestRequestQueue

,{"name":"TestRequestQueue","messages":170,"messages_ready":164,"state":"running","consumers":6,"consumer_utilisation":0.028672856706219607}
curl -s localhost:15692/metrics | grep TestRequestQueue
rabbitmq_channel_get_ack_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_ack_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_delivered_ack_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 899
rabbitmq_channel_messages_delivered_ack_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 906
rabbitmq_channel_messages_delivered_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_delivered_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_redelivered_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_redelivered_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_acked_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 898
rabbitmq_channel_messages_acked_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 905
rabbitmq_channel_get_empty_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_empty_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ready{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked{vhost="test",queue="TestRequestQueue"} 1
rabbitmq_queue_messages{vhost="test",queue="TestRequestQueue"} 1
rabbitmq_queue_process_reductions_total{vhost="test",queue="TestRequestQueue"} 5178725
rabbitmq_queue_consumers{vhost="test",queue="TestRequestQueue"} 6
rabbitmq_queue_consumer_utilisation{vhost="test",queue="TestRequestQueue"} 0.9008939984752331
rabbitmq_queue_process_memory_bytes{vhost="test",queue="TestRequestQueue"} 429044
rabbitmq_queue_messages_ram{vhost="test",queue="TestRequestQueue"} 1
rabbitmq_queue_messages_ram_bytes{vhost="test",queue="TestRequestQueue"} 331
rabbitmq_queue_messages_ready_ram{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked_ram{vhost="test",queue="TestRequestQueue"} 1
rabbitmq_queue_messages_persistent{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_persistent_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_bytes{vhost="test",queue="TestRequestQueue"} 331
rabbitmq_queue_messages_ready_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked_bytes{vhost="test",queue="TestRequestQueue"} 331
rabbitmq_queue_messages_paged_out{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_paged_out_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_disk_reads_total{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_disk_writes_total{vhost="test",queue="TestRequestQueue"} 0

When messages stop to go to queue, the consumer_utilisation parameter (after some time) goes to the NaN value.

$ rabbitmqctl --vhost=test list_queues name messages messages_ready state consumers consumer_utilisation --formatter json | grep -e TestRequestQueue

,{"name":"TestRequestQueue","messages":0,"messages_ready":0,"state":"running","consumers":6,"consumer_utilisation":1.0}
curl -s localhost:15692/metrics | grep TestRequestQueue
rabbitmq_channel_get_ack_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_ack_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_delivered_ack_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 1514
rabbitmq_channel_messages_delivered_ack_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 1519
rabbitmq_channel_messages_delivered_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_delivered_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_redelivered_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_redelivered_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_messages_acked_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 1514
rabbitmq_channel_messages_acked_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 1519
rabbitmq_channel_get_empty_total{channel="<0.19696.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_channel_get_empty_total{channel="<0.19691.1>",vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ready{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_process_reductions_total{vhost="test",queue="TestRequestQueue"} 8245549
rabbitmq_queue_consumers{vhost="test",queue="TestRequestQueue"} 6
rabbitmq_queue_consumer_utilisation{vhost="test",queue="TestRequestQueue"} NaN
rabbitmq_queue_process_memory_bytes{vhost="test",queue="TestRequestQueue"} 145556
rabbitmq_queue_messages_ram{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ram_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ready_ram{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked_ram{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_persistent{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_persistent_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_ready_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_unacked_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_paged_out{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_messages_paged_out_bytes{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_disk_reads_total{vhost="test",queue="TestRequestQueue"} 0
rabbitmq_queue_disk_writes_total{vhost="test",queue="TestRequestQueue"} 0

Due to such behaviour of the plugin we get gaps in the data on the prometheus and grafana graph.

4b5c8beb7cb239d3ebd560b220abc97d

Advise, please, is this normal behaviour for the system?

michaelklishin commented 4 years ago

It is. If we oversimplify, consumer utilization is the % of time when RabbitMQ has considered delivering one or more messages to a consumer but could not do it due to the effective prefetch and current rate of acknowledgments.

If there are no acknowledgments flowing it's a good question what this value is since it cannot really be computed.

CrazyMushu commented 4 years ago

@michaelklishin I still have a question - if the plugin calculates the consumer utilization parameter as % of the time divided by the number of messages, then it turns out 0/0 = NaN. In this case, I have two questions:

michaelklishin commented 4 years ago

Different tools can choose to visualize such "no value to report" scenario differently. I think reporting it as 1 (perfect utilization) would be wrong.

On Fri, 10 Apr 2020 at 11:10, CrazyMushu notifications@github.com wrote:

@michaelklishin https://github.com/michaelklishin I still have a question - if the plugin calculates the consumer utilization parameter as % of the time divided by the number of messages, then it turns out 0/0 = NaN. In this case, I have two questions:

  • Why does the rabbitmqctl application display this parameter differently? Why not get the value of the consumer utilization parameter using it?
  • If 0/0 = NaN, then can we set the value NaN in the rabbitmq_prometheus plugin equal to 1? In this case we could remove the holes in the graphs.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/rabbitmq/rabbitmq-prometheus/issues/32#issuecomment-611929976, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAIQXWAHLQLUOJ2JWUD3DRL3IAZANCNFSM4MEVPUEA .

-- Staff Software Engineer, Pivotal/RabbitMQ