Closed salapat11 closed 5 years ago
The "-pollInterval" command line parameter can be used to collect channel status less frequently than the Prometheus scrape interval, but there's no way to collect some channel status at one interval and a different set at another.
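As a rough illustration of what "less frequently than the scrape interval" means, the sketch below keeps the last set of status values and only re-runs the expensive collection once the poll interval has elapsed; scrapes in between get the cached values. This is not the collector's actual code. `PollGate`, `collect_fn` and the injectable `clock` are hypothetical names used only for this example.

```python
import time
from typing import Callable, Dict, Optional


class PollGate:
    """Re-run an expensive collection only when poll_interval seconds have passed."""

    def __init__(
        self,
        poll_interval: float,
        collect_fn: Callable[[], Dict[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.poll_interval = poll_interval
        self.collect_fn = collect_fn      # hypothetical: stands in for the channel status inquiry
        self.clock = clock                # injectable so the behaviour can be tested
        self._last_poll: Optional[float] = None
        self._cached: Dict[str, float] = {}

    def on_scrape(self) -> Dict[str, float]:
        """Called on every scrape; collects only if the poll interval is due."""
        now = self.clock()
        due = self._last_poll is None or now - self._last_poll >= self.poll_interval
        if due:
            self._cached = self.collect_fn()
            self._last_poll = now
        return self._cached
```

With a 15 second scrape interval and a 60 second poll interval, only every fourth scrape would trigger a fresh collection under this scheme. Every collection still covers all monitored channels at once, which is why it does not help with splitting the work into batches.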
The pollInterval parameter makes the collector sample less frequently, but whenever samples are collected the CPU spikes if a queue manager has more than 500 channels. We could not work out what is consuming the CPU. Is it the sample collection (running the inquire channel command and processing the data) or sending the response?
I set up 2 qmgrs with 500 pairs of channels between them (so 1000 total on each), and monitored all the channels of one of the qmgrs via a client connection. All running on a single box, with everything else also running there (Prometheus and Grafana servers, other work too). There's no noticeable CPU bump at the collection intervals. I'm also running a Prometheus collector for system stats such as CPU. The box keeps going at around 9-12% CPU at steady-state.
Clearly there's work going on for the qmgr to generate all the status responses and then for the collector to get them off the replyQ and parse them. And that might be a reasonable amount of work for a short period, but it's not showing up on this machine.
I don't know what "400%" usage can mean for a system. Surely it can never be more than 100%. How are you measuring that? And how long does it last? Maybe if you lowered the collector process priority it would use the same amount of total CPU spread out over a longer period and not show a spike?
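One way to try the priority suggestion on Linux or another Unix-like system is to raise the collector's nice value. The sketch below is only an assumption about how that could be done from Python; `lower_priority` is a hypothetical helper, and finding the collector's PID is left out. It uses the standard `os.getpriority` and `os.setpriority` calls, which are Unix-only.

```python
import os


def lower_priority(pid: int, niceness: int = 10) -> int:
    """Raise the nice value of `pid` to at least `niceness` and return the new value.

    A higher nice value means a lower scheduling priority. The process is left
    alone if it is already at or above the requested niceness, because lowering
    the nice value again normally needs elevated privileges.
    """
    current = os.getpriority(os.PRIO_PROCESS, pid)
    target = max(current, niceness)
    if target != current:
        os.setpriority(os.PRIO_PROCESS, pid, target)
    return os.getpriority(os.PRIO_PROCESS, pid)
```

This does not reduce the total CPU the collection needs. It only lets other work on the box be scheduled ahead of it, so any benefit would show up only when the machine is otherwise busy.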
Mark, thanks again for your response. When we monitored the CPU with the "top" command on the server where the MQ and Prometheus processes are running, we noticed it went to "400%". It is a 24-CPU machine, so CPU consumption can be shown as more than 100%. The spike lasts for a couple of seconds whenever the scrape happens.
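To put the "400%" figure on the same scale as a whole-machine percentage: top in its default mode reports CPU relative to a single CPU, so the number can be divided by the CPU count. A minimal sketch of that arithmetic, with `machine_share` as a hypothetical helper name:

```python
def machine_share(top_percent: float, cpu_count: int) -> float:
    """Convert a per-CPU percentage (as shown by top) to a share of the whole machine."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return top_percent / cpu_count


# Figures from this thread: 400% on a 24-CPU machine.
print(f"{machine_share(400.0, 24):.1f}% of the whole machine")  # prints "16.7% of the whole machine"
```

So 400% here corresponds to roughly four CPUs fully busy, or about 16.7% of the machine, for the couple of seconds the collection takes.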
We have some queue managers in production which have 3000+ channels. When the Prometheus process runs to collect channel data, CPU spikes to 400%. Is there any way to get the data in multiple batches?