influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

incorrect display of mem_used_percent MacOs #10632

Closed tetesh closed 2 years ago

tetesh commented 2 years ago

Relevant telegraf.conf

[global_tags]
  dc = "denver-1"

[agent]
  interval = "10s"

[[inputs.exec]]
  commands = [ "/etc/telegraf/hardware/temp.sh",
    "/etc/telegraf/hardware/periphery.sh",
    "/etc/telegraf/hardware/cpuusage.sh",
    "/etc/telegraf/hardware/serialnumber.sh",
    "/etc/telegraf/hardware/osversion.sh",
    "/etc/telegraf/service/ldap.sh",
    "/etc/telegraf/service/krb.sh",
    "/etc/telegraf/service/iscsi.sh",
    "/etc/telegraf/service/moulinette.sh",
    "/etc/telegraf/service/vogsphere.sh",
    "/etc/telegraf/alert/examusb.sh",
    "/etc/telegraf/login.sh",
    "/etc/telegraf/exam.sh" ]
  timeout = "10s"
  data_format = "influx"

[[inputs.disk]]
  tagexclude = ["fstype"]

[[inputs.mem]]

[[inputs.net]]
  interfaces = ["en*", "lo0"]

[[outputs.prometheus_client]]
  listen = ":9273"
  metric_version = 2

Logs from Telegraf

[root] pr-h4 [~] # tail -f /var/log/telegraf.err.log
2022-02-11T07:19:53Z I! [agent] Stopping running outputs
2022-02-11T07:19:57Z I! Starting Telegraf 1.21.3
2022-02-11T07:19:57Z I! Using config file: /etc/telegraf/telegraf.conf
2022-02-11T07:19:57Z I! Loaded inputs: disk exec mem net
2022-02-11T07:19:57Z I! Loaded aggregators: 
2022-02-11T07:19:57Z I! Loaded processors: 
2022-02-11T07:19:57Z I! Loaded outputs: prometheus_client
2022-02-11T07:19:57Z I! Tags enabled: dc=denver-1 host=pr-h4.kzn.21-school.ru
2022-02-11T07:19:57Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"pr-h4.kzn.21-school.ru", Flush Interval:10s
2022-02-11T07:19:57Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics

System info

Telegraf 1.17.2 (git: HEAD 74011e22), macOS Mojave 10.14.6

Docker

No response

Steps to reproduce

  1. I look at the metrics at http://hostname:9273/metrics

Expected behavior

I expect mem_used_percent to reflect real memory usage. Instead, Telegraf reports about 90%:

# HELP mem_total Telegraf collected metric
# TYPE mem_total gauge
mem_total{dc="denver-1",host="pr-h4.kzn.21-school.ru"} 8.589934592e+09
# HELP mem_used Telegraf collected metric
# TYPE mem_used gauge
mem_used{dc="denver-1",host="pr-h4.kzn.21-school.ru"} 7.747760128e+09
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent gauge
mem_used_percent{dc="denver-1",host="pr-h4.kzn.21-school.ru"} 90.19579887390137

Actual behavior

These values do not look real: no one is using the iMac, and the top utility reports roughly half of the memory in use:

Load Avg: 1.50, 1.02, 0.70
CPU usage: 0.32% user, 0.32% sys, 99.34% idle
SharedLibs: 119M resident, 39M data, 28M linkedit.
MemRegions: 14416 total, 1087M resident, 77M private, 560M shared.
PhysMem: 4150M used (1419M wired), 4042M unused.
VM: 802G vsize, 1372M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 18793774/4189M in, 8271302/2404M out.
Disks: 1902868/27G read, 11020419/85G written.

Additional info

No response

powersj commented 2 years ago

Hi,

I am not sure I follow. I see the following on my M1:

# HELP mem_active Telegraf collected metric
# TYPE mem_active gauge
mem_active{host="mbp"} 6.01956352e+09
# HELP mem_available Telegraf collected metric
# TYPE mem_available gauge
mem_available{host="mbp"} 9.091268608e+09
# HELP mem_available_percent Telegraf collected metric
# TYPE mem_available_percent gauge
mem_available_percent{host="mbp"} 52.918148040771484
# HELP mem_free Telegraf collected metric
# TYPE mem_free gauge
mem_free{host="mbp"} 8.527872e+08
# HELP mem_inactive Telegraf collected metric
# TYPE mem_inactive gauge
mem_inactive{host="mbp"} 8.238481408e+09
# HELP mem_total Telegraf collected metric
# TYPE mem_total gauge
mem_total{host="mbp"} 1.7179869184e+10
# HELP mem_used Telegraf collected metric
# TYPE mem_used gauge
mem_used{host="mbp"} 8.088600576e+09
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent gauge
mem_used_percent{host="mbp"} 47.081851959228516
# HELP mem_wired Telegraf collected metric
# TYPE mem_wired gauge
mem_wired{host="mbp"} 1.386201088e+09

Those percent values look correct for my system.

Can you point out what you think should be different?

tetesh commented 2 years ago

Telegraf says that 90% of RAM is in use, but that is not the case. I attached the output of the top command above, where you can see that only about half of the memory is in use.

We have more than 500 iMacs, and some of them hit this bug from time to time.

powersj commented 2 years ago

@tetesh in the same terminal, can you please get the output of vm_stat and then run telegraf to collect the memory information using the following config and command:

[[inputs.mem]]
[[outputs.file]]

telegraf --config config.toml --once --debug

On macOS, Telegraf uses the gopsutil library to find the following fields:

fields["active"] = vm.Active
fields["free"] = vm.Free
fields["inactive"] = vm.Inactive
fields["wired"] = vm.Wired

tetesh commented 2 years ago

top: PhysMem: 4329M used (1426M wired), 3861M unused.

metrics:

# TYPE mem_total gauge
mem_total{dc="denver-1",host="pr-h4.kzn.21-school.ru"} 8.589934592e+09
# HELP mem_used Telegraf collected metric
# TYPE mem_used gauge
mem_used{dc="denver-1",host="pr-h4.kzn.21-school.ru"} 7.3059328e+09
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent gauge
mem_used_percent{dc="denver-1",host="pr-h4.kzn.21-school.ru"} 85.0522518157959

vm_stat and debug telegraf

[root] pr-h4 [~] # vm_stat                                                            
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                               98230.
Pages active:                            522286.
Pages inactive:                          217409.
Pages speculative:                       893692.
Pages throttled:                              0.
Pages wired down:                        365059.
Pages purgeable:                           5001.
"Translation faults":               25978118772.
Pages copy-on-write:                 3394766164.
Pages zero filled:                   3347016987.
Pages reactivated:                        38870.
Pages purged:                             27637.
File-backed pages:                      1193587.
Anonymous pages:                         439800.
Pages stored in compressor:                   0.
Pages occupied by compressor:                 0.
Decompressions:                               0.
Compressions:                                 0.
Pageins:                                1415887.
Pageouts:                                     9.
Swapins:                                      0.
Swapouts:                                     0.
[root] pr-h4 [~] # 
[root] pr-h4 [~] # telegraf --config config.toml --once --debug                     
2022-02-14T15:55:17Z I! Starting Telegraf 1.17.2
2022-02-14T15:55:17Z D! [agent] Initializing plugins
2022-02-14T15:55:17Z D! [agent] Connecting outputs
2022-02-14T15:55:17Z D! [agent] Attempting connection to [outputs.file]
2022-02-14T15:55:17Z D! [agent] Successfully connected to outputs.file
2022-02-14T15:55:17Z D! [agent] Starting service inputs
2022-02-14T15:55:17Z D! [agent] Stopping service inputs
2022-02-14T15:55:17Z D! [agent] Input channel closed
2022-02-14T15:55:17Z I! [agent] Hang on, flushing any cached metrics before shutdown
mem,host=pr-h4.kzn.21-school.ru wired=1495707648i,used=7311187968i,available_percent=14.88656997680664,active=2153500672i,inactive=890507264i,total=8589934592i,available=1278746624i,used_percent=85.11343002319336,free=388239360i 1644854118000000000
2022-02-14T15:55:17Z D! [outputs.file] Wrote batch of 1 metrics in 62.409µs
2022-02-14T15:55:17Z D! [outputs.file] Buffer fullness: 0 / 10000 metrics
2022-02-14T15:55:17Z D! [agent] Stopped Successfully

powersj commented 2 years ago

The discrepancy here is that top and gopsutil do not determine the used percentage the same way.

Here is the calculation Telegraf does for used percentage:

used percent = used / total * 100
used percent = 7311187968 / 8589934592 * 100
used percent = 85.11343002319336
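
As a quick check of that arithmetic, here is a minimal Python sketch that recomputes the percentage from the fields in the debug output posted above. The function name used_percent is hypothetical and is not part of Telegraf or gopsutil; the numbers are copied from the mem metric line in this thread.

```python
def used_percent(used: int, total: int) -> float:
    """Return used memory as a percentage of total memory."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total * 100


# Field values copied from the `mem` metric line posted above.
total = 8589934592
used = 7311187968
available = 1278746624
free = 388239360
inactive = 890507264

print(f"used_percent = {used_percent(used, total):.2f}")

# In this sample, available equals free + inactive and used equals
# total - available, so inactive pages are counted as available, not used.
print(available == free + inactive)
print(used == total - available)
```

In this sample both comparisons hold, which suggests that everything other than free and inactive memory ends up in the used figure.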

Now let's look at the vm_stat output. I've multiplied the page counts by the page size (4096 bytes) and converted to MB:

Pages free:          98230    402 MB
Pages active:       522286   2139 MB
Pages inactive:     217409    890 MB
Pages speculative:  893692   3661 MB
Pages wired down:   365059   1495 MB
------------------------------------
Total:             2096676   8588 MB
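
The conversion in the table can be reproduced with a short Python sketch. It assumes decimal megabytes (10**6 bytes) and the 4096-byte page size from the vm_stat header; parse_vm_stat and pages_to_mb are hypothetical helper names, and the input is an excerpt of the vm_stat output posted earlier.

```python
import re

PAGE_SIZE = 4096  # bytes, from the "page size of 4096 bytes" header line

# Excerpt of the vm_stat output posted earlier in this thread.
VM_STAT = """\
Pages free:                               98230.
Pages active:                            522286.
Pages inactive:                          217409.
Pages speculative:                       893692.
Pages wired down:                        365059.
"""


def parse_vm_stat(text: str) -> dict[str, int]:
    """Map each 'Pages <name>:' line to its page count."""
    pages = {}
    for line in text.splitlines():
        match = re.match(r"Pages (.+?):\s+(\d+)\.$", line.strip())
        if match:
            pages[match.group(1)] = int(match.group(2))
    return pages


def pages_to_mb(count: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a page count to decimal megabytes (10**6 bytes)."""
    return count * page_size / 1_000_000


pages = parse_vm_stat(VM_STAT)
for name, count in pages.items():
    label = f"Pages {name}:"
    print(f"{label:<20}{count:>9}{pages_to_mb(count):>8.0f} MB")

total_pages = sum(pages.values())
print(f"{'Total:':<20}{total_pages:>9}{pages_to_mb(total_pages):>8.0f} MB")
```

Only the five page classes listed in the table are summed; the other vm_stat counters (faults, pageins, and so on) are not memory sizes and are left out.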

Some definitions:

Here is one way to calculate used percentage using vm_stat, treating both free and inactive pages as not used:

used percent = (total - free - inactive) / total * 100
used percent = (2096676 - 98230 - 217409) / 2096676 * 100
used percent = 84.95%
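
Evaluating that expression with the page counts from the table gives roughly 84.95%, close to the 85.11% Telegraf reported. A minimal Python sketch, assuming free and inactive pages are the only ones treated as not used (the function name is hypothetical):

```python
def used_percent_from_pages(total: int, free: int, inactive: int) -> float:
    """Treat free and inactive pages as available; everything else is used."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - free - inactive) / total * 100


# Page counts taken from the vm_stat table above.
pct = used_percent_from_pages(total=2096676, free=98230, inactive=217409)
print(f"{pct:.2f}%")
```

The small gap to Telegraf's figure is expected, since vm_stat and Telegraf sampled the counters at slightly different moments.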

top is reporting a different set of values:

PhysMem: 4150M used (1419M wired), 4042M unused.

My guess is they are using pages active + wired or active + inactive + wired.
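
That guess can be checked roughly against the numbers in this thread. The Python sketch below converts both candidate sums to the unit top appears to print, assuming top's "M" suffix means mebibytes (1024 * 1024 bytes); that unit is an assumption, and pages_to_mib is a hypothetical helper.

```python
PAGE_SIZE = 4096   # bytes per page, from the vm_stat header
MIB = 1024 * 1024  # assumption: top's "M" suffix means mebibytes

# Page counts from the vm_stat output in this thread.
active, inactive, wired = 522286, 217409, 365059


def pages_to_mib(pages: int) -> float:
    """Convert a page count to mebibytes."""
    return pages * PAGE_SIZE / MIB


candidates = {
    "active + wired": active + wired,
    "active + inactive + wired": active + inactive + wired,
}
for label, pages in candidates.items():
    print(f"{label}: {pages_to_mib(pages):.0f}M")
```

The top sample posted alongside that vm_stat output showed 4329M used, so under these assumptions the second sum (about 4315M) is the closer match. The two tools were not sampled at the same instant, so this is suggestive rather than conclusive.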

The way memory is calculated can vary from one tool to another, depending on which classifications are counted as used. Consider looking at linuxatemyram.com.

As I said, these are using different ways to calculate free. As such I do not consider this a bug in Telegraf and will be closing this.