it-novum / openitcockpit-agent-go

Cross-Platform Monitoring Agent for openITCOCKPIT written in Go
https://openitcockpit.io/download_agent/
Apache License 2.0

wrong agent reports about CPU usage percentage #67

Closed exa-mk closed 2 years ago

exa-mk commented 2 years ago

Agent Mode:

Versions

Operating system: CentOS 7

Describe the bug: We have a multicore VM (Proxmox, QEMU) running CentOS. Inside the VM, a Java process occasionally uses some CPU, with higher usage only for brief moments. However, the openitcockpit-agent permanently reports a high CPU usage percentage (not CPU load). The JSON data the agent sends to the server is confusing: the overall CPU percentage values per core are >90%, while the detailed data shows the cores mostly idling:

    "cpu": {
        "cpu_total_percentage": 93.31619535857195,
        "cpu_percentage": [
            94.9494949729538,
            90.7216494253015,
            94.84536080602847,
            95.87628863882199,
            97.00000004528556,
            95.9183673063922,
            93.939393961665,
            97.02970298992592
        ],
        "cpu_total_percentage_detailed": {
            "User": 3.0498302651464524,
            "Nice": 0.0008836756245381338,
            "System": 2.3222081128788647,
            "Idle": 0,
            "Iowait": 0.2657092991871487
        },
        "cpu_percentage_detailed": [
            {
                "User": 2.9821197101994827,
                "Nice": 0.0006286550363650479,
                "System": 2.3251162828371355,
                "Idle": 94.14644937469919,
                "Iowait": 0.28717187840741965
            },
            {
                "User": 2.7423516923218765,
                "Nice": 0.0004413339014350094,
                "System": 2.1970944342678522,
                "Idle": 94.54973373361082,
                "Iowait": 0.271350739471736
            },
            {
                "User": 3.099431842030061,
                "Nice": 0.001206280190953449,
                "System": 2.337755137960008,
                "Idle": 94.0481285338807,
                "Iowait": 0.26045670332371107
            },
            {
                "User": 3.1353064779642503,
                "Nice": 0.0007755201079033074,
                "System": 2.3267913221959056,
                "Idle": 94.03181978399546,
                "Iowait": 0.2529458108375193
            },
            {
                "User": 3.1548493776460447,
                "Nice": 0.0008497575853960074,
                "System": 2.3445164523950646,
                "Idle": 93.98026823379509,
                "Iowait": 0.25625641077105304
            },
            {
                "User": 3.1133073091819634,
                "Nice": 0.0008726655341579738,
                "System": 2.347272050009365,
                "Idle": 94.01890774853698,
                "Iowait": 0.2555805955615638
            },
            {
                "User": 3.1298523868007604,
                "Nice": 0.0015841396195319306,
                "System": 2.3541260290605366,
                "Idle": 93.98625742297747,
                "Iowait": 0.25243670574560345
            },
            {
                "User": 3.0408763344229244,
                "Nice": 0.0007092811409876129,
                "System": 2.3447895870386293,
                "Idle": 94.05357570272128,
                "Iowait": 0.28949645001789487
            }
        ]
    },
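
The mismatch can be checked mechanically. The following is only a rough sketch, assuming that a core's `cpu_percentage` value should roughly equal 100 minus its `Idle` value in `cpu_percentage_detailed`. The names `CPU`, `ParseCPU` and `InconsistentCores` are hypothetical helpers and not part of the agent, and the input is assumed to be wrapped as `{"cpu": {...}}`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// CPU mirrors only the fields of the agent's "cpu" object that are used here.
// The type name is made up for this sketch.
type CPU struct {
	Percentage []float64 `json:"cpu_percentage"`
	Detailed   []struct {
		User, Nice, System, Idle, Iowait float64
	} `json:"cpu_percentage_detailed"`
}

// ParseCPU decodes a JSON document of the form {"cpu": {...}}.
func ParseCPU(data []byte) (CPU, error) {
	var doc struct {
		CPU CPU `json:"cpu"`
	}
	err := json.Unmarshal(data, &doc)
	return doc.CPU, err
}

// InconsistentCores returns the indices of all cores whose reported usage
// differs from (100 - Idle) by more than tol percentage points.
// Assumption: usage and idle time of one core should add up to about 100.
func InconsistentCores(c CPU, tol float64) []int {
	var bad []int
	for i, p := range c.Percentage {
		if i >= len(c.Detailed) {
			break
		}
		if math.Abs(p-(100-c.Detailed[i].Idle)) > tol {
			bad = append(bad, i)
		}
	}
	return bad
}

func main() {
	// Core 0 uses (rounded) values from the report above.
	// Core 1 is invented to show what a consistent core would look like.
	sample := []byte(`{"cpu": {
		"cpu_percentage": [94.9494949729538, 5.9],
		"cpu_percentage_detailed": [
			{"User": 2.98, "Nice": 0.0, "System": 2.33, "Idle": 94.15, "Iowait": 0.29},
			{"User": 3.0, "Nice": 0.0, "System": 2.6, "Idle": 94.1, "Iowait": 0.3}
		]}}`)

	cpu, err := ParseCPU(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("inconsistent cores:", InconsistentCores(cpu, 5))
}
```

With the values from the report, every core would be flagged, because each one shows more than 90% usage and about 94% idle at the same time.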

To Reproduce: not sure

Expected behavior: Consistent CPU usage percentages in the agent's JSON output.

Screenshots: see snippet above

Additional context: n/a

nook24 commented 2 years ago

I will take a look at this

nook24 commented 2 years ago

I was able to reproduce this. Hopefully I can fix it as well^^

{
    "agent": {
        "last_updated": "2022-04-21 08:34:45.822389661 +0200 CEST m=+1591.430854078",
        "last_updated_timestamp": 1650522885,
        "system": "ubuntu",
        "system_uptime": 1596,
        "kernel_version": "5.4.0-100-generic",
        "mac_version": "20.04",
        "windows_release_id": "",
        "windows_current_build": "",
        "family": "debian",
        "agent_version": "3.0.8",
        "temperature_unit": "C",
        "goos": "linux",
        "goarch": "amd64"
    },
    "cpu": {
        "cpu_total_percentage": 21.410579345085928,
        "cpu_percentage": [
            66.66666666668199,
            49.00000000000091,
            55.999999999988916,
            49.504950495055574
        ],
        "cpu_total_percentage_detailed": {
            "User": 4.004988695024234,
            "Nice": 0.27813078453693746,
            "System": 2.259891459619005,
            "Idle": 0,
            "Iowait": 0.6525982523800364
        },
        "cpu_percentage_detailed": [
            {
                "User": 3.210962910667911,
                "Nice": 0.37757494768160166,
                "System": 1.8721161846650365,
                "Idle": 93.96132220569326,
                "Iowait": 0.17964751266987725
            },
            {
                "User": 3.32591579305693,
                "Nice": 0.27800016390032345,
                "System": 1.9472619190206326,
                "Idle": 94.01133433774814,
                "Iowait": 0.15318376378181087
            },
            {
                "User": 3.327618159125579,
                "Nice": 0.1828015279686338,
                "System": 1.943999697431954,
                "Idle": 94.13144060211042,
                "Iowait": 0.18910502893306944
            },
            {
                "User": 6.158919062328262,
                "Nice": 0.2735191747680141,
                "System": 3.2778083091714203,
                "Idle": 87.87987896934439,
                "Iowait": 2.0908740611596457
            }
        ]
    }
}

nook24 commented 2 years ago

This issue was resolved with version 3.0.9 of the agent.

The issue was caused by taking too many separate measurements, which were spread across 2 seconds. Each measurement on its own was correct, but two seconds is a long time for a computer, and if the kernel moves a process from one CPU core to another in between, the combined results become quite confusing (and wrong).

We resolved this by taking only one measurement and calculating all the required numbers ourselves instead of calling multiple functions from gopsutil.

The calculations are the same as htop uses. https://github.com/it-novum/openitcockpit-agent-go/blob/50ef9028675b8e70a989dc07771ed8048347e960/checks/cpu_posix.go#L29-L45
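
To illustrate the idea (this is not the code from `cpu_posix.go`, just a minimal sketch under stated assumptions): every percentage is derived from the same pair of counter snapshots, so the total, the idle value and the detailed values cannot drift apart. The types `CPUTimes` and `Usage` and the function `Percentages` are hypothetical names. As in htop, iowait is counted as idle time and irq/softirq are added to system time; guest time is ignored here for brevity.

```go
package main

import "fmt"

// CPUTimes holds cumulative CPU time counters as found in /proc/stat.
// The type is made up for this sketch; it is not a gopsutil or agent type.
type CPUTimes struct {
	User, Nice, System, Idle, Iowait, Irq, Softirq, Steal float64
}

// total returns the sum of all counters of one snapshot.
func (t CPUTimes) total() float64 {
	return t.User + t.Nice + t.System + t.Idle + t.Iowait + t.Irq + t.Softirq + t.Steal
}

// Usage holds percentages that were all derived from one pair of snapshots.
type Usage struct {
	Busy, User, Nice, System, Idle, Iowait float64
}

// Percentages computes all values from the deltas between prev and cur.
// It returns a zero Usage if no time has passed between the snapshots.
func Percentages(prev, cur CPUTimes) Usage {
	totalDelta := cur.total() - prev.total()
	if totalDelta <= 0 {
		return Usage{}
	}
	pct := func(delta float64) float64 { return delta / totalDelta * 100 }

	idle := pct(cur.Idle - prev.Idle)
	iowait := pct(cur.Iowait - prev.Iowait)
	system := pct((cur.System + cur.Irq + cur.Softirq) - (prev.System + prev.Irq + prev.Softirq))

	return Usage{
		Busy:   100 - idle - iowait, // iowait is treated as idle time
		User:   pct(cur.User - prev.User),
		Nice:   pct(cur.Nice - prev.Nice),
		System: system,
		Idle:   idle,
		Iowait: iowait,
	}
}

func main() {
	// Two invented snapshots of one core, 100 ticks apart.
	prev := CPUTimes{User: 100, System: 50, Idle: 800, Iowait: 50}
	cur := CPUTimes{User: 103, System: 52, Idle: 894, Iowait: 51}

	u := Percentages(prev, cur)
	fmt.Printf("busy=%.1f user=%.1f system=%.1f idle=%.1f iowait=%.1f\n",
		u.Busy, u.User, u.System, u.Idle, u.Iowait)
}
```

Because `Busy` and `Idle` come from the same deltas, a core cannot be reported as 95% busy and 94% idle at the same time.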

Feel free to comment/reopen if you are still experiencing the same issue with version 3.0.9