performancecopilot / pcp

Performance Co-Pilot
https://pcp.io

pmseries: invalid SID is generated #1084

Open · andreasgerstmayr opened this issue 4 years ago

andreasgerstmayr commented 4 years ago

Setup: 3 nodes (running pmcd) + 1 collector node (running pmlogger + pmproxy + redis, pmlogger connects to the other nodes)

All running PCP 5.1.1-3
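For reference, the remote logging on the collector is set up roughly like the sketch below (paths and options written from memory; the actual control file entries may differ): one non-primary pmlogger per node in /etc/pcp/pmlogger/control.d/, with pmproxy on the collector discovering the archives and loading them into Redis.

$version=1.1
# Host         P?  S?  directory                      args
node1.local    n   n   PCP_ARCHIVE_DIR/node1.local    -r -T24h10m -c config.node1
node2.local    n   n   PCP_ARCHIVE_DIR/node2.local    -r -T24h10m -c config.node2
node3.local    n   n   PCP_ARCHIVE_DIR/node3.local    -r -T24h10m -c config.node3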

[vagrant@collector pmlogger]$ pmseries -a $(pmseries disk.dev.write_bytes)

270585f657a2d61504103f2d093ff0cb23342fc6
    PMID: 60.0.39
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: Kbyte
    Source: 6fcbbe42a3f6db942a08ffa1b972284670266186
    Metric: disk.dev.write_bytes
    inst [0 or "vda"] series 4a85e858fceb6477acde2b77e4f702f5977260b9
    inst [0 or "vda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","hostname":"node2.local","indom_name":"per disk","machineid":"28a6876d9ab54e71a3a046d2f36cc5a6"}

353820b21b1553ee00159ab6ed44a76a04348b04
    PMID: 60.0.39
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: Kbyte
    Source: ae072a02a41c16176c6337db60a8870145369755
    Metric: disk.dev.write_bytes
    inst [0 or "vda"] series 9a5b93a8e679787809501b82e4c3df9848f714d8
    inst [0 or "vda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","hostname":"node1.local","indom_name":"per disk","machineid":"a00fdef1932c47808f882e2506a50483"}

710e5b8c629d53ee8bcd96b570bab679fb9481c2
    PMID: 60.0.39
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: Kbyte
    Source: 19fb7ed79c6ab2ae759caa242427947d54b5ee99
    Metric: disk.dev.write_bytes
    inst [0 or "vda"] series f4d23dbc843d88ca38094d0ab4f8dbdfa99283c6
    inst [0 or "vda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","hostname":"node3.local","indom_name":"per disk","machineid":"629d38f3a66940d48805e9901416dc88"}

97ff84b72e097e02e2f78dfdd3dd4a9585f8269c
    PMID: 60.0.39
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: Kbyte
    Source: 1142716172cbf467b397ab916f07799381c2ebd8
    Metric: disk.dev.write_bytes
    inst [0 or "vda"] series a263678c1545089a08a3ab8ae09dd27d42fd9211
    inst [0 or "vda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","groupid":990,"hostname":"collector.local","indom_name":"per disk","machineid":"4ce998b207094627905735000844af59","userid":993}

Filtering based on the agent label, I get the following series:

[vagrant@collector pmlogger]$ pmseries -a $(pmseries 'disk.dev.write_bytes{agent=="linux"}')

27357197ff84b72e097e02e2f78dfdd3dd4a9585
    PMID: PM_ID_NULL
    Data Type: ???  InDom: unknown 0xffffffff
    Semantics: unknown  Units: unknown
    Source: unknown

f8269cb21b1553ee00159ab6ed44a76a04348b04
    PMID: PM_ID_NULL
    Data Type: ???  InDom: unknown 0xffffffff
    Semantics: unknown  Units: unknown
    Source: unknown

710e5b8c629d53ee8bcd96b570bab679fb9481c2
    PMID: 60.0.39
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: Kbyte
    Source: 19fb7ed79c6ab2ae759caa242427947d54b5ee99
    Metric: disk.dev.write_bytes
    inst [0 or "vda"] series f4d23dbc843d88ca38094d0ab4f8dbdfa99283c6
    inst [0 or "vda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","hostname":"node3.local","indom_name":"per disk","machineid":"629d38f3a66940d48805e9901416dc88"}

97ff84b72e097e02e2f78dfdd3dd4a9585f8269c
    PMID: 60.0.39
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: Kbyte
    Source: 1142716172cbf467b397ab916f07799381c2ebd8
    Metric: disk.dev.write_bytes
    inst [0 or "vda"] series a263678c1545089a08a3ab8ae09dd27d42fd9211
    inst [0 or "vda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","groupid":990,"hostname":"collector.local","indom_name":"per disk","machineid":"4ce998b207094627905735000844af59","userid":993}

All nodeX VMs have the same configuration (except the hostname). Any idea why the filtering works for node3.local but not for node1.local and node2.local?

andreasgerstmayr commented 4 years ago

FYI, I see a similar situation with the latest PCP from master:

$ pmseries -a $(pmseries disk.dev.read)

1da966685fbfa7b61f9e44c0a7c3e0fed6a387f4
    PMID: 60.0.4
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: count
    Source: e8d3bc6b62ea77a67278009a8ad5cc44d162b7a8
    Metric: disk.dev.read
    inst [0 or "nvme0n1"] series d566584c9425cf8db1fff9fd431df425ca3ab7f5
    inst [1 or "sda"] series cf7fec1925ede13041bff286d2e100e67eced184
    inst [0 or "nvme0n1"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","groupid":1001,"hostname":"agerstmayr-thinkpad","indom_name":"per disk","machineid":"18b2c288e7c54055bf296618861c6dc5","userid":1001}
    inst [1 or "sda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","groupid":1001,"hostname":"agerstmayr-thinkpad","indom_name":"per disk","machineid":"18b2c288e7c54055bf296618861c6dc5","userid":1001}

f87250c4ea0e5eca8ff2ca3b3044ba1a6c91a3d9
    PMID: 60.0.4
    Data Type: 64-bit unsigned int  InDom: 60.1 0xf000001
    Semantics: counter  Units: count
    Source: 2914f38f7bdcb7fb3ac0b822c98019248fd541fb
    Metric: disk.dev.read
    inst [0 or "nvme0n1"] series 7f3afb6f41e53792b18e52bcec26fdfa2899fa58
    inst [1 or "sda"] series 0aeab8b239522ab0640577ed788cc601fc640266
    inst [0 or "nvme0n1"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","groupid":976,"hostname":"agerstmayr-thinkpad","indom_name":"per disk","machineid":"6dabb302d60b402dabcc13dc4fd0fab8","userid":978}
    inst [1 or "sda"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","groupid":976,"hostname":"agerstmayr-thinkpad","indom_name":"per disk","machineid":"6dabb302d60b402dabcc13dc4fd0fab8","userid":978}

Btw, I have no idea how userid 1001 appeared here (that's my "pcptestuser" for testing authentication; I can't recall that user ever starting the pmcd or pmlogger daemon).

$ pmseries -a $(pmseries 'disk.dev.read{hostname=="agerstmayr-thinkpad"}')

1df87250c4ea0e5eca8ff2ca3b3044ba1a6c91a3
    PMID: PM_ID_NULL
    Data Type: ???  InDom: unknown 0xffffffff
    Semantics: unknown  Units: unknown
    Source: unknown

d97250c4ea0e5eca8ff2ca3b3044ba1a6c91a3d9
    PMID: PM_ID_NULL
    Data Type: ???  InDom: unknown 0xffffffff
    Semantics: unknown  Units: unknown
    Source: unknown

Did the recent changes to the pmseries query language break the filtering?

natoscott commented 4 years ago

@andreasgerstmayr looks like your 'machineid' label has changed?! The userid/groupid change should not cause an issue - that label is tagged as "optional" and so not used in hash calculations - but the machineid label change is probably a part of the problem.
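(If I remember right the machineid label comes from /etc/machine-id, so a quick check on each node - and inside any containers involved - is to compare that file against the label values above, e.g. on node2:

$ cat /etc/machine-id
28a6876d9ab54e71a3a046d2f36cc5a6

If those no longer line up with what's stored in Redis, the old series were hashed with a different machineid.)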

FWIW, I'm not seeing any issues here. We also have QA tests that verify hash calculation consistency and many other aspects of pmseries; to the best of my knowledge, the recent language changes are not causing your issue here.

I do wonder whether at some point we're going to need a Redis key cleaning/checking tool that can go in and look for disconnected keys, series missing labels, and so on. Hmm, big job that one.
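(Until then, a rough manual check - just a sketch, the exact pcp:* key layout varies between PCP versions - is to scan for any keys that reference a suspect SID at all:

$ redis-cli --scan --pattern 'pcp:*' | grep -c 27357197ff84b72e097e02e2f78dfdd3dd4a9585

A count of zero for a SID that pmseries still returns would point at a dangling reference somewhere.)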

andreasgerstmayr commented 4 years ago

> @andreasgerstmayr looks like your 'machineid' label has changed?! The userid/groupid change should not cause an issue - that label is tagged as "optional" and so not used in hash calculations - but the machineid label change is probably a part of the problem.

Jan pointed out that the pcp user inside the pcp-container also has UID 1001 - I probably ran the container with --network=host, and the pmlogger inside the container wrote to the Redis database on the host. So this mystery is solved and works as expected; I got confused because UID 1001 is pcptestuser on my local system and didn't think of the container user, which has the same UID.

> FWIW, I'm not seeing any issues here. We also have QA tests that verify hash calculation consistency and many other aspects of pmseries; to the best of my knowledge, the recent language changes are not causing your issue here.

The main issue of this bug is that in both cases I can see all series, but when I filter them I get wrong series back.

The first example shows 4 series, all of which have the agent: linux label. However, the filter returns only two valid series. That's definitely an issue: a pmseries query that should match all of them returns only 2 out of the 4 valid series. The other two SIDs in the result are apparently invalid and have no metadata attached, even though they should carry the same metadata as the valid ones.
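To narrow down which piece of metadata is missing, the invalid SIDs could also be queried per metadata type (a sketch - see pmseries(1) for the exact option letters):

[vagrant@collector pmlogger]$ pmseries -m 27357197ff84b72e097e02e2f78dfdd3dd4a9585
[vagrant@collector pmlogger]$ pmseries -l 27357197ff84b72e097e02e2f78dfdd3dd4a9585
[vagrant@collector pmlogger]$ pmseries -d 27357197ff84b72e097e02e2f78dfdd3dd4a9585

Presumably all of these come back empty, matching the -a output above.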

Some more context:

[vagrant@collector ~]$ pmseries 'disk.dev.write_bytes[count:1]'

270585f657a2d61504103f2d093ff0cb23342fc6
    [Thu Oct  8 08:35:49.488355000 2020] 2216138 4a85e858fceb6477acde2b77e4f702f5977260b9

353820b21b1553ee00159ab6ed44a76a04348b04
    [Thu Oct  8 08:35:46.461776000 2020] 3090841 9a5b93a8e679787809501b82e4c3df9848f714d8

710e5b8c629d53ee8bcd96b570bab679fb9481c2
    [Thu Oct  8 08:35:52.668167000 2020] 2238512 f4d23dbc843d88ca38094d0ab4f8dbdfa99283c6

97ff84b72e097e02e2f78dfdd3dd4a9585f8269c
    [Thu Oct  8 08:36:34.493999000 2020] 576251240 a263678c1545089a08a3ab8ae09dd27d42fd9211

This shows that all four series indeed have values.

However, filtering based on the agent, which should match all series:

[vagrant@collector ~]$ pmseries 'disk.dev.write_bytes{agent=="linux"}[count:1]'

710e5b8c629d53ee8bcd96b570bab679fb9481c2
    [Thu Oct  8 08:35:52.668167000 2020] 2238512 f4d23dbc843d88ca38094d0ab4f8dbdfa99283c6

97ff84b72e097e02e2f78dfdd3dd4a9585f8269c
    [Thu Oct  8 08:35:54.466130000 2020] 576016286 a263678c1545089a08a3ab8ae09dd27d42fd9211

This returns only two series.

All of node1-node3 have exactly the same configuration (https://github.com/andreasgerstmayr/pcp-and-grafana-demo).