prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0
11k stars 2.34k forks source link

caller=collector.go:161 msg="collector failed" <...> err="couldn't get dbus connection: write unix @->/run/dbus/> #1970

Open soccer193 opened 3 years ago

soccer193 commented 3 years ago

Host operating system: output of uname -a

$ uname -a
Linux relay-77c3d3 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.0.1 (branch: HEAD, revision: 3715be6ae899f2a9b9dbfd9c39f3e09a7bd4559f)
  build user:       root@1f76dbbcfa55
  build date:       20200616-12:44:12
  go version:       go1.14.4

node_exporter command line flags

     CGroup: /system.slice/node_exporter.service
             └─693 /usr/local/bin/node_exporter --collector.systemd --collector.textfile --collector.textfile.directory=/var/lib/node_exporter --web.listen-address=localhost:9100 --web.telemetry-path=/metrics

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Left the node_exporter running for four days.

What did you expect to see?

Continued operation, no errors or crashes.

What did you see instead?

Feb 15 11:16:52 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:16:25.113Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=43.302483043 err="couldn't get dbus connection: write unix @->/run/dbus/>
Feb 15 11:17:10 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:16:43.226Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=61.324678937 err="couldn't get dbus connection: dbus: connection closed >

Then the system appeared to hang. A jump in the count of go_goroutines was reported by this node_exporter before the hang. image

Closer: image

Feb 15 11:19:43 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:19:30.050Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=212.803465432 err="couldn't get dbus connection: write unix @->/run/dbus>
Feb 15 11:25:19 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:21:52.326Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=117.599559737 err="couldn't get dbus connection: write unix @: sendmsg: >
Feb 15 11:26:02 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:26:02.471Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=152.451322805 err="couldn't get dbus connection: write unix @: sendmsg: >
Feb 15 11:31:14 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:28:43.803Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=248.305289209 err="couldn't get dbus connection: write unix @: sendmsg: >
Feb 15 11:31:29 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:29:00.829Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=317.861842938 err="couldn't get dbus connection: write unix @: sendmsg: >
Feb 15 11:31:31 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:28:38.707Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=265.530885084 err="couldn't get dbus connection: write unix @: sendmsg: >
Feb 15 11:31:55 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:29:04.703Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=274.001391161 err="couldn't get dbus connection: write unix @: sendmsg: >
Feb 15 11:33:19 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:32:04.979Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=0 old_value=233602.62 n>
Feb 15 11:36:15 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:33:38.687Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=1 old_value=233653.46 n>
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=error ts=2021-02-15T00:41:45.243Z caller=collector.go:161 msg="collector failed" name=systemd duration_seconds=199.064034172 err="couldn't get dbus connection: write unix @: sendmsg: >
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.468Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=0 old_value=233602.63 n>
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.595Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=1 old_value=233653.47 n>
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.597Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=0 old_value=233603.01 n>
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.597Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=1 old_value=233653.74 n>
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.597Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=0 old_value=233603.01 n>
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.597Z caller=cpu_linux.go:247 collector=cpu msg="CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats" cpu=1 old_value=233653.74 n>
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.877Z caller=cpu_linux.go:255 collector=cpu msg="CPU User counter jumped backwards" cpu=0 old_value=71977.62 new_value=71977.58
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.877Z caller=cpu_linux.go:267 collector=cpu msg="CPU System counter jumped backwards" cpu=0 old_value=13458.96 new_value=13458.91
Feb 15 11:41:45 relay-77c3d3 node_exporter[693]: level=warn ts=2021-02-15T00:41:45.877Z caller=cpu_linux.go:255 collector=cpu msg="CPU User counter jumped backwards" cpu=1 old_value=72021.78 new_value=72021.69

The system unhung at 11:41 when the process was reaped by oom-killer.

SuperQ commented 3 years ago

That looks like systemd stopped responding.

Can you give more information about the system this is running on? Hardware? VM Hypervisor type?

soccer193 commented 3 years ago

Hi SuperQ,

It's a DigitalOcean droplet running Ubuntu 20.04. I've been running my application workload pretty close to the memory capacity of the machine (4GB total), so that may complicate the analysis.

Running sudo journalctl -u systemd* returned nothing related to the event.

$ systemctl --version
systemd 245 (245.4-4ubuntu3.2)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
dmesg output for the killed node_exporter ``` [Feb15 11:41] node_exporter invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [ +0.000004] CPU: 0 PID: 750 Comm: node_exporter Not tainted 5.4.0-65-generic #73-Ubuntu [ +0.000000] Hardware name: DigitalOcean Droplet, BIOS 20171212 12/12/2017 [ +0.000001] Call Trace: [ +0.000007] dump_stack+0x6d/0x9a [ +0.000003] dump_header+0x4f/0x1eb [ +0.000001] oom_kill_process.cold+0xb/0x10 [ +0.000001] out_of_memory.part.0+0x1df/0x3d0 [ +0.000001] out_of_memory+0x6d/0xd0 [ +0.000003] __alloc_pages_slowpath+0xd5e/0xe50 [ +0.000002] __alloc_pages_nodemask+0x2d0/0x320 [ +0.000002] alloc_pages_current+0x87/0xe0 [ +0.000003] __page_cache_alloc+0x72/0x90 [ +0.000001] pagecache_get_page+0xbf/0x300 [ +0.000002] filemap_fault+0x6b2/0xa50 [ +0.000003] ? unlock_page_memcg+0x12/0x20 [ +0.000003] ? __switch_to_asm+0x34/0x70 [ +0.000002] ? xas_load+0xd/0x80 [ +0.000003] ? xas_find+0x17f/0x1c0 [ +0.000001] ? filemap_map_pages+0x24c/0x380 [ +0.000003] ext4_filemap_fault+0x32/0x46 [ +0.000002] __do_fault+0x3c/0x130 [ +0.000001] do_fault+0x24b/0x640 [ +0.000003] ? __switch_to+0x3b0/0x470 [ +0.000001] __handle_mm_fault+0x4c5/0x7a0 [ +0.000002] handle_mm_fault+0xca/0x200 [ +0.000003] do_user_addr_fault+0x1f9/0x450 [ +0.000002] ? hrtimer_nanosleep+0xc2/0x1a0 [ +0.000001] __do_page_fault+0x58/0x90 [ +0.000002] do_page_fault+0x2c/0xe0 [ +0.000002] do_async_page_fault+0x39/0x70 [ +0.000001] async_page_fault+0x34/0x40 [ +0.000002] RIP: 0033:0x46833d [ +0.000006] Code: Bad RIP value. [ +0.000001] RSP: 002b:000000c00005df18 EFLAGS: 00010202 [ +0.000001] RAX: 0000000000000000 RBX: 0000000000004e20 RCX: 000000000046833d [ +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c00005df18 [ +0.000000] RBP: 000000c00005df28 R08: 00000000032e05af R09: 000000c00005df18 [ +0.000001] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000439750 [ +0.000000] R13: 0000000000000000 R14: 0000000000cb37f8 R15: 0000000000000000 [ +0.000002] Mem-Info: [ +0.000003] active_anon:954151 inactive_anon:41 isolated_anon:0 active_file:4 inactive_file:14 isolated_file:0 unevictable:4623 dirty:0 writeback:0 unstable:0 slab_reclaimable:6386 slab_unreclaimable:12571 mapped:2081 shmem:241 pagetables:2520 bounce:0 free:21108 free_pcp:967 free_cma:0 [ +0.000002] Node 0 active_anon:3816604kB inactive_anon:164kB active_file:16kB inactive_file:56kB unevictable:18492kB isolated(anon):0kB isolated(file):0kB mapped:8324kB dirty:0kB writeback:0kB shmem:964kB shmem_thp: 0kB shmem_pmdmapped:> [ +0.000001] Node 0 DMA free:15788kB min:268kB low:332kB high:396kB active_anon:120kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB paget> [ +0.000002] lowmem_reserve[]: 0 3446 3887 3887 3887 [ +0.000001] Node 0 DMA32 free:61200kB min:59676kB low:74592kB high:89508kB active_anon:3474092kB inactive_anon:24kB active_file:296kB inactive_file:244kB unevictable:16kB writepending:0kB present:3653612kB managed:3563252kB mlocked:16kB> [ +0.000003] lowmem_reserve[]: 0 0 440 440 440 [ +0.000001] Node 0 Normal free:7444kB min:7636kB low:9544kB high:11452kB active_anon:342392kB inactive_anon:140kB active_file:528kB inactive_file:636kB unevictable:18476kB writepending:0kB present:524288kB managed:451552kB mlocked:18476> [ +0.000002] lowmem_reserve[]: 0 0 0 0 0 [ +0.000001] Node 0 DMA: 1*4kB (U) 1*8kB (M) 0*16kB 1*32kB (U) 2*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 2*1024kB (UM) 0*2048kB 3*4096kB (ME) = 15788kB [ +0.000004] Node 0 DMA32: 440*4kB (UME) 312*8kB (UME) 291*16kB (UME) 164*32kB (UME) 123*64kB (UME) 62*128kB (UME) 15*256kB (ME) 1*512kB (U) 3*1024kB (UM) 0*2048kB 6*4096kB (ME) = 61968kB [ +0.000004] Node 0 Normal: 117*4kB (UME) 344*8kB (UME) 199*16kB (UME) 36*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7556kB [ +0.000005] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ +0.000000] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ +0.000001] 2342 total pagecache pages [ +0.000001] 0 pages in swap cache [ +0.000000] Swap cache stats: add 0, delete 0, find 0/0 [ +0.000001] Free swap = 0kB [ +0.000000] Total swap = 0kB [ +0.000000] 1048473 pages RAM [ +0.000001] 0 pages HighMem/MovableOnly [ +0.000000] 40795 pages reserved [ +0.000000] 0 pages cma reserved [ +0.000001] 0 pages hwpoisoned [ +0.000000] Tasks state (memory values in pages): [ +0.000003] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [ +0.000006] [ 349] 0 349 81577 851 679936 0 -250 systemd-journal [ +0.000002] [ 460] 0 460 70033 4484 90112 0 -1000 multipathd [ +0.000001] [ 500] 102 500 22599 766 77824 0 0 systemd-timesyn [ +0.000001] [ 544] 100 544 4638 979 73728 0 0 systemd-network [ +0.000002] [ 546] 101 546 6047 1664 81920 0 0 systemd-resolve [ +0.000001] [ 584] 0 584 4732 850 73728 0 -1000 systemd-udevd [ +0.000001] [ 673] 0 673 60320 511 98304 0 0 accounts-daemon [ +0.000001] [ 679] 0 679 2134 567 57344 0 0 cron [ +0.000001] [ 682] 103 682 1952 719 45056 0 -900 dbus-daemon [ +0.000002] [ 684] 997 684 308860 1453 208896 0 -900 do-agent [ +0.000001] [ 690] 0 690 20472 637 57344 0 0 irqbalance [ +0.000002] [ 692] 0 692 7315 2721 90112 0 0 networkd-dispat [ +0.000001] [ 693] 996 693 179535 5054 180224 0 0 node_exporter [ +0.000001] [ 696] 104 696 56083 511 81920 0 0 rsyslogd [ +0.000001] [ 698] 0 698 217590 3847 249856 0 -900 snapd [ +0.000002] [ 700] 0 700 4247 936 73728 0 0 systemd-logind [ +0.000001] [ 706] 0 706 948 522 49152 0 0 atd [ +0.000001] [ 711] 0 711 3042 637 57344 0 -1000 sshd [ +0.000001] [ 743] 0 743 1838 449 53248 0 0 agetty [ +0.000002] [ 761] 0 761 1457 382 49152 0 0 agetty [ +0.000001] [ 764] 0 764 27018 2743 102400 0 0 unattended-upgr [ +0.000002] [ 781] 0 781 59103 760 98304 0 0 polkitd [ +0.000001] [ 872] 0 872 3445 1036 65536 0 0 sshd [ +0.000001] [ 875] 1000 875 4645 1161 73728 0 0 systemd [ +0.000002] [ 876] 1000 876 25856 879 90112 0 0 (sd-pam) [ +0.000001] [ 970] 1000 970 3480 625 65536 0 0 sshd [ +0.000001] [ 8830] 1000 8830 2154 467 45056 0 0 start-node.sh [ +0.000001] [ 8845] 1000 8845 268444391 926764 7520256 0 0 cardano-node [ +0.000002] [ 13877] 0 13877 652 121 40960 0 0 apt.systemd.dai [ +0.000001] [ 13882] 0 13882 652 381 45056 0 0 apt.systemd.dai [ +0.000001] [ 13890] 0 13890 15629 2636 163840 0 0 apt-get [ +0.000002] [ 13893] 0 13893 3443 1026 65536 0 0 sshd [ +0.000001] [ 13897] 0 13897 652 248 40960 0 0 sh [ +0.000001] [ 13898] 0 13898 624 358 36864 0 0 run-parts [ +0.000001] [ 13948] 0 13948 6525 2324 86016 0 0 check-new-relea [ +0.000001] [ 13950] 0 13950 652 370 45056 0 0 update-motd-hwe [ +0.000001] [ 13971] 0 13971 6468 2179 86016 0 0 hwe-support-sta [ +0.000001] [ 13979] 0 13979 2962 87 65536 0 0 sshd [ +0.000001] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cardano.service,task=cardano-node,pid=8845,uid=1000 [ +0.000031] Out of memory: Killed process 8845 (cardano-node) total-vm:1073777564kB, anon-rss:3707056kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:7344kB oom_score_adj:0 [ +0.188708] oom_reaper: reaped process 8845 (cardano-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB ```