prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Netclass collector bug #3027

Closed: pimpmyname2 closed this issue 5 months ago

pimpmyname2 commented 5 months ago

Host operating system: output of uname -a

Linux gameserver01 6.8.10-x64v4-xanmod1 #0~20240517.g2e7da9e SMP PREEMPT_DYNAMIC Fri May 17 18:22:26 UTC x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

Node Exporter version 1.8.1

node_exporter command line flags

docker compose:

node_exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    network_mode: host
    cpuset: "0,24"
    pid: host
    volumes:
      - '/:/host:ro,rslave'
    command:
      - '--path.rootfs=/host'
      - '--no-collector.netclass'
      - '--no-collector.netstat'
      - '--no-collector.softnet'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
      - '--web.listen-address=10.69.69.150:9100'

node_exporter log output

docker logs prometheus-docker-node_exporter-1
ts=2024-05-24T11:46:59.664Z caller=node_exporter.go:193 level=info msg="Starting node_exporter" version="(version=1.8.1, branch=HEAD, revision=400c3979931613db930ea035f39ce7b377cdbb5b)"
ts=2024-05-24T11:46:59.665Z caller=node_exporter.go:194 level=info msg="Build context" build_context="(go=go1.22.3, platform=linux/amd64, user=root@7afbff271a3f, date=20240521-18:36:22, tags=unknown)"
ts=2024-05-24T11:46:59.666Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(sys|proc|dev|host|etc)($|/)
ts=2024-05-24T11:46:59.666Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-05-24T11:46:59.666Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2024-05-24T11:46:59.666Z caller=diskstats_linux.go:265 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=bcache
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=bonding
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=btrfs
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=conntrack
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=cpufreq
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=diskstats
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=dmi
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=edac
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=entropy
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=fibrechannel
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=filefd
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=filesystem
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=hwmon
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=infiniband
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=ipvs
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=loadavg
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=mdadm
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=meminfo
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=netdev
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=nfs
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=nfsd
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=nvme
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=os
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=powersupplyclass
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=pressure
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=rapl
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=schedstat
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=selinux
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=sockstat
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=stat
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=tapestats
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=textfile
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=thermal_zone
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=time
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=timex
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=udp_queues
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=uname
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=vmstat
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=watchdog
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=xfs
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=zfs
ts=2024-05-24T11:46:59.669Z caller=tls_config.go:313 level=info msg="Listening on" address=10.69.69.150:9100
ts=2024-05-24T11:46:59.669Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=10.69.69.150:9100
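
When comparing the flags in the compose file against what the exporter actually loaded, it can help to pull the collector names out of the startup log. The following is a minimal Python sketch, assuming the logfmt layout shown in the log above (one `collector=<name>` line per enabled collector, emitted from `node_exporter.go`). The function name `enabled_collectors` is hypothetical, and the sample is a shortened excerpt of the log above.

```python
import re

# Matches only the per-collector lines that follow msg="Enabled collectors".
# Lines from other callers (e.g. filesystem_common.go) also carry collector=...
# but are followed by msg=..., so they are intentionally not matched.
COLLECTOR_LINE = re.compile(
    r"caller=node_exporter\.go:\d+ level=info collector=(\S+)\s*$"
)


def enabled_collectors(log_text):
    """Return the enabled collector names in the order they were logged."""
    names = []
    for line in log_text.splitlines():
        match = COLLECTOR_LINE.search(line)
        if match:
            names.append(match.group(1))
    return names


# Shortened excerpt of the startup log from this report.
sample = """\
ts=2024-05-24T11:46:59.666Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(sys|proc|dev|host|etc)($|/)
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=netdev
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=nfs
ts=2024-05-24T11:46:59.667Z caller=node_exporter.go:118 level=info collector=nfsd
ts=2024-05-24T11:46:59.669Z caller=tls_config.go:313 level=info msg="Listening on" address=10.69.69.150:9100
"""

collectors = enabled_collectors(sample)
for name in ("netclass", "netstat", "softnet", "nfsd"):
    state = "enabled" if name in collectors else "not enabled"
    print(f"{name}: {state}")
```

Run against the full log (for example, the saved output of `docker logs`), this lists every collector the exporter reports as enabled, which makes it easy to confirm that the `--no-collector.*` flags took effect.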

Are you running node_exporter in Docker?

Yes.

What did you do that produced an error?

This happens when I start the container.

What did you expect to see?

Before updating the network adapter firmware, everything was fine.

What did you see instead?

The dmesg output below could be relevant.

dmesg output

[   56.578872] BUG: unable to handle page fault for address: 000000000002bbe0
[   56.578897] #PF: supervisor write access in kernel mode
[   56.578905] #PF: error_code(0x0002) - not-present page
[   56.578911] PGD 10aae5067 P4D 10aae5067 PUD 11a46c067 PMD 0 
[   56.578922] Oops: 0002 [#1] PREEMPT SMP NOPTI
[   56.578934] CPU: 0 PID: 2130 Comm: node_exporter Not tainted 6.8.10-x64v4-xanmod1 #0~20240517.g2e7da9e
[   56.578948] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 02/22/2024
[   56.578959] RIP: 0010:native_queued_spin_lock_slowpath+0x191/0x1d0
[   56.578975] Code: 94 0f ae e8 8b 02 85 c0 74 f1 eb f5 c1 ef 12 83 e0 03 83 ef 01 48 c1 e0 05 48 63 ff 48 05 80 bb 02 00 48 03 04 fd e0 9c 67 8f <48> 89 08 8b 41 08 85 c0 75 0a 0f ae e8 8b 41 08 85 c0 74 f6 48 8b
[   56.579001] RSP: 0018:ffffb63e6636baf0 EFLAGS: 00010002
[   56.579013] RAX: 000000000002bbe0 RBX: 0000000000000246 RCX: ffff99c3ff62bb80
[   56.579027] RDX: ffffffffc0cb78e0 RSI: 0000000000040000 RDI: 0000000000003031
[   56.579041] RBP: ffffffffc0cb78e0 R08: 0000000000040000 R09: 0000000000000000
[   56.579055] R10: ffff99a5706da000 R11: ffff99c48b42f8c0 R12: ffffffffc0cb75e0
[   56.579069] R13: ffffb63e6636bbd0 R14: 0000000000000001 R15: ffffb63e6636bcf0
[   56.579083] FS:  000000c00005e898(0000) GS:ffff99c3ff600000(0000) knlGS:0000000000000000
[   56.579100] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   56.579113] CR2: 000000000002bbe0 CR3: 000000010d584004 CR4: 00000000007706f0
[   56.579127] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   56.579140] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   56.579152] PKRU: 55555554
[   56.579160] Call Trace:
[   56.579169]  <TASK>
[   56.579178]  ? __die+0x1a/0x60
[   56.579195]  ? page_fault_oops+0x148/0x490
[   56.579210]  ? exc_page_fault+0x67/0xb0
[   56.579223]  ? asm_exc_page_fault+0x22/0x30
[   56.579240]  ? native_queued_spin_lock_slowpath+0x191/0x1d0
[   56.579255]  _raw_spin_lock_irqsave+0x34/0x40
[   56.579267]  __percpu_counter_sum+0xc/0x70
[   56.579281]  nfsd_show+0x4c/0x1d0 [nfsd]
[   56.579311]  seq_read_iter+0x117/0x470
[   56.579326]  seq_read+0xf9/0x140
[   56.579336]  proc_reg_read+0x51/0xa0
[   56.579350]  vfs_read+0xa3/0x340
[   56.579362]  ? __seccomp_filter+0x316/0x4d0
[   56.579379]  ksys_read+0x5e/0xe0
[   56.579389]  do_syscall_64+0x6c/0x110
[   56.579403]  ? __pte_offset_map+0x12/0x170
[   56.579415]  ? __mod_memcg_lruvec_state+0x8e/0x100
[   56.579430]  ? __lruvec_stat_mod_folio+0x62/0xa0
[   56.579443]  ? set_ptes.isra.0+0x28/0x90
[   56.579453]  ? do_anonymous_page+0x343/0x6c0
[   56.579465]  ? pmdp_collapse_flush+0x50/0x50
[   56.579800]  ? __handle_mm_fault+0xb39/0xe20
[   56.580050]  ? syscall_exit_to_user_mode+0x8b/0x180
[   56.580294]  ? __count_memcg_events+0x44/0xb0
[   56.580529]  ? count_memcg_events.constprop.0+0x1a/0x30
[   56.580755]  ? handle_mm_fault+0x95/0x320
[   56.580985]  ? do_user_addr_fault+0x2f3/0x670
[   56.581208]  ? exc_page_fault+0x67/0xb0
[   56.581435]  ? irqentry_exit_to_user_mode+0x5f/0x140
[   56.581651]  entry_SYSCALL_64_after_hwframe+0x6d/0x75
[   56.581871] RIP: 0033:0x40708e
[   56.582077] Code: 48 83 ec 38 e8 13 00 00 00 48 83 c4 38 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 49 89 f2 48 89 fa 48 89 ce 48 89 df 0f 05 <48> 3d 01 f0 ff ff 76 15 48 f7 d8 48 89 c1 48 c7 c0 ff ff ff ff 48
[   56.582507] RSP: 002b:000000c000143208 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
[   56.582719] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 000000000040708e
[   56.582935] RDX: 0000000000001000 RSI: 000000c00027f000 RDI: 0000000000000008
[   56.583151] RBP: 000000c000143248 R08: 0000000000000000 R09: 0000000000000000
[   56.583362] R10: 0000000000000000 R11: 0000000000000202 R12: 000000c000143378
[   56.583571] R13: 0000000000000000 R14: 000000c000080c40 R15: 3fffffffffffffff
[   56.583787]  </TASK>
[   56.583988] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc ipmi_ssif ipt_REJECT nf_reject_ipv4 xt_multiport nft_compat nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common overlay isst_if_common binfmt_misc nls_iso8859_1 nfit x86_pkg_temp_thermal intel_powerclamp hpilo kvm_intel kvm irqbypass rapl mei_me intel_cstate ioatdma mei intel_pch_thermal dca acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler cdc_eem acpi_tad usbnet mii mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua coretemp nfsd auth_rpcgss msr nfs_acl lockd grace efi_pstore sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 ses enclosure crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3
[   56.584045]  sha1_ssse3 i40e smartpqi nvme scsi_transport_sas nvme_core mgag200 nvme_auth lpc_ich xhci_pci i2c_algo_bit xhci_pci_renesas wmi aesni_intel crypto_simd cryptd
[   56.586850] CR2: 000000000002bbe0
[   56.587136] ---[ end trace 0000000000000000 ]---
[   56.698881] RIP: 0010:native_queued_spin_lock_slowpath+0x191/0x1d0
[   56.699257] Code: 94 0f ae e8 8b 02 85 c0 74 f1 eb f5 c1 ef 12 83 e0 03 83 ef 01 48 c1 e0 05 48 63 ff 48 05 80 bb 02 00 48 03 04 fd e0 9c 67 8f <48> 89 08 8b 41 08 85 c0 75 0a 0f ae e8 8b 41 08 85 c0 74 f6 48 8b
[   56.699883] RSP: 0018:ffffb63e6636baf0 EFLAGS: 00010002
[   56.700196] RAX: 000000000002bbe0 RBX: 0000000000000246 RCX: ffff99c3ff62bb80
[   56.700513] RDX: ffffffffc0cb78e0 RSI: 0000000000040000 RDI: 0000000000003031
[   56.700831] RBP: ffffffffc0cb78e0 R08: 0000000000040000 R09: 0000000000000000
[   56.701149] R10: ffff99a5706da000 R11: ffff99c48b42f8c0 R12: ffffffffc0cb75e0
[   56.701469] R13: ffffb63e6636bbd0 R14: 0000000000000001 R15: ffffb63e6636bcf0
[   56.701801] FS:  000000c00005e898(0000) GS:ffff99c3ff600000(0000) knlGS:0000000000000000
[   56.702127] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   56.702464] CR2: 000000000002bbe0 CR3: 000000010d584004 CR4: 00000000007706f0
[   56.702795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   56.703137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   56.703468] PKRU: 55555554
[   56.703807] note: node_exporter[2130] exited with irqs disabled
[   56.704183] note: node_exporter[2130] exited with preempt_count 1
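
In a kernel call trace, frames prefixed with `?` are ones the unwinder could not verify, so the unprefixed frames give the more trustworthy call path. The following is a minimal Python sketch for extracting those frames, assuming the dmesg line layout shown above. The function name `reliable_frames` is hypothetical, and the sample is an excerpt of the trace above. It only extracts frames; it does not diagnose the fault.

```python
import re

# A frame line looks like: "[   56.579281]  nfsd_show+0x4c/0x1d0 [nfsd]".
# A leading "? " marks a frame the unwinder could not verify.
# A trailing "[name]" is the kernel module the symbol belongs to, if any.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<unreliable>\? )?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def reliable_frames(dmesg_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in dmesg_text.splitlines():
        match = FRAME.match(line)
        if match and not match.group("unreliable"):
            frames.append((match.group("symbol"), match.group("module")))
    return frames


# Excerpt of the call trace from this report.
sample = """\
[   56.579240]  ? native_queued_spin_lock_slowpath+0x191/0x1d0
[   56.579255]  _raw_spin_lock_irqsave+0x34/0x40
[   56.579267]  __percpu_counter_sum+0xc/0x70
[   56.579281]  nfsd_show+0x4c/0x1d0 [nfsd]
[   56.579311]  seq_read_iter+0x117/0x470
[   56.579326]  seq_read+0xf9/0x140
[   56.579336]  proc_reg_read+0x51/0xa0
[   56.579350]  vfs_read+0xa3/0x340
[   56.579362]  ? __seccomp_filter+0x316/0x4d0
"""

frames = reliable_frames(sample)
for symbol, module in frames:
    suffix = f" (module: {module})" if module else ""
    print(f"{symbol}{suffix}")
```

Frames that carry a module name are the ones that come from loadable modules rather than the core kernel, which is usually the first thing to note when deciding where to report a fault.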

The only workaround I have found is adding '--no-collector.netclass' to my docker compose file. When this happens, the server freezes completely and I have to reset it. I recently updated the firmware on the network adapter "HPE Ethernet 10Gb 2-port 562FLR-SFP+ Adpt" from 11.1.4 to 11.1.5, which I'm pretty sure is the cause. Any ideas?

discordianfish commented 5 months ago

Looks like a bug in the firmware, nothing we can do about it.

pimpmyname2 commented 5 months ago

Alright, I'll try talking to HPE.