Closed: johanburati closed this issue 6 months ago
Found one occurrence of the issue with OmsAgentForLinux-1.11.15 on a RHEL 7.5 VM (kernel 3.10.0-862.34.2.el7.x86_64):
Sep 3 13:15:04 azurevm02 kernel: BUG: Bad page state in process omicli pfn:3643ff
Sep 4 21:40:50 azurevm02 kernel: BUG: Bad page state in process python pfn:3643ff
Sep 4 21:52:21 azurevm02 kernel: BUG: Bad page state: 762 messages suppressed
The kernel bug report also dates from Aug 2018, so the issue is not limited to OmsAgentForLinux-1.12.5, but there have been many more occurrences of the bug since that release.
For now, if your VM is hitting this bug, it is recommended that you downgrade to OmsAgentForLinux-1.11.15.
You can do so using the Azure CLI:
az vm extension delete --name OMSAgentForLinux --resource-group myResourceGroup --vm-name myVM
az vm extension set --name OMSAgentForLinux --publisher Microsoft.EnterpriseCloud.Monitoring --version 1.11.15 --protected-settings '{"workspaceKey":"myWorkspaceKey"}' --settings '{"workspaceId":"myWorkspaceId"}' --resource-group myResourceGroup --vm-name myVM --no-auto-upgrade
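To gauge how often a VM is hitting this, a quick log scan can count "Bad page state" hits per process. This is just a sketch: the log path (/var/log/messages on RHEL) and the exact message format are assumptions based on the traces above.

```shell
#!/bin/sh
# Sketch: summarize "Bad page state" hits per process from a kernel log.
# Assumes the RHEL syslog format shown in this thread.
bad_page_hits() {
  grep -o 'Bad page state in process [^ ]*' "$1" \
    | awk '{print $NF}' | sort | uniq -c | sort -rn
}
# Example: bad_page_hits /var/log/messages
```

In the reports above this would show omicli as the most frequent offender, matching the observation that the bug always seems to start with omicli.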
Confirmed, I think we are facing exactly the same thing on a RHEL 7.7 VM (kernel 3.10.0-1062.4.1.el7.x86_64):
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: BUG: Bad page state in process omicli pfn:2663ff
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: BUG: Bad page state in process omicli pfn:2663ff
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page:fffffdeec998ffc0 count:0 mapcount:1 mapping: (null) index:0x7fbd75fff
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page flags: 0x2fffff0008001c(referenced|uptodate|dirty|swapbacked)
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page dumped because: nonzero mapcount
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: Modules linked in: macsec tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack libcrc32c iptable_security joydev sb_edac iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_piix4 hv_utils ptp pps_core hv_balloon sg ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi hv_storvsc scsi_transport_fc hv_netvsc hyperv_keyboard hid_hyperv scsi_tgt ata_piix crct10dif_pclmul hyperv_fb crct10dif_common libata crc32c_intel hv_vmbus floppy serio_raw dm_mirror dm_region_hash dm_log dm_mod
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: CPU: 11 PID: 74085 Comm: omicli Tainted: G B ------------ T 3.10.0-1062.4.1.el7.x86_64 #1
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: Call Trace:
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f778ba4>] dump_stack+0x19/0x1b
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f773f07>] bad_page.part.75+0xdc/0xf9
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1c48ad>] free_pages_prepare+0x1ad/0x1d0
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1c5304>] free_hot_cold_page+0x74/0x160
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1ca6d3>] __put_single_page+0x23/0x30
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1ca725>] put_page+0x45/0x60
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1102bd>] get_futex_key+0xdd/0x2c0
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f11060a>] futex_wake+0x6a/0x180
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1c54b5>] ? __free_pages+0x25/0x30
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1132fa>] do_futex+0x12a/0x5a0
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0de5e4>] ? task_cputime_scaled+0x44/0x80
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1137f0>] SyS_futex+0x80/0x190
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f113110>] ? exit_robust_list+0x60/0x120
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0975c7>] mm_release+0xf7/0x140
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0a0e9e>] do_exit+0x16e/0xa50
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0a17b7>] SyS_exit+0x17/0x20
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f78bede>] system_call_fastpath+0x25/0x2a
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page:fffffdeec998ffc0 count:0 mapcount:0 mapping: (null) index:0x7fbd75fff
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page flags: 0x2fffff0008001c(referenced|uptodate|dirty|swapbacked)
Nov 3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page dumped because: nonzero mapcount
Confirmed, this same issue is happening to us here too.
We saw this issue trigger a cascading system failure as other processes had their page maps mangled immediately afterwards. Those stack traces looked like:
WARNING: CPU: 15 PID: 73763 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
list_del corruption. prev->next should be ffffda5031e17da0, but was ffffda50316dffa0
Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security rpcsec_gss_krb5 nfsv4 ext4 nfs fscache mbcache jbd2 drbg ansi_cprng cmac arc4 md4 nls_utf8 cifs ccm dns_resolver dm_mirror dm_region_hash dm_log dm_mod mlx4_en sb_edac iosf_mbi kvm_intel mlx4_core kvm devlink irqbypass crc32_pclmul ghash_clmulni_intel joydev aesni_intel lrw gf128mul glue_helper ablk_helper hv_utils cryptd ptp sg pps_core pci_hyperv hv_balloon i2c_piix4 pcspkr nfsd nfs_acl lockd auth_rpcgss grace binfmt_misc sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi hv_storvsc scsi_transport_fc hv_netvsc hid_hyperv hyperv_keyboard scsi_tgt crct10dif_pclmul crct10dif_common crc32c_intel ata_piix hyperv_fb floppy libata serio_raw hv_vmbus <OMITTED>
CPU: 15 PID: 73763 Comm: abrtd Tainted: P B OE ------------ 3.10.0-957.5.1.el7.x86_64 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
Call Trace:
[<ffffffffaed61e41>] dump_stack+0x19/0x1b
[<ffffffffae697688>] __warn+0xd8/0x100
[<ffffffffae69770f>] warn_slowpath_fmt+0x5f/0x80
[<ffffffffae995041>] __list_del_entry+0xa1/0xd0
[<ffffffffae99507d>] list_del+0xd/0x30
[<ffffffffae7bf50d>] free_pcppages_bulk+0x1bd/0x3a0
[<ffffffffae7bf9b1>] free_hot_cold_page+0x141/0x160
[<ffffffffae7c47f3>] __put_single_page+0x23/0x30
[<ffffffffae7c4845>] put_page+0x45/0x60
[<ffffffffae7ff51c>] free_page_and_swap_cache+0x4c/0x80
[<ffffffffae7e5e90>] tlb_remove_table+0x90/0xc0
[<ffffffffae677d77>] ___pte_free_tlb+0x47/0x50
[<ffffffffae7e631e>] free_pgd_range+0x36e/0x620
[<ffffffffae7e669d>] free_pgtables+0xcd/0x120
[<ffffffffae7f2f56>] exit_mmap+0xc6/0x1a0
[<ffffffffae694087>] mmput+0x67/0xf0
[<ffffffffae8493d0>] flush_old_exec+0x4e0/0x970
[<ffffffffae8a619c>] load_elf_binary+0x33c/0xd90
[<ffffffffc0479064>] ? load_misc_binary+0x64/0x460 [binfmt_misc]
[<ffffffffae91dd93>] ? ima_get_action+0x23/0x30
[<ffffffffae91d2ae>] ? process_measurement+0x8e/0x250
[<ffffffffae91d769>] ? ima_bprm_check+0x49/0x50
[<ffffffffae848a5a>] search_binary_handler+0x9a/0x1c0
[<ffffffffae84a166>] do_execve_common.isra.24+0x616/0x880
[<ffffffffae84a669>] SyS_execve+0x29/0x30
[<ffffffffaed75368>] stub_execve+0x48/0x80
---[ end trace b9eb9ba132b9309c ]---
In case it helps with the diagnosis, this was seen on a 64GB database server just as a massive batch process was starting, on a server that had been rebooted the night before. That is pretty much a worst-case scenario for memory page allocation.
Others have also reported similar bugs when massive and sudden changes happen in memory allocation. That might make a good torture test for triggering the issue in the Azure host agent. For example: https://bugzilla.redhat.com/show_bug.cgi?id=1023635#c3
Thanks for the info @sbonds. The kernel bug seems related to hugepages and happens whether transparent_hugepage is enabled or disabled.
@wagnst, @eehret, could you confirm whether transparent_hugepage is enabled on your VMs?
@johanburati Looks like yes. I'm using the Azure RHEL 7.7 marketplace image, by the way.
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
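For checking the same thing across many VMs, the active mode is the bracketed token in that file. A small helper could extract it; this is a sketch, with the standard RHEL 7 sysfs path as the default:

```shell
#!/bin/sh
# Sketch: print the active THP mode ("always", "madvise", or "never").
# The default path is the standard RHEL 7 sysfs location.
thp_mode() {
  sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}
```

On the VM above, `thp_mode` would print `always`.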
@eehret, thank you for checking.
I have seen in a sosreport that it also happens when transparent_hugepage is set to never.
There is a pre-release with a fix that disables transparent_hugepage only for the OMS agent.
Looks like OmsAgentForLinux-1.12.15 is being rolled out to the different regions.
If you have disabled auto-upgrade, use the following commands to upgrade:
az vm extension delete --name OMSAgentForLinux --resource-group myResourceGroup --vm-name myVM
az vm extension set --name OMSAgentForLinux --publisher Microsoft.EnterpriseCloud.Monitoring --version 1.12.15 --protected-settings '{"workspaceKey":"myWorkspaceKey"}' --settings '{"workspaceId":"myWorkspaceId"}' --resource-group myResourceGroup --vm-name myVM
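To confirm a VM actually picked up the fixed version, you could compare the extension's reported version against 1.12.15. The version comparison below is a self-contained sketch; the commented az query is illustrative and assumes an authenticated session:

```shell
#!/bin/sh
# Sketch: dotted-version comparison; succeeds when $1 >= $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}
# On a live VM (illustrative):
#   v=$(az vm extension show --name OMSAgentForLinux \
#         --resource-group myResourceGroup --vm-name myVM \
#         --query typeHandlerVersion -o tsv)
#   version_ge "$v" 1.12.15 && echo "THP workaround present"
```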
We have not seen this issue recur on the 1.12.15 version of the agent that self-disables THP.
I got the same error "BUG: Bad page state in process omicli pfn:1246bff"
OS: Red Hat Enterprise Linux Server release 7.7 (Maipo)
Kernel: 3.10.0-1062.7.1.el7.x86_64
OmsAgentForLinux: 1.12.15
kernel: BUG: Bad page state in process omicli pfn:1246bff
kernel: page:fffff2b8c91affc0 count:0 mapcount:1 mapping: (null) index:0x7f67fedff
kernel: page flags: 0x2fffff00080018(uptodate|dirty|swapbacked)
kernel: page dumped because: nonzero mapcount
kernel: Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack libcrc32c iptable_security drbg ansi_cprng cmac arc4 md4 nls_utf8 cifs ccm dns_resolver binfmt_misc joydev hv_utils ptp pps_core hv_balloon sg sb_edac iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper i2c_piix4 pcspkr cryptd ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi hv_storvsc scsi_transport_fc hid_hyperv hv_netvsc hyperv_keyboard scsi_tgt ata_piix crct10dif_pclmul hyperv_fb crct10dif_common libata crc32c_intel hv_vmbus floppy serio_raw dm_mirror dm_region_hash dm_log dm_mod
kernel: CPU: 12 PID: 117633 Comm: omicli Tainted: G B ------------ T 3.10.0-1062.7.1.el7.x86_64 #1
kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
kernel: Call Trace:
kernel: [
Is there any fix available for this?
Yikes! That agent version was supposed to fix this.
OmsAgentForLinux_1.12.15
Unless Red Hat fixes the kernel bug, I am afraid there will always be a chance of the system hitting it.
But this is the first report of the bug I have seen since OmsAgentForLinux_1.12.15 was released, so I guess the chance of it occurring is low.
@sbonds, I think you have quite a few servers; did any of those hit the bug after OmsAgentForLinux was upgraded to version 1.12.15?
@GhanshamMahajan, I would suggest you open a ticket with Azure support for investigation, if you haven't done so already.
Yes, I have raised a ticket with Azure. The VM was stuck and showed the following message on the Azure Serial Console before reboot: hyperv_fb: Unable to send packet via vmbus. After rebooting the server I saw the logs above: "kernel: BUG: Bad page state in process omicli pfn:1246bff"
I will update here once I get the root cause.
So far we have not seen this on any of our servers running 1.12.15 or newer. I notice that the bug is tracked under the CentOS umbrella. Based on this Red Hat article (login required) https://access.redhat.com/solutions/4542021 I can see that the bug is tracked via Red Hat's internal private Bugzilla 1674266, but I have no visibility into its progress there. As our Red Hat support is via Azure support, I doubt I can get a direct answer from Red Hat on this either.
I have regular meetings with Microsoft about this and other agent issues. I'll mention that the workaround isn't completely effective and see what I can do to help raise awareness.
Thanks @sbonds, @GhanshamMahajan
Until Red Hat releases a fix, you could try disabling THP on the whole system: https://access.redhat.com/solutions/1320153
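The system-wide workaround boils down to writing `never` into the THP sysfs knobs. A runtime-only sketch follows; it validates the mode before writing, requires root, and is not persistent across reboots (for persistence Red Hat suggests a tuned profile or `transparent_hugepage=never` on the kernel command line):

```shell
#!/bin/sh
# Sketch: set the THP mode at runtime on RHEL 7. Runtime-only; the setting
# reverts on reboot. Must run as root to write the sysfs files.
set_thp() {
  case "$1" in
    always|madvise|never) ;;
    *) echo "usage: set_thp always|madvise|never" >&2; return 1 ;;
  esac
  echo "$1" > /sys/kernel/mm/transparent_hugepage/enabled &&
  echo "$1" > /sys/kernel/mm/transparent_hugepage/defrag
}
# As root: set_thp never
```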
FYI, Red Hat has fixed the issue:
Nice! Now we just need to get outages for patching...
The issue has been fixed in the Red Hat OS, so this can be closed.
I know this is a known kernel bug, but since version 1.12.5 of the OmsAgentForLinux extension was released there have been a lot of occurrences of it.
Other processes are hitting the bug, but it seems it always starts with omicli:
I could see the issue on CentOS/RHEL 7 with different kernels:
Will follow up internally.
Related: