microsoft / omi

Open Management Infrastructure

Bad page state in process omicli #650

Closed johanburati closed 6 months ago

johanburati commented 4 years ago

I know this is a known kernel bug, but since version 1.12.5 of the OmsAgentForLinux extension was released there have been a lot more occurrences of it.

Other processes are hitting the bug too, but it seems it always starts with omicli:

Oct 26 03:19:45 azurevm01 kernel: BUG: Bad page state in process omicli  pfn:4d77ff
Oct 26 03:20:07 azurevm01 kernel: BUG: Bad page state in process sh  pfn:3a84b5
Oct 26 03:20:07 azurevm01 kernel: BUG: Bad page state in process sh  pfn:3a84b5
Oct 26 03:20:08 azurevm01 kernel: BUG: Bad page state in process sh  pfn:3a84b5
Oct 26 03:20:08 azurevm01 kernel: BUG: Bad page state in process sh  pfn:3a84b5

I could see the issue on CentOS/RHEL 7 with different kernels:

Will follow up internally.

Related:

johanburati commented 4 years ago

Found one occurrence of the issue with OmsAgentForLinux-1.11.15 on a RHEL 7.5 VM (kernel 3.10.0-862.34.2.el7.x86_64):

Sep  3 13:15:04 azurevm02 kernel: BUG: Bad page state in process omicli  pfn:3643ff
Sep  4 21:40:50 azurevm02 kernel: BUG: Bad page state in process python  pfn:3643ff
Sep  4 21:52:21 azurevm02 kernel: BUG: Bad page state: 762 messages suppressed

Also, the kernel bug report dates back to Aug 2018, so the bug is not limited to OmsAgentForLinux-1.12.5, but there have been a lot more occurrences of it since that release.

johanburati commented 4 years ago

For now, if your VM is hitting this bug, it is recommended that you downgrade to OmsAgentForLinux-1.11.15.

You can do it using the Azure CLI:

az vm extension delete --name OMSAgentForLinux --resource-group myResourceGroup --vm-name myVM
az vm extension set --name OMSAgentForLinux --publisher Microsoft.EnterpriseCloud.Monitoring --version 1.11.15 --protected-settings '{"workspaceKey":"myWorkspaceKey"}' --settings '{"workspaceId":"myWorkspaceId"}' --resource-group myResourceGroup --vm-name myVM --no-auto-upgrade
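
To double-check which version actually ends up installed, something along these lines should work (resource names are placeholders, and the exact --query path may differ between CLI versions):

# show the installed handler version of the OMS agent extension
az vm extension show --name OMSAgentForLinux --resource-group myResourceGroup --vm-name myVM --query typeHandlerVersion -o tsv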
wagnst commented 4 years ago

Confirmed, I think we are facing exactly the same thing on a RHEL 7.7 VM (kernel 3.10.0-1062.4.1.el7.x86_64):

Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: BUG: Bad page state in process omicli  pfn:2663ff
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: BUG: Bad page state in process omicli  pfn:2663ff
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page:fffffdeec998ffc0 count:0 mapcount:1 mapping:          (null) index:0x7fbd75fff
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page flags: 0x2fffff0008001c(referenced|uptodate|dirty|swapbacked)
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page dumped because: nonzero mapcount
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: Modules linked in: macsec tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack libcrc32c iptable_security joydev sb_edac iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_piix4 hv_utils ptp pps_core hv_balloon sg ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi hv_storvsc scsi_transport_fc hv_netvsc hyperv_keyboard hid_hyperv scsi_tgt ata_piix crct10dif_pclmul hyperv_fb crct10dif_common libata crc32c_intel hv_vmbus floppy serio_raw dm_mirror dm_region_hash dm_log dm_mod
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: CPU: 11 PID: 74085 Comm: omicli Tainted: G    B          ------------ T 3.10.0-1062.4.1.el7.x86_64 #1
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: Call Trace:
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f778ba4>] dump_stack+0x19/0x1b
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f773f07>] bad_page.part.75+0xdc/0xf9
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1c48ad>] free_pages_prepare+0x1ad/0x1d0
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1c5304>] free_hot_cold_page+0x74/0x160
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1ca6d3>] __put_single_page+0x23/0x30
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1ca725>] put_page+0x45/0x60
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1102bd>] get_futex_key+0xdd/0x2c0
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f11060a>] futex_wake+0x6a/0x180
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1c54b5>] ? __free_pages+0x25/0x30
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1132fa>] do_futex+0x12a/0x5a0
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0de5e4>] ? task_cputime_scaled+0x44/0x80
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f1137f0>] SyS_futex+0x80/0x190
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f113110>] ? exit_robust_list+0x60/0x120
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0975c7>] mm_release+0xf7/0x140
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0a0e9e>] do_exit+0x16e/0xa50
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f0a17b7>] SyS_exit+0x17/0x20
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: [<ffffffff9f78bede>] system_call_fastpath+0x25/0x2a
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page:fffffdeec998ffc0 count:0 mapcount:0 mapping:          (null) index:0x7fbd75fff
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page flags: 0x2fffff0008001c(referenced|uptodate|dirty|swapbacked)
Nov  3 15:20:05 cafutststab1r11-rabbitmq-02 kernel: page dumped because: nonzero mapcount
eehret commented 4 years ago

Confirmed, this same issue is happening to us here too.

sbonds commented 4 years ago

We saw this issue trigger a cascading system failure as other processes had their page maps mangled immediately afterwards. Those stack traces looked like:

WARNING: CPU: 15 PID: 73763 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
list_del corruption. prev->next should be ffffda5031e17da0, but was ffffda50316dffa0
Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security rpcsec_gss_krb5 nfsv4 ext4 nfs fscache mbcache jbd2 drbg ansi_cprng cmac arc4 md4 nls_utf8 cifs ccm dns_resolver dm_mirror dm_region_hash dm_log dm_mod mlx4_en sb_edac iosf_mbi kvm_intel mlx4_core kvm devlink irqbypass crc32_pclmul ghash_clmulni_intel joydev aesni_intel lrw gf128mul glue_helper ablk_helper hv_utils cryptd ptp sg pps_core pci_hyperv hv_balloon i2c_piix4 pcspkr nfsd nfs_acl lockd auth_rpcgss grace binfmt_misc sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi hv_storvsc scsi_transport_fc hv_netvsc hid_hyperv hyperv_keyboard scsi_tgt crct10dif_pclmul crct10dif_common crc32c_intel ata_piix hyperv_fb floppy libata serio_raw hv_vmbus <OMITTED>
CPU: 15 PID: 73763 Comm: abrtd Tainted: P    B      OE  ------------   3.10.0-957.5.1.el7.x86_64 #1
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
Call Trace:
 [<ffffffffaed61e41>] dump_stack+0x19/0x1b
 [<ffffffffae697688>] __warn+0xd8/0x100
 [<ffffffffae69770f>] warn_slowpath_fmt+0x5f/0x80
 [<ffffffffae995041>] __list_del_entry+0xa1/0xd0
 [<ffffffffae99507d>] list_del+0xd/0x30
 [<ffffffffae7bf50d>] free_pcppages_bulk+0x1bd/0x3a0
 [<ffffffffae7bf9b1>] free_hot_cold_page+0x141/0x160
 [<ffffffffae7c47f3>] __put_single_page+0x23/0x30
 [<ffffffffae7c4845>] put_page+0x45/0x60
 [<ffffffffae7ff51c>] free_page_and_swap_cache+0x4c/0x80
 [<ffffffffae7e5e90>] tlb_remove_table+0x90/0xc0
 [<ffffffffae677d77>] ___pte_free_tlb+0x47/0x50
 [<ffffffffae7e631e>] free_pgd_range+0x36e/0x620
 [<ffffffffae7e669d>] free_pgtables+0xcd/0x120
 [<ffffffffae7f2f56>] exit_mmap+0xc6/0x1a0
 [<ffffffffae694087>] mmput+0x67/0xf0
 [<ffffffffae8493d0>] flush_old_exec+0x4e0/0x970
 [<ffffffffae8a619c>] load_elf_binary+0x33c/0xd90
 [<ffffffffc0479064>] ? load_misc_binary+0x64/0x460 [binfmt_misc]
 [<ffffffffae91dd93>] ? ima_get_action+0x23/0x30
 [<ffffffffae91d2ae>] ? process_measurement+0x8e/0x250
 [<ffffffffae91d769>] ? ima_bprm_check+0x49/0x50
 [<ffffffffae848a5a>] search_binary_handler+0x9a/0x1c0
 [<ffffffffae84a166>] do_execve_common.isra.24+0x616/0x880
 [<ffffffffae84a669>] SyS_execve+0x29/0x30
 [<ffffffffaed75368>] stub_execve+0x48/0x80
---[ end trace b9eb9ba132b9309c ]---
sbonds commented 4 years ago

In case it helps with the diagnosis, this was seen on a 64 GB database server, just as a massive batch process was starting, on a machine that had been rebooted the night before. It's pretty much a worst-case scenario for memory page allocation.

Others have also reported similar bugs when massive and sudden changes happen in memory allocation. That might make a good torture test for triggering the issue in the Azure host agent. For example: https://bugzilla.redhat.com/show_bug.cgi?id=1023635#c3
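
If anyone wants a reproducer to hand to support, a rough sketch of such a torture test, assuming stress-ng is available (the stressor mix and sizes here are guesses, not a validated reproducer), would be:

# hammer anonymous memory with large, sudden allocate/free cycles to exercise THP splitting
stress-ng --vm 8 --vm-bytes 75% --vm-method all --timeout 10m --metrics-brief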

johanburati commented 4 years ago

Thanks for the info @sbonds. The kernel bug seems related to hugepages and happens whether transparent_hugepage is enabled or disabled.

@wagnst, @eehret, could you confirm whether transparent_hugepage is enabled on your VMs?
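
You can check with (assuming the standard sysfs location):

# current THP mode; the active value is shown in brackets
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag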

eehret commented 4 years ago

@johanburati Looks like yes. I'm using the Azure RHEL 7.7 marketplace image, by the way.

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

johanburati commented 4 years ago

@eehret, thank you for checking.

I have seen in a sosreport that it is also happening when transparent_hugepage is set to never.

johanburati commented 4 years ago

There is a pre-release with a fix that disables transparent hugepages (THP) for the OMS agent only.

johanburati commented 4 years ago

Looks like OmsAgentForLinux-1.12.15 is being pushed to the different regions.

If you have disabled auto-upgrade, use the following commands to upgrade:

az vm extension delete --name OMSAgentForLinux --resource-group myResourceGroup --vm-name myVM
az vm extension set --name OMSAgentForLinux --publisher Microsoft.EnterpriseCloud.Monitoring --version 1.12.15 --protected-settings '{"workspaceKey":"myWorkspaceKey"}' --settings '{"workspaceId":"myWorkspaceId"}' --resource-group myResourceGroup --vm-name myVM
sbonds commented 4 years ago

We have not seen this issue recur on the 1.12.15 version of the agent that self-disables THP.

GhanshamMahajan commented 4 years ago

I got the same error "BUG: Bad page state in process omicli pfn:1246bff"

OS: Red Hat Enterprise Linux Server release 7.7 (Maipo)
Kernel: 3.10.0-1062.7.1.el7.x86_64
OmsAgentForLinux_1.12.15

kernel: BUG: Bad page state in process omicli pfn:1246bff
kernel: page:fffff2b8c91affc0 count:0 mapcount:1 mapping: (null) index:0x7f67fedff
kernel: page flags: 0x2fffff00080018(uptodate|dirty|swapbacked)
kernel: page dumped because: nonzero mapcount
kernel: Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack libcrc32c iptable_security drbg ansi_cprng cmac arc4 md4 nls_utf8 cifs ccm dns_resolver binfmt_misc joydev hv_utils ptp pps_core hv_balloon sg sb_edac iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper i2c_piix4 pcspkr cryptd ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi hv_storvsc scsi_transport_fc hid_hyperv hv_netvsc hyperv_keyboard scsi_tgt ata_piix crct10dif_pclmul hyperv_fb crct10dif_common libata crc32c_intel hv_vmbus floppy serio_raw dm_mirror dm_region_hash dm_log dm_mod
kernel: CPU: 12 PID: 117633 Comm: omicli Tainted: G B ------------ T 3.10.0-1062.7.1.el7.x86_64 #1
kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
kernel: Call Trace:
kernel: [] dump_stack+0x19/0x1b
kernel: [] bad_page.part.75+0xdc/0xf9
kernel: [] free_pages_prepare+0x1ad/0x1d0
kernel: [] free_hot_cold_page+0x74/0x160
kernel: [] __put_single_page+0x23/0x30
kernel: [] put_page+0x45/0x60
kernel: [] split_huge_page+0x387/0x880
kernel: [] split_huge_page_to_list+0x76/0xf0
kernel: [] split_huge_page_pmd+0x1d0/0x5c0
kernel: [] ? filemap_fault+0x17e/0x490
kernel: [] unmap_page_range+0xc63/0xc80
kernel: [] unmap_single_vma+0x81/0xf0
kernel: [] zap_page_range+0xc5/0x130
kernel: [] SyS_madvise+0x49d/0xac0
kernel: [] system_call_fastpath+0x25/0x2a
sh: abrt-dump-oops: Found oopses: 1
sh: abrt-dump-oops: Creating problem directories
sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
abrt-dump-oops: Reported 1 kernel oopses to Abrt
kernel: TECH PREVIEW: eBPF syscall may not be fully supported.#012Please review provided documentation for limitations.

Is there any fix available for this?

sbonds commented 4 years ago

Yikes! That agent version was supposed to fix this.

OmsAgentForLinux_1.12.15

johanburati commented 4 years ago

Unless RedHat fixes the kernel bug, I am afraid there will always be a chance of the system hitting it.

But this is the first report of the bug I have seen since OmsAgentForLinux_1.12.15 was released, so I guess the chance of it occurring is low.

@sbonds, I think you have quite a few servers; did any of those hit the bug after OmsAgentForLinux was upgraded to version 1.12.15?

@GhanshamMahajan, I would suggest you open a ticket with Azure support for investigation, if you haven't done it already.

GhanshamMahajan commented 4 years ago

Yes, I have raised a ticket with Azure. The VM was stuck and showing the message below on the Azure Serial Console before the reboot:

hyperv_fb: Unable to send packet via vmbus

After rebooting the server I saw the logs above ("kernel: BUG: Bad page state in process omicli pfn:1246bff").

I will update here once I get the root cause.

sbonds commented 4 years ago

So far we have not seen this on any of our servers running 1.12.15 or newer. I notice that the bug is tracked under the CentOS umbrella. Based on this RedHat article (login required), https://access.redhat.com/solutions/4542021, I can see that the bug is tracked via RedHat's internal private Bugzilla 1674266, but I have no visibility into its progress there. As our RedHat support is via Azure support, I doubt I can get a direct answer from RedHat on this either.

I have regular meetings with Microsoft about this and other agent issues. I'll mention that the workaround isn't completely effective and see what I can do to help raise awareness.

johanburati commented 4 years ago

Thanks @sbonds, @GhanshamMahajan

Until RedHat releases a fix, maybe you could try disabling THP on the whole system: https://access.redhat.com/solutions/1320153
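
The article is behind a login, but the usual approach on RHEL 7 is roughly the following (the echo takes effect immediately but does not survive a reboot, hence the kernel command-line change as well):

# disable THP at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# make it persistent: add transparent_hugepage=never to GRUB_CMDLINE_LINUX in /etc/default/grub, then rebuild the config
grub2-mkconfig -o /boot/grub2/grub.cfg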

johanburati commented 4 years ago

FYI RedHat has fixed the issue:

sbonds commented 4 years ago

Nice! Now we just need to get outages for patching...

JumpingYang001 commented 6 months ago

The issue has been fixed by Red Hat in the OS, so closing this.