Open jkevinezz opened 10 months ago
Hi, as a volunteer here, have you tried to open a SR @ https://www.vmware.com/go/customerconnect ? See SAM offering. For TKG, you could collect all logfiles in reference to kb90319. Also, have a look to the VMware Tanzu Compliance Product Documentation. Could be a subcomponent bug and/or resource limitation related without burst possibility, but without logs and compliance status that's a guess only. Hope this helps.
Yes, we opened multiple cases with VMware support and the Tanzu team, and they have stated that the cgroup memory exhaustions come from the Photon kernel and that we should open a bug with the Photon OS team, which is why we opened this bug report. We have had over 8 cases.
Please let us know which logs you need from Photon OS and how to get them, and we can provide those logs from the Photon 3 based VM which we use as a Tanzu node.
Please reply to all so my team can be aware of these email exchanges
Thx Julius
orchestrating 8 cases++ /cc @Vasavisirnapalli
@jkevinezz ,
Which kernel version are you using?
Do you see cgroup.memory=nokmem in cat /proc/cmdline?
Could you please share kernel logs.
Thanks.
Could you please tell me how to gather kernel logs from Photon 3.0?
Here is a log snippet we saved from one of the Photon 3.0 Tanzu nodes' random reboots.
When you say kernel logs, you just want the VM logs, right?
Thx Julius
I cannot see any log snippet.
Check the kernel version via uname -a. Kernel logs are the dmesg command output. Also run cat /proc/cmdline to check whether the cgroup.memory=nokmem parameter is present.
We suspect it may be an older kernel issue which was fixed by
https://github.com/vmware/photon/commit/1c4e9360cc516c9e9a086b441c9b4df63df3449a
Thank you
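For reference, a minimal sketch of how these diagnostics could be collected on a Photon 3.0 node (the output file paths are illustrative, and the journalctl step only helps if persistent journald logging is enabled, which is an assumption):

# Kernel version
uname -a
# Check whether cgroup.memory=nokmem is on the kernel command line
grep -o 'cgroup.memory=nokmem' /proc/cmdline
# Kernel ring buffer ("kernel logs"); -T prints human-readable timestamps
dmesg -T > /tmp/dmesg-$(hostname).txt
# Kernel messages from journald, if persistence is enabled (assumption)
journalctl -k --no-pager > /tmp/journal-kernel-$(hostname).txt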
We run the same Photon 3.0 in 3 different datacenters and it only seems to be happening in 1 datacenter, but we will check what you have posted and get back.
Thank you all
@.*** [ ~ ]$ sudo su
root [ /home/capv ]# uname -a
Linux ts-sharedplatform-ash-prod-md1-7cd78d79b9-w2gmm 4.19.189-5.ph3 #1-photon SMP Thu May 13 16:00:29 UTC 2021 x86_64 GNU/Linux
root [ /home/capv ]# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.19.189-5.ph3 root=PARTUUID=aac1ba00-26c4-414e-9662-611408371055 init=/lib/systemd/systemd ro loglevel=3 quiet no-vmw-sta cgroup.memory=nokmem net.ifnames=0 plymouth.enable=0 systemd.legacy_systemd_cgroup_controller=yes
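If it helps, a rough sketch for cross-checking which Photon kernel package is installed and whether the repos offer a newer build carrying the referenced fix (package names are flavor-dependent, e.g. linux vs linux-esx, so the linux-esx name below is an assumption):

# List installed kernel packages (the flavor varies, e.g. linux vs linux-esx)
rpm -qa 'linux*'
# Changelog of the installed kernel flavor (substitute the package name listed above)
rpm -q --changelog linux-esx | head -n 20
# Check the configured repos for available updates (includes kernel packages, if any)
tdnf check-update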
@prashant1221 what about the 'kernel panic' fix https://github.com/vmware/photon/commit/f029de15b4453daa80fb4edd4b81b0a9eb021f96, in correlation to the 'node random reboot' and the features eligible for the 3 datacenters? Here is a patch filtering attempt using keywords.
Also, can you please share the output of slabtop -sc --once on the nodes which experience this issue often.
@jkevinezz fyi
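As a possible complement to the one-off snapshot, a minimal sketch of watching slab usage over time to see whether any cache grows unbounded (the interval and log path are arbitrary choices, not anything requested above):

# Periodically record overall slab usage plus the top slab caches (hypothetical watcher loop, run as root)
while true; do
  date
  grep -E '^(Slab|SReclaimable|SUnreclaim):' /proc/meminfo
  slabtop -sc --once | head -n 25
  sleep 300
done >> /var/log/slab-watch.log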
According to the Photon OS – Planned End of Support Schedule, an upgrade of Photon OS 3 is recommended.
Patching/updating as a continuous action has been addressed in the last years by introducing a number of improvements.
The current docs provide a short description of the upgrade process, which is very easy btw.
In-place migrations including FIPS mode, BIOS->UEFI, kernel, docker, kubernetes, etc. afaik were never drilled down systematically in the docs of the open-source version of Photon OS. My bad, the doc attempt here was somewhat insufficient and no other attempts have been made since then. As every piece of software is a continuous pulse of additions/deletions, yes, there were a few issues as well, e.g. 1244, 1226, 1234, 1420.
Having said that, with your VMware Support Account Manager, populating a migration path solution should be considered, using
As soon as the content of your Tanzu K8S nodes is sort of sbom'ified and populated for the migration, planning it for the maintenance schedule gets easy.
Yes, we are in the process, but it takes time; we are a huge environment, so we need to understand what's happening in Photon 3.0.
@. ~ % ssh @*.**@*.>
@. [ ~ ]$ sudo su
root [ /home/capv ]# slabtop -sc --once
Active / Total Objects (% used) : 17171204 / 17313766 (99.2%)
Active / Total Slabs (% used) : 914338 / 914629 (100.0%)
Active / Total Caches (% used) : 107 / 135 (79.3%)
Active / Total Size (% used) : 3345421.32K / 3394139.04K (98.6%)
Minimum / Average / Maximum Object : 0.02K / 0.20K / 4096.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 670320 670192 99% 1.05K 223440 3 893760K ext4_inode_cache 8664357 8578099 99% 0.10K 222163 39 888652K buffer_head 2282826 2279920 99% 0.19K 108706 21 434824K dentry 322629 322627 99% 1.05K 107543 3 430172K nfs_inode_cache 264138 262238 99% 0.56K 37734 7 150936K radix_tree_node 126452 125853 99% 1.00K 31613 4 126452K kmalloc-1024 166170 166059 99% 0.58K 27695 6 110780K inode_cache 137094 137016 99% 0.66K 22849 6 91396K ovl_inode 164224 163864 99% 0.50K 20528 8 82112K kmalloc-512 283360 281753 99% 0.25K 17710 16 70840K skbuff_head_cache 370923 370789 99% 0.19K 17663 21 70652K kmalloc-192 879616 872108 99% 0.06K 13744 64 54976K kmalloc-64 1319967 1311072 99% 0.04K 13333 99 53332K ext4_extent_status 2436 2370 97% 10.94K 2436 1 38976K task_struct 283840 282911 99% 0.12K 8870 32 35480K kmalloc-128 540351 538860 99% 0.06K 8577 63 34308K dmaengine-unmap-2 182144 181804 99% 0.12K 5692 32 22768K kernfs_node_cache 164064 157068 95% 0.12K 5127 32 20508K kmalloc-96 10 1 10% 2048.00K 10 1 20480K kmalloc-2097152 16734 16529 98% 0.65K 2789 6 11156K proc_inode_cache 272304 269634 99% 0.03K 2196 124 8784K kmalloc-32 2077 2020 97% 4.00K 2077 1 8308K kmalloc-4096 28368 28137 99% 0.25K 1773 16 7092K kmalloc-256 6 2 33% 1024.00K 6 1 6144K kmalloc-1048576 3036 2906 95% 2.00K 1518 2 6072K kmalloc-2048 29421 29299 99% 0.19K 1401 21 5604K cred_jar 6534 6322 96% 0.68K 594 11 4752K shmem_inode_cache 7 1 14% 512.00K 7 1 3584K kmalloc-524288 12736 8444 66% 0.25K 796 16 3184K filp 6180 5891 95% 0.38K 618 10 2472K mnt_cache 3264 3181 97% 0.62K 544 6 2176K sock_inode_cache 233 222 95% 8.00K 233 1 1864K kmalloc-8192 57 24 42% 32.00K 57 1 1824K kmalloc-32768 648 622 95% 2.25K 216 3 1728K TCPv6 6 2 33% 256.00K 6 1 1536K kmalloc-262144 6032 5061 83% 0.25K 377 16 1508K nf_conntrack 22848 22448 98% 0.06K 357 64 1428K anon_vma_chain 17700 17449 98% 0.08K 354 50 1416K anon_vma 531 523 98% 2.06K 177 3 1416K sighand_cache 10 1 10% 128.00K 10 1 1280K kmalloc-131072 158 115 72% 8.00K 158 1 1264K biovec-max 992 960 96% 0.94K 248 4 992K RAW 7584 5319 70% 0.12K 237 32 948K pid 351 337 96% 2.12K 117 3 936K TCP 805 745 92% 1.06K 115 7 920K signal_cache 56 41 73% 16.00K 56 1 896K kmalloc-16384 3472 1400 40% 0.25K 217 16 868K pool_workqueue 11 9 81% 64.00K 11 1 704K kmalloc-65536 609 601 98% 1.12K 87 7 696K RAWv6 168 168 100% 4.00K 168 1 672K names_cache 3465 3365 97% 0.19K 165 21 660K proc_dir_entry 536 457 85% 1.00K 134 4 536K UNIX 627 576 91% 0.69K 57 11 456K files_cache 2268 1821 80% 0.19K 108 21 432K dmaengine-unmap-16 5432 4842 89% 0.07K 97 56 388K Acpi-Operand 380 380 100% 1.00K 95 4 380K mm_struct 1547 1233 79% 0.23K 91 17 364K tw_sock_TCPv6 2408 2400 99% 0.14K 86 28 344K ext4_groupinfo_4k 632 540 85% 0.50K 79 8 316K skbuff_fclone_cache 2048 1746 85% 0.12K 64 32 256K secpath_cache 29 29 100% 5.75K 29 1 232K net_namespace 4482 4282 95% 0.05K 54 83 216K ftrace_event_field 2856 2431 85% 0.07K 51 56 204K eventpoll_pwq 867 764 88% 0.23K 51 17 204K tw_sock_TCP 1472 766 52% 0.12K 46 32 184K scsi_sense_cache 2070 1924 92% 0.09K 45 46 180K trace_event_file 4257 3858 90% 0.04K 43 99 172K Acpi-Namespace 216 141 65% 0.62K 36 6 144K task_group 850 590 69% 0.16K 34 25 136K sigqueue 276 146 52% 0.32K 23 12 92K taskstats 1173 1078 91% 0.08K 23 51 92K inotify_inode_mark 33 26 78% 2.40K 11 3 88K request_queue 40 34 85% 2.00K 20 2 80K biovec-128 57 30 52% 1.25K 19 3 76K UDPv6 663 406 61% 0.10K 17 39 68K blkdev_ioc 1008 775 76% 0.06K 16 63 64K fs_cache 60 30 50% 0.81K 15 4 60K bdev_cache 195 163 
83% 0.29K 15 13 60K request_sock_TCP 224 135 60% 0.25K 14 16 56K dquot 192 134 69% 0.25K 12 16 48K kmem_cache 781 496 63% 0.05K 11 71 44K Acpi-Parse 396 174 43% 0.11K 11 36 44K jbd2_journal_head 40 37 92% 0.94K 10 4 40K mqueue_inode_cache 10 4 40% 4.00K 10 1 40K sgpool-128 1467 1147 78% 0.02K 9 163 36K fsnotify_mark_connector 152 72 47% 0.20K 8 19 32K ip4-frags 32 32 100% 0.88K 8 4 32K nfs_read_data 693 449 64% 0.04K 7 99 28K pde_opener 140 9 6% 0.20K 7 20 28K file_lock_cache 448 132 29% 0.06K 7 64 28K ext4_io_end 45 29 64% 0.43K 5 9 20K uts_namespace 20 13 65% 1.00K 5 4 20K biovec-64 415 269 64% 0.05K 5 83 20K jbd2_journal_handle 48 32 66% 0.31K 4 12 16K xfrm_dst_cache 96 71 73% 0.12K 3 32 12K ext4_allocation_context 30 15 50% 0.26K 2 15 8K numa_policy 7 1 14% 1.06K 1 7 8K dmaengine-unmap-128 3 1 33% 2.06K 1 3 8K dmaengine-unmap-256 12 3 25% 0.60K 2 6 8K hugetlbfs_inode_cache 142 59 41% 0.05K 2 71 8K mbcache 248 2 0% 0.03K 2 124 8K xfs_ifork 7 1 14% 1.12K 1 7 8K PINGv6 11 4 36% 0.69K 1 11 8K nfs_commit_data 51 27 52% 0.08K 1 51 4K Acpi-State 5 1 20% 0.75K 1 5 4K dax_cache 240 2 0% 0.02K 1 240 4K jbd2_revoke_table_s 5 3 60% 0.71K 1 5 4K fat_inode_cache 0 0 0% 4096.00K 0 1 0K kmalloc-4194304 0 0 0% 0.11K 0 36 0K iint_cache 0 0 0% 0.45K 0 8 0K user_namespace 0 0 0% 0.30K 0 13 0K blkdev_requests 0 0 0% 0.94K 0 4 0K PING 0 0 0% 0.75K 0 5 0K xfrm_state 0 0 0% 0.23K 0 17 0K posix_timers_cache 0 0 0% 0.03K 0 124 0K dnotify_struct 0 0 0% 0.80K 0 5 0K ext2_inode_cache 0 0 0% 0.03K 0 124 0K jbd2_revoke_record_s 0 0 0% 0.62K 0 6 0K isofs_inode_cache 0 0 0% 0.73K 0 5 0K udf_inode_cache 0 0 0% 0.18K 0 22 0K xfs_log_ticket 0 0 0% 0.22K 0 18 0K xfs_btree_cur 0 0 0% 0.47K 0 8 0K xfs_da_state 0 0 0% 0.27K 0 15 0K xfs_buf_item 0 0 0% 0.43K 0 9 0K xfs_efd_item 0 0 0% 0.94K 0 4 0K xfs_inode 0 0 0% 0.17K 0 23 0K xfs_rud_item 0 0 0% 0.68K 0 11 0K xfs_rui_item 0 0 0% 0.21K 0 18 0K xfs_bui_item 0 0 0% 0.49K 0 8 0K xfs_dquot 0 0 0% 0.52K 0 7 0K xfs_dqtrx 0 0 0% 0.12K 0 34 0K cfq_io_cq 0 0 0% 0.29K 0 13 0K request_sock_TCPv6 0 0 0% 0.03K 0 124 0K fat_cache 0 0 0% 0.62K 0 6 0K rpc_inode_cache 0 0 0% 0.35K 0 11 0K nfs_direct_cache root [ /home/capv ]#
2023-12-04T10:49:30.103Z In(05) vcpu-5 - Guest: <4>[39261021.091603] Call Trace:
2023-12-04T10:49:30.103Z In(05) vcpu-5 - Guest: <4>[39261021.091611] dump_stack+0x6d/0x8b
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091614] dump_header+0x6c/0x282
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091619] oom_kill_process+0x243/0x270
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091620] out_of_memory+0x100/0x4e0
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091624] mem_cgroup_out_of_memory+0xa4/0xc0
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091626] try_charge+0x700/0x740
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091628] ? alloc_pages_nodemask+0xdc/0x250
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091630] mem_cgroup_try_charge+0x86/0x190
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091632] mem_cgroup_try_charge_delay+0x1d/0x40
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091636] handle_mm_fault+0x823/0xee0
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091639] ? switch_to_asm+0x35/0x70
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091640] handle_mm_fault+0xde/0x240
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091643] __do_page_fault+0x226/0x4b0
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091644] do_page_fault+0x2d/0xf0
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091646] ? page_fault+0x8/0x30
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091646] page_fault+0x1e/0x30
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091648] RIP: 0033:0x1321050
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091650] Code: 31 f6 41 ba 7f 00 00 00 41 bb fd ff ff ff 0f b6 10 44 0f b6 65 df 48 83 c0 01 84 d2 78 3f 41 80 fc 0c 75 39 66 0f 1f 44 00 00 <66> 89 13 48 83 c3 02 49 39 c1 77 d8 0f 1f 40 00 48 8d 7d df e8 c7
2023-12-04T10:49:30.104Z In(05) vcpu-5 - Guest: <4>[39261021.091650] RSP: 002b:00007fffffa40470 EFLAGS: 00010246
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <4>[39261021.091651] RAX: 00007f13068e5787 RBX: 00001e4cc8552000 RCX: 0000000000000001
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <4>[39261021.091652] RDX: 0000000000000022 RSI: 0000000000000000 RDI: 000000000000003f
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <4>[39261021.091652] RBP: 00007fffffa404a0 R08: 000000000000000c R09: 00007f1306e15e1d
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <4>[39261021.091653] R10: 000000000000007f R11: 00000000fffffffd R12: 000000000000000c
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <4>[39261021.091653] R13: 00007f130686be5a R14: 0000000000c78e0d R15: 00007f130619d010
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091654] Task in /kubepods/burstable/pod69a00a89-5b57-4b35-9456-34f1b9f1d46f/9cc2d291785856a8b81704d96ca165405e2722f0d18e1a43a8154f457b0cbe18 killed as a result of limit of /kubepods/burstable/pod69a00a89-5b57-4b35-9456-34f1b9f1d46f
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091660] memory: usage 614400kB, limit 614400kB, failcnt 81016
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091660] memory+swap: usage 614400kB, limit 9007199254740988kB, failcnt 0
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091661] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091661] Memory cgroup stats for /kubepods/burstable/pod69a00a89-5b57-4b35-9456-34f1b9f1d46f: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091666] Memory cgroup stats for /kubepods/burstable/pod69a00a89-5b57-4b35-9456-34f1b9f1d46f/2a6484f8f136d41bf17b6925821421216e8b9aa524aff498a5efa1e7ac037f9c: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:40KB inactive_file:0KB active_file:0KB unevictable:0KB
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091670] Memory cgroup stats for /kubepods/burstable/pod69a00a89-5b57-4b35-9456-34f1b9f1d46f/9cc2d291785856a8b81704d96ca165405e2722f0d18e1a43a8154f457b0cbe18: cache:0KB rss:613540KB rss_huge:108544KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:614184KB inactive_file:4KB active_file:0KB unevictable:0KB
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091673] Tasks state (memory values in pages):
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091673] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091875] [ 23437] 0 23437 242 1 28672 0 -998 pause
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091920] [ 29498] 1000 29498 226154 29849 1282048 0 999 npm start
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091921] [ 29859] 1000 29859 620 17 45056 0 999 sh
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <6>[39261021.091922] [ 29860] 1000 29860 5590533 137614 7958528 0 999 node
2023-12-04T10:49:30.105Z In(05) vcpu-5 - Guest: <3>[39261021.091938] Memory cgroup out of memory: Kill process 29860 (node) score 1903 or sacrifice child
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <3>[39261021.092001] Killed process 29860 (node) total-vm:22362132kB, anon-rss:517572kB, file-rss:32884kB, shmem-rss:0kB
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <6>[39261021.113862] oom_reaper: reaped process 29860 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880124] node invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=999
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <6>[39261111.880126] node cpuset=64ecfc841af390d50da31f6be0a840d979b1a0178f2a9b1ddb9676162b3654ad mems_allowed=0-1
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880132] CPU: 10 PID: 30411 Comm: node Tainted: G W 4.19.189-5.ph3 #1-photon
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880134] Call Trace:
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880143] dump_stack+0x6d/0x8b
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880150] dump_header+0x6c/0x282
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880156] oom_kill_process+0x243/0x270
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880160] out_of_memory+0x100/0x4e0
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880165] mem_cgroup_out_of_memory+0xa4/0xc0
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880167] try_charge+0x700/0x740
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880170] ? alloc_pages_nodemask+0xdc/0x250
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880173] mem_cgroup_try_charge+0x86/0x190
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880175] mem_cgroup_try_charge_delay+0x1d/0x40
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880179] handle_mm_fault+0x823/0xee0
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880183] ? switch_to_asm+0x35/0x70
2023-12-04T10:49:30.106Z In(05) vcpu-5 - Guest: <4>[39261111.880186] handle_mm_fault+0xde/0x240
2023-12-04T10:49:30.107Z In(05) vcpu-5 - Guest: <4>[39261111.880190] do_page_fault+0x226/0x4b0
2023-12-04T10:49:30.107Z In(05) vcpu-5 - Guest: <4>[39261111.880191] do_page_fault+0x2d/0xf0
2023-12-04T10:49:30.107Z In(05) vcpu-5 - Guest: <4>[39261111.880194] ? page_fault+0x8/0x30
2023-12-04T10:49:30.107Z In(05) vcpu-5 - Guest: <4>[39261111.880195] page_fault+0x1e/0x30
2023-12-04T10:49:30.107Z In(05) vcpu-5 - Guest: <4>[39261111.880197] RIP: 0033:0x7f078f501b97
2023-12-04T10:49:30.107Z In(05) vcpu-5 - Guest: <4>[39261111.880199] Code: 48 39 f7 72 17 74 25 4c 8d 0c 16 4c 39 cf 0f 82 2a 02 00 00 48 89 f9 48 29 f1 eb 06 48 89 f1 48 29 f9 83 f9 3f 76 7b 48 89 d1
Here are some thoughts.
You have a pod with a memory limit set to 614400kB and a total-vm of 22318872kB. The memory limit is reached (mem_cgroup_out_of_memory), and for the excess memory consumption the oom-killer (out of memory) kicks in. The first process eligible for this is 29860. What is problematic is that afterwards oom_reaper reports that it didn't gain anything with that, see "now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB". This happens for the cascade of processes 29860, 30411, 26868, 29920, 4274, 3243, 15616, 9308, 2625, 19388. So, why just pain and no gain?
Unfortunately I'm not skilled enough to read the slabtop output. The kubernetes case Container Limit cgroup causing OOMkilled is still open. 'Tasks state (memory values in pages)' doesn't list RssAnon (size of resident anonymous memory), RssFile (size of resident file mappings) and RssShmem (size of resident shared memory); this has been addressed lately in a commit. In addition, this happens for higher kernel versions in constellations with cgroupv1 as well, see bugzilla bug 207273.
Btw. cgroupv2 was introduced a while ago, see https://kubernetes.io/docs/concepts/architecture/cgroups/#using-cgroupv2. Ph4 + Ph5 support cgroupv2.
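If it helps with correlation on a live node, a minimal sketch for inspecting the memory cgroup of the pod seen in the log above (cgroup v1 layout; the pod path is copied from the OOM log and is illustrative only):

# cgroup v1 memory controller view of the affected pod (path taken from the OOM log above)
CG=/sys/fs/cgroup/memory/kubepods/burstable/pod69a00a89-5b57-4b35-9456-34f1b9f1d46f
cat $CG/memory.limit_in_bytes $CG/memory.usage_in_bytes
cat $CG/memory.failcnt
# kernel-memory charge; expected to stay 0 with cgroup.memory=nokmem on the cmdline
cat $CG/memory.kmem.usage_in_bytes
# per-cgroup rss/cache breakdown, comparable to the 'Memory cgroup stats' lines in the log
grep -E '^(rss|cache|inactive_anon|active_anon|inactive_file|active_file)' $CG/memory.stat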
Describe the bug
We use Photon 3.0 as the OS for our Tanzu K8s nodes. Randomly we see a cgroup exhaustion error, then the VM reboots, and in the vCenter events it shows the CPU has been disabled.
Reproduction steps
Haven't been able to reproduce manually; it just happens randomly.
Expected behavior
How do we find out what's causing the cgroup exhaustion and, in turn, causing the Photon kernel to disable the CPU and reboot itself?
Additional context
No response