mkubecek / vmware-host-modules

Patches needed to build VMware (Player and Workstation) host modules against recent kernels
GNU General Public License v2.0

VMware Player 17.5.0 + Linux Kernel 6.6.1-1.1 Shutdown crashes #228

Open JoeSalmeri opened 10 months ago

JoeSalmeri commented 10 months ago

I updated to TW 20231113 today, which is using kernel 6.6.1-1.1.

I am using VMware Workstation Player 17.5.

The modules compile fine; I then sign them with my key and everything loads fine.
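
(For anyone who needs the Secure Boot signing step: it is roughly the sketch below. The key file names and the sign-file location are placeholders; both vary by distro and setup.)

    # One-time: create a Machine Owner Key and enroll it (mokutil asks for a
    # password; enrollment is confirmed from the MOK manager on the next reboot).
    openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 \
        -subj "/CN=VMware modules/" -keyout MOK.priv -outform DER -out MOK.der
    sudo mokutil --import MOK.der

    # After every module rebuild, sign both modules with the kernel's sign-file.
    # The sign-file path and module install dir are assumptions; adjust as needed.
    SIGN=/usr/src/linux-$(uname -r)/scripts/sign-file
    sudo $SIGN sha256 MOK.priv MOK.der /lib/modules/$(uname -r)/misc/vmmon.ko
    sudo $SIGN sha256 MOK.priv MOK.der /lib/modules/$(uname -r)/misc/vmnet.ko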

When I bring up a VM it also works fine until I attempt to shut it down.

At that point the VM starts shutting down, but then the process hangs.

The journal then has the following messages:

    Nov 16 16:55:24 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 4-...D } 18235 jiffies s: 1213 root: 0x10/.
    Nov 16 16:55:24 kernel: rcu: blocking rcu_node structures (internal RCU debug):
    Nov 16 16:55:24 kernel: Sending NMI from CPU 3 to CPUs 4:
    Nov 16 16:55:24 kernel: NMI backtrace for cpu 4 skipped: idling at intel_idle+0x62/0xb0

TW becomes less responsive and I end up having to reboot.

Looking at the journal after I rebooted, I found the following trace:

    Nov 16 16:26:45 kernel: WARNING: CPU: 3 PID: 6026 at kernel/rcu/tree_plugin.h:734 rcu_sched_clock_irq+0xb2c/0x1120
    Nov 16 16:26:45 kernel: Modules linked in: vmnet(O) vmmon(O) binfmt_misc snd_seq_dummy snd_hrtimer snd_seq af_packet nf_conntrack_netbios_ns nf_conntrack_b>
    Nov 16 16:26:45 kernel: irqbypass wmi_bmof rfkill i2c_i801 mxm_wmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi pcspkr i2c_smbus efi_pstore uvcvideo >
    Nov 16 16:26:45 kernel: CPU: 3 PID: 6026 Comm: vmware-vmx Tainted: G O 6.6.1-1-default #1 openSUSE Tumbleweed 0c6504f7d2c054731662677f280b3>
    Nov 16 16:26:45 kernel: Hardware name: ASUS All Series/MAXIMUS VI FORMULA, BIOS 1603 08/15/2014
    Nov 16 16:26:45 kernel: RIP: 0010:rcu_sched_clock_irq+0xb2c/0x1120
    Nov 16 16:26:45 kernel: Code: 38 08 00 00 85 c0 0f 84 f2 f5 ff ff e9 98 fc ff ff c6 87 39 08 00 00 01 e9 e1 f5 ff ff 4c 89 e7 e8 b9 8e f3 ff e9 0e ff ff ff>
    Nov 16 16:26:45 kernel: RSP: 0018:ffffc9000019ce08 EFLAGS: 00010082
    Nov 16 16:26:45 kernel: RAX: 00000000ffffffc2 RBX: 0000000000000000 RCX: 0000000009e820b1
    Nov 16 16:26:45 kernel: RDX: 000000000000c773 RSI: ffffffff9739b328 RDI: ffff8881bbe75180
    Nov 16 16:26:45 kernel: RBP: ffff8888209a8200 R08: 0000000000000000 R09: 0000000000000000
    Nov 16 16:26:45 kernel: R10: 0000000000000000 R11: ffffc9000019cff8 R12: ffff8888209aac80
    Nov 16 16:26:45 kernel: R13: ffffc90000cabb98 R14: ffff8888209aac90 R15: ffff8888209aa740
    Nov 16 16:26:45 kernel: FS: 00007fdb08868c00(0000) GS:ffff888820980000(0000) knlGS:0000000000000000
    Nov 16 16:26:45 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Nov 16 16:26:45 kernel: CR2: 00007fdb060e8000 CR3: 0000000184474005 CR4: 00000000001706e0
    Nov 16 16:26:45 kernel: Call Trace:
    Nov 16 16:26:45 kernel: <IRQ>
    Nov 16 16:26:45 kernel: ? rcu_sched_clock_irq+0xb2c/0x1120
    Nov 16 16:26:45 kernel: ? __warn+0x81/0x130
    Nov 16 16:26:45 kernel: ? rcu_sched_clock_irq+0xb2c/0x1120
    Nov 16 16:26:45 kernel: ? report_bug+0x171/0x1a0
    Nov 16 16:26:45 kernel: ? handle_bug+0x3c/0x80
    Nov 16 16:26:45 kernel: ? exc_invalid_op+0x17/0x70
    Nov 16 16:26:45 kernel: ? asm_exc_invalid_op+0x1a/0x20
    Nov 16 16:26:45 kernel: ? rcu_sched_clock_irq+0xb2c/0x1120
    Nov 16 16:26:45 kernel: ? load_balance+0x2e9/0xed0
    Nov 16 16:26:45 kernel: ? reweight_entity+0x273/0x280
    Nov 16 16:26:45 kernel: ? update_load_avg+0x7e/0x780
    Nov 16 16:26:45 kernel: update_process_times+0x5f/0x90
    Nov 16 16:26:45 kernel: tick_sched_handle+0x21/0x60
    Nov 16 16:26:45 kernel: tick_sched_timer+0x6f/0x90
    Nov 16 16:26:45 kernel: ? __pfx_tick_sched_timer+0x10/0x10
    Nov 16 16:26:45 kernel: __hrtimer_run_queues+0x112/0x2b0
    Nov 16 16:26:45 kernel: hrtimer_interrupt+0xf8/0x230
    Nov 16 16:26:45 kernel: __sysvec_apic_timer_interrupt+0x50/0x140
    Nov 16 16:26:45 kernel: sysvec_apic_timer_interrupt+0x6d/0x90
    Nov 16 16:26:45 kernel: </IRQ>
    Nov 16 16:26:45 kernel: <TASK>
    Nov 16 16:26:45 kernel: asm_sysvec_apic_timer_interrupt+0x1a/0x20
    Nov 16 16:26:45 kernel: RIP: 0010:rep_movs_alternative+0x4a/0x70
    Nov 16 16:26:45 kernel: Code: 75 f1 c3 cc cc cc cc 66 0f 1f 84 00 00 00 00 00 48 8b 06 48 89 07 48 83 c6 08 48 83 c7 08 83 e9 08 74 df 83 f9 08 73 e8 eb c9>
    Nov 16 16:26:45 kernel: RSP: 0018:ffffc90000cabc48 EFLAGS: 00010206
    Nov 16 16:26:45 kernel: RAX: 00007fdb060e9010 RBX: 0000000000001000 RCX: 00000000000005e0
    Nov 16 16:26:45 kernel: RDX: 0000000000000000 RSI: ffff8883221dca20 RDI: 00007fdb060e8a30
    Nov 16 16:26:45 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000135e000
    Nov 16 16:26:45 kernel: R10: 000000000000000f R11: 000000000135e000 R12: ffffc90000cabe18
    Nov 16 16:26:45 kernel: R13: 0000000000001000 R14: ffff8883221dc000 R15: 0000000000000000
    Nov 16 16:26:45 kernel: copyout+0x20/0x30
    Nov 16 16:26:45 kernel: _copy_to_iter+0x5e/0x4a0
    Nov 16 16:26:45 kernel: copy_page_to_iter+0x8b/0x140
    Nov 16 16:26:45 kernel: filemap_read+0x1af/0x320
    Nov 16 16:26:45 kernel: vfs_read+0x1b8/0x300
    Nov 16 16:26:45 kernel: ksys_read+0x67/0xe0
    Nov 16 16:26:45 kernel: do_syscall_64+0x60/0x90
    Nov 16 16:26:45 kernel: ? do_user_addr_fault+0x20f/0x660
    Nov 16 16:26:45 kernel: ? exc_page_fault+0x71/0x160
    Nov 16 16:26:45 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Nov 16 16:26:45 kernel: RIP: 0033:0x7fdb0830a3bc
    Nov 16 16:26:45 kernel: Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b7 18 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05>
    Nov 16 16:26:45 kernel: RSP: 002b:00007fff1393dc10 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    Nov 16 16:26:45 kernel: RAX: ffffffffffffffda RBX: 0000000000553f88 RCX: 00007fdb0830a3bc
    Nov 16 16:26:45 kernel: RDX: 0000000000553f88 RSI: 00007fdb060aa010 RDI: 000000000000004c
    Nov 16 16:26:45 kernel: RBP: 000055754832d8c0 R08: 0000000000000000 R09: 0000000000000000
    Nov 16 16:26:45 kernel: R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000553f88
    Nov 16 16:26:45 kernel: R13: 0000000000000027 R14: 00007fdb060aa010 R15: 0000000000000001
    Nov 16 16:26:45 kernel: </TASK>
    Nov 16 16:26:45 kernel: ---[ end trace 0000000000000000 ]---

If I boot using kernel 6.5.9.1, shutting down the same VM does not cause those issues.

This journal entry also made me think it is a kernel issue:

    RIP: 0010:rcu_sched_clock_irq+0xb2c/0x1120

I was thinking it was the 6.6 kernel because it is supposed to include a new CPU scheduler (EEVDF) which promises to improve performance and reduce latency, and those messages, especially the RIP line, sound like they might be related to that.

Anybody else using VMware 17.5 with kernel 6.6.1-1.1?

mkubecek commented 10 months ago

You forgot to mention what source you used to build the modules. As I have actually seen this warning before, I suspect that it was either unpatched source from VMware or an older snapshot of the workstation-17.5.0 branch without commit 4c2a103fd2d7 ("vmmon: use get_user_pages to get page PFN").

(I also reported this issue on the VMware Communities website but nobody seems to care.)

JoeSalmeri commented 10 months ago

SORRY, my bad!

I stopped bothering with VMware Communities a while back because, as you said, nobody seems to care.

Since Tumbleweed is a rolling distro, I usually try the source modules provided by VMware first, and if they have an issue, I replace them with your modules. (THANKS for maintaining them!)

Looking at my notes, I see that when I updated to kernel 6.5.9.1 I also updated to VMware 17.5.0 at the same time, and when I did that I switched back to the VMware modules, which worked (after I signed them).

The latest TW release now has the new 6.6.1-1.1 kernel, and that's where the problem happened. I'll pull your latest 17.5 modules, see if that resolves it, and report back.

THANK YOU !

JoeSalmeri commented 10 months ago

OK, I got the latest workstation-17.5.0 modules, compiled them, and signed them.

The VM (Win10, in case it matters) comes up and seems to work fine, just like before. But with the latest workstation-17.5.0 modules, instead of the errors above (which occurred when I shut down the VM and forced me to reboot to recover), shutting down the VM now coredumps vmplayer, but does not hang Linux or force me to reboot.

Here are the systemd journal entries:

    Nov 19 09:21:16 vmnetBridge[5293]: RTM_NEWLINK: name:eno1 index:2 flags:0x00011043
    Nov 19 09:21:16 kernel: e1000e 0000:00:19.0 eno1: entered promiscuous mode
    Nov 19 09:21:16 kernel: bridge-eno1: enabled promiscuous mode
    Nov 19 09:21:16 kernel: Lockdown: vmx-vcpu-0: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
    Nov 19 09:22:40 vmnetBridge[5293]: RTM_NEWLINK: name:eno1 index:2 flags:0x00011043
    Nov 19 09:22:40 kernel: e1000e 0000:00:19.0 eno1: left promiscuous mode
    Nov 19 09:22:40 kernel: bridge-eno1: disabled promiscuous mode
    Nov 19 09:22:41 plasmashell[5672]: Unexpected signal: 11.
    Nov 19 09:22:41 systemd[1]: Started Process Core Dump (PID 5841/UID 0).
    ░░ Subject: A start job for unit systemd-coredump@2-5841-0.service has finished successfully
    ░░ Defined-By: systemd
    ░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
    ░░
    ░░ A start job for unit systemd-coredump@2-5841-0.service has finished successfully.
    ░░
    ░░ The job identifier is 4234.
    Nov 19 09:22:41 systemd-coredump[5842]: Process 5840 (vmplayer) of user 1000 dumped core.

                                                           Module libcds.so without build-id.
                                                           Stack trace of thread 5840:
                                                           #0  0x00007f1d221161bd syscall (libc.so.6 + 0x1161bd)
                                                           #1  0x00007f1d1e24d723 n/a (libvmwarebase.so + 0x24d723)
                                                           #2  0x00007f1d1e24da59 n/a (libvmwarebase.so + 0x24da59)
                                                           #3  0x00007f1d1e153993 Panic_Panic (libvmwarebase.so + 0x153993)
                                                           #4  0x00007f1d1e153a2c Panic (libvmwarebase.so + 0x153a2c)
                                                           #5  0x00007f1d1e24c510 n/a (libvmwarebase.so + 0x24c510)
                                                           #6  0x00007f1d1e24d337 n/a (libvmwarebase.so + 0x24d337)
                                                           #7  0x00007f1d2203f190 __restore_rt (libc.so.6 + 0x3f190)
                                                           #8  0x00007f1d1d060fa0 _ZNK3cui3MKS22GetGuestTopologyLimitsERjS1_S1_S1_Rc (libvmwareui.so + 0x1060fa0)
                                                           #9  0x00007f1d1cf8c67d _ZN3cui19IsTopologySupportedERKNS_2VMERKSt6vectorINS_4RectESaIS4_EERbS9_ (libvmwareui.so + 0xf8c67d)
                                                           #10 0x00007f1d1cd4ef63 _ZNK3cui13FullscreenMgr18CompatibleTopologyEPKNS_2VMERKSt6vectorIjSaIjEERbS9_ (libvmwareui.so + 0xd4ef63)
                                                           #11 0x00007f1d1cd4ff88 _ZN3cui13FullscreenMgr11CanMultiMonEPNS_2VMEPSt6vectorIN3utf6stringESaIS5_EEb (libvmwareui.so + 0xd4ff88)
                                                           #12 0x00007f1d1d40cc4a _ZN3lui13FullscreenMgr11CanMultiMonEPN3cui2VMEPSt6vectorIN3utf6stringESaIS6_EEb (libvmwareui.so + 0x140cc4a)
                                                           #13 0x00007f1d2159fb1b _ZN6player6Window16UpdateAddMonitorEv (libvmplayer.so + 0x11eb1b)
                                                           #14 0x00007f1d1cfa971d _ZNK3cui10Capability8EvaluateEv (libvmwareui.so + 0xfa971d)
                                                           #15 0x00007f1d1cfa9947 _ZN3cui10Capability18OnTestDisconnectedEPv (libvmwareui.so + 0xfa9947)
                                                           #16 0x00007f1d1cd30e11 _ZN3cui16ClearConnectionsISt4listIN4sigc10connectionESaIS3_EEEEvRT_ (libvmwareui.so + 0xd30e11)
                                                           #17 0x00007f1d1cea6ef7 _ZN3cui14VMCapabilities10ConnectMKSEv (libvmwareui.so + 0xea6ef7)
                                                           #18 0x00007f1d1ce73aaa _ZN3cui2VM8UnsetMKSEPNS_3MKSE (libvmwareui.so + 0xe73aaa)
                                                           #19 0x00007f1d1d45f0dd _ZN3lui2VM8UnsetMKSEPN3cui3MKSE (libvmwareui.so + 0x145f0dd)
                                                           #20 0x00007f1d215679ed _ZN6player6Player7CloseVMEN4sigc4slotIvbRKN3cui5ErrorENS1_3nilES7_S7_S7_S7_EENS2_IvS7_S7_S7_S7_S7_S7_S7_EE (libvmplayer.so + 0xe69ed)
                                                           #21 0x00007f1d2156f4a2 _ZN4sigc8internal10slot_call2INS_18bound_mem_functor2IvN6player6PlayerENS_4slotIvbRKN3cui5ErrorENS_3nilESA_SA_SA_SA_EENS5_IvSA_SA_SA_SA_SA_SA_SA_EEEEvSB_SC_E7call_itEPNS0_8slot_repERKSB_RKSC_ (libvmplayer.so + 0xee4a2)
                                                           #22 0x00007f1d1d0f3faa _ZN3cui15LoggedSlotChain11SlotWrapperEN4sigc4slotIvbRKNS_5ErrorENS1_3nilES6_S6_S6_S6_EENS2_IvS6_S6_S6_S6_S6_S6_S6_EERKN3utf6stringENS2_IvS7_S8_S6_S6_S6_S6_S6_EE (libvmwareui.so + 0x10f3faa)
                                                           #23 0x00007f1d1d0f4b4b _ZN4sigc8internal10slot_call2INS_12bind_functorILin1ENS_18bound_mem_functor4IvN3cui15LoggedSlotChainENS_4slotIvbRKNS4_5ErrorENS_3nilESA_SA_SA_SA_EENS6_IvSA_SA_SA_SA_SA_SA_SA_EERKN3utf6stringENS6_IvSB_SC_SA_SA_SA_SA_SA_EEEESE_SH_SA_SA_SA_SA_SA_EEvSB_SC_E7call_itEPNS0_8slot_repERKSB_RKSC_ (libvmwareui.so + 0x10f4b4b)
                                                           #24 0x00007f1d1cfb34b8 _ZN3cui9SlotChain8NextSlotEj (libvmwareui.so + 0xfb34b8)
                                                           #25 0x00007f1d2156c19a _ZN4sigc8internal10slot_call0INS_19bind_return_functorIbNS_4slotIvNS_3nilES4_S4_S4_S4_S4_S4_EEEEbE7call_itEPNS0_8slot_repE (libvmplayer.so + 0xeb19a)
                                                           #26 0x00007f1d2104a93d n/a (libglibmm-2.4.so.1 + 0x4a93d)
                                                           #27 0x00007f1d21397924 n/a (libglib-2.0.so.0 + 0x5e924)
                                                           #28 0x00007f1d21394f30 n/a (libglib-2.0.so.0 + 0x5bf30)
                                                           #29 0x00007f1d21396b58 n/a (libglib-2.0.so.0 + 0x5db58)
                                                           #30 0x00007f1d2139742f g_main_loop_run (libglib-2.0.so.0 + 0x5e42f)
                                                           #31 0x00007f1d201f6c2d gtk_main (libgtk-3.so.0 + 0x1f6c2d)
                                                           #32 0x00007f1d21531fea main (libvmplayer.so + 0xb0fea)
                                                           #33 0x000055df0d42fa50 n/a (appLoader + 0x1ca50)
                                                           #34 0x000055df0d42bba0 n/a (appLoader + 0x18ba0)
                                                           #35 0x00007f1d220281b0 __libc_start_call_main (libc.so.6 + 0x281b0)
                                                           #36 0x00007f1d22028279 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x28279)
                                                           #37 0x000055df0d42c045 n/a (appLoader + 0x19045)
                                                           ELF object binary architecture: AMD x86-64

    ░░ Subject: Process 5840 (vmplayer) dumped core
    ░░ Defined-By: systemd
    ░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
    ░░ Documentation: man:core(5)
    ░░
    ░░ Process 5840 (vmplayer) crashed and dumped core.
    ░░
    ░░ This usually indicates a programming error in the crashing program and
    ░░ should be reported to its vendor as a bug.
    Nov 19 09:22:41 systemd[1]: systemd-coredump@2-5841-0.service: Deactivated successfully.
    ░░ Subject: Unit succeeded
    ░░ Defined-By: systemd
    ░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
    ░░
    ░░ The unit systemd-coredump@2-5841-0.service has successfully entered the 'dead' state.
    Nov 19 09:22:41 plasmashell[5672]: VMware Player Error:
    Nov 19 09:22:41 plasmashell[5672]: VMware Player unrecoverable error: (vmplayer)
    Nov 19 09:22:41 plasmashell[5672]: Unexpected signal: 11.
    Nov 19 09:22:41 plasmashell[5672]: A log file is available in "/tmp/vmware-joe/vmware-vmplayer-5672.log".
    Nov 19 09:22:41 plasmashell[5672]: You can request support.
    Nov 19 09:22:41 plasmashell[5672]: To collect data to submit to VMware technical support, run "vm-support".
    Nov 19 09:22:41 plasmashell[5672]: We will respond on the basis of your support entitlement.
    Nov 19 09:22:41 systemd[1670]: app-vmware\x2dplayer-bb6650976f1b44908e4d7bb4a508213c.scope: Consumed 2min 21.436s CPU time.
    ░░ Subject: Resources consumed by unit runtime
    ░░ Defined-By: systemd
    ░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
    ░░
    ░░ The unit UNIT completed and consumed the indicated resources.

I also saved the "/tmp/vmware-joe/vmware-vmplayer-5672.log" file in case you want to see that too.

mkubecek commented 10 months ago

I was a bit afraid there might be some problem like this. Unfortunately this is a closed source application so there is no way to debug this for anyone except VMware. And VMware won't care until a "supported host operating system" with 6.6+ kernel appears. :-(

On the other hand, based on the function names in the stack trace, it looks more like a problem between the GUI and your desktop environment, i.e. not really related to what the kernel modules are doing.

JoeSalmeri commented 10 months ago

It sucks that they don't consider openSUSE Tumbleweed a supported host operating system, since it is now using the 6.6 kernel.

I'm curious, which function names in the stack trace make it look like a GUI/desktop problem?

FWIW, it also coredumped on the 6.5.9.1 kernel at shutdown. But when 6.6.1.1 was installed, it caused Linux to slowly become unresponsive, forcing a reboot.

With your patch (THANKS!) I've been using 6.6.1.1 for a few days now with no issues other than the coredump at shutdown.

Might be time to reconsider moving to KVM again.

bassman56 commented 10 months ago

Same problem for me: the whole host crashes with messages similar to those above, with VMware 17.5 and both the p17.0.1 and w17.0.2 modules (I replaced pte_offset_map with pte_offset_kernel to make them compile). I had to roll everything back, and I now stay on a Tumbleweed from August, which has the 6.4.11 kernel, for this reason. With 6.4.11 everything still works fine, but I now have a backlog of 4013 packages to update... Any advice? Switch to VirtualBox?

mkubecek commented 10 months ago

I'm curious, which function names in the stack trace make it look like a GUI/desktop problem?

Most of all, frames 10-13, in particular the FullscreenMgr, Window, UpdateAddMonitorEv and CanMultiMonEPN parts.
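
(For reference, the mangled C++ frame names can be decoded with c++filt from binutils, e.g.:)

    # Demangle one of the frames quoted above:
    echo '_ZN6player6Window16UpdateAddMonitorEv' | c++filt
    # -> player::Window::UpdateAddMonitor()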

FWIW, it also coredumped on the 6.5.9.1 kernel at shutdown. But when 6.6.1.1 was installed, it caused Linux to slowly become unresponsive, forcing a reboot.

The coredump with 6.5.9 kernel was with unpatched modules from VMware? That would suggest it's a userspace problem unrelated to these modules.

With your patch (THANKS!) I've been using 6.6.1.1 for a few days now with no issues other than the coredump at shutdown.

Far from perfect but certainly better than paralyzing the whole system as soon as you start a VM.

Might be time to reconsider moving to KVM again.

That's one of the options, sure.

mkubecek commented 10 months ago

Same problem for me: the whole host crashes with messages similar to those above, with VMware 17.5 and both the p17.0.1 and w17.0.2 modules

That's a bad idea; you should use the branch for your VMware version, workstation-17.5.0 in your case. Also, w17.0.2 is a tag marking unpatched module source, i.e. exactly the same as provided by VMware. If you build those, it's the same as not using this repository at all.

I replaced pte_offset_map with pte_offset_kernel to make them compile

Another bad idea, even if that's exactly what VMware decided to do - but that's what this issue is about.

but I now have a backlog of 4013 packages to update

You can always add a lock for the kernel packages (e.g. zypper addlock 'kernel-*') and update the rest.
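
A minimal sketch of that workflow (the lock pattern is just an example):

    sudo zypper addlock 'kernel-*'      # pin all kernel packages
    sudo zypper dup                     # update the rest of Tumbleweed
    sudo zypper locks                   # list active locks
    sudo zypper removelock 'kernel-*'   # release the pin later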

any advice?

Try the updated workstation-17.5.0 branch instead. Unless you run into the same issue as JoeSalmeri (which may not even be related), that's what I would suggest.
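
For reference, a minimal build sketch (the steps follow the repository's INSTALL file; the service restart command may differ per distro):

    git clone https://github.com/mkubecek/vmware-host-modules.git
    cd vmware-host-modules
    git checkout workstation-17.5.0
    make
    sudo make install
    sudo systemctl restart vmware    # or: /etc/init.d/vmware restart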

bassman56 commented 10 months ago

Hi Michal, thanks for your help! I followed your advice: I froze 'kernel-*' and the VirtualBox rpms (I am using VirtualBox and VMware for different legacy machines), then updated 4200+ packages - seemingly successfully. Then I tried to reinstall VMware 17.5.0 (to confirm the correct modules were being used). When starting vmplayer, the module build failed at first. Then I copied /usr/lib/vmware/modules/source/vmmon.tar and vmnet.tar into your environment, replacing the sources of vmmon and vmnet. After make and make install, everything seems to run fine, although the process complained that the same compiler was not used. Anyhow - while still on kernel 6.4.11 I can bring up the VMware virtual machines, so maybe I can keep this setup for a few weeks... until hopefully we have modules which match kernel 6.6.x.

One more question: the vmware service seems to have been installed as /etc/init.d/vmware, and I need to restart it manually after each reboot. How can I get this back to /usr/lib/systemd/system/vmware.service where it actually belongs?
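
One common workaround is a small systemd wrapper unit that calls the SysV script - an untested sketch, with the unit name and paths as assumptions:

    # /etc/systemd/system/vmware.service (sketch)
    [Unit]
    Description=VMware host services (wrapper for the SysV script)
    After=network.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/etc/init.d/vmware start
    ExecStop=/etc/init.d/vmware stop

    [Install]
    WantedBy=multi-user.target

followed by systemctl daemon-reload and systemctl enable vmware.service.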

bassman56 commented 10 months ago

Hi Michal et al. I tried to research further with kernel 6.6.2 and VMware 17.5.0. I created a VirtualBox machine with the latest Tumbleweed (kernel 6.6.2-1-default) and installed VMware 17.5.0 with the related modules from this version. Then I created a small Ubuntu VMware machine. When I start this machine, dmesg (from inside the VirtualBox machine) shows the following log and the Ubuntu VMware machine crashes. Very similar to what JoeSalmeri was reporting.

    [  77.968002] bridge-enp0s3: disabled promiscuous mode
    [ 149.227011] ------------[ cut here ]------------
    [ 149.227019] WARNING: CPU: 3 PID: 2608 at kernel/rcu/tree_exp.h:787 rcu_exp_handler+0x35/0xe0
    [ 149.227031] Modules linked in: bluetooth ecdh_generic snd_seq_dummy snd_hrtimer snd_seq snd_seq_device af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ppdev vmnet(OE) parport_pc parport nf_tables ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat vmmon(OE) nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter vboxnetadp(O) vboxnetflt(O) qrtr vboxdrv(O) intel_rapl_msr intel_rapl_common intel_pmc_core kvm_intel snd_intel8x0 snd_ac97_codec ac97_bus kvm snd_pcm snd_timer tiny_power_button irqbypass pcspkr snd soundcore e1000 i2c_piix4 ac button joydev fuse efi_pstore configfs dmi_sysfs ip_tables x_tables hid_generic usbhid crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul sr_mod cdrom ata_generic ghash_clmulni_intel ata_piix sha512_ssse3 aesni_intel crypto_simd ahci cryptd libahci ohci_pci
    [ 149.227085] ehci_pci vmwgfx ohci_hcd video ehci_hcd vboxguest(O) drm_ttm_helper usbcore libata wmi ttm serio_raw btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq vmw_vsock_vmci_transport vmw_vmci vsock dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua sd_mod t10_pi sg scsi_mod scsi_common msr
    [ 149.227104] CPU: 3 PID: 2608 Comm: vmware-vmx Tainted: G W OE 6.6.2-1-default #1 openSUSE Tumbleweed eca1e326dd99ff90b88e5b469b03d9db59223dcf
    [ 149.227107] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    [ 149.227108] RIP: 0010:rcu_exp_handler+0x35/0xe0
    [ 149.227111] Code: 55 65 48 8b 2c 25 c0 9a 03 00 53 8b 85 34 08 00 00 48 c7 c3 40 b1 03 00 65 48 03 1d 15 64 ca 72 4c 8b 63 18 85 c0 74 0d 7f 58 <0f> 0b 5b 5d 41 5c c3 cc cc cc cc 65 8b 05 c1 04 cc 72 66 85 c0 74
    [ 149.227112] RSP: 0018:ffff9600c0160f98 EFLAGS: 00010082
    [ 149.227114] RAX: 00000000ffffff6e RBX: ffff898b17dbb140 RCX: ffff898b17d3b500
    [ 149.227115] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    [ 149.227116] RBP: ffff898a498e5180 R08: 0000000000000000 R09: 0000000000000000
    [ 149.227117] R10: 0000000000000000 R11: ffff9600c0160ff8 R12: ffffffff8f15d600
    [ 149.227118] R13: ffffffff8d3795c0 R14: 0000000000000000 R15: 0000000000000000
    [ 149.227119] FS: 00007f919f190c00(0000) GS:ffff898b17d80000(0000) knlGS:0000000000000000
    [ 149.227120] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 149.227121] CR2: 00007f919c9fc000 CR3: 00000001a162a001 CR4: 00000000000706e0
    [ 149.227131] Call Trace:
    [ 149.227133] <IRQ>
    [ 149.227134] ? rcu_exp_handler+0x35/0xe0
    [ 149.227136] ? __warn+0x81/0x130
    [ 149.227141] ? rcu_exp_handler+0x35/0xe0
    [ 149.227170] ? report_bug+0x171/0x1a0
    [ 149.227175] ? handle_bug+0x3c/0x80
    [ 149.227178] ? exc_invalid_op+0x17/0x70
    [ 149.227180] ? asm_exc_invalid_op+0x1a/0x20
    [ 149.227184] ? __pfx_rcu_exp_handler+0x10/0x10
    [ 149.227188] ? rcu_exp_handler+0x35/0xe0
    [ 149.227190] flush_smp_call_function_queue+0x10c/0x410
    [ 149.227194] __sysvec_call_function_single+0x1c/0xc0
    [ 149.227198] sysvec_call_function_single+0x6d/0x90
    [ 149.227201] </IRQ>
    [ 149.227201] <TASK>
    [ 149.227202] asm_sysvec_call_function_single+0x1a/0x20
    [ 149.227205] RIP: 0010:_raw_spin_unlock_irq+0x15/0x30
    [ 149.227207] Code: 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 e8 72 00 00 00 90 fb 0f 1f 44 00 00 <65> ff 0d 2c d9 18 72 74 05 c3 cc cc cc cc 0f 1f 44 00 00 c3 cc cc
    [ 149.227208] RSP: 0018:ffff9600c360bc10 EFLAGS: 00000246
    [ 149.227210] RAX: 0000000000000001 RBX: ffffb7dfc6956180 RCX: ffff898b17d80000
    [ 149.227211] RDX: 0000000000000016 RSI: 0000000000000003 RDI: ffff898a12bfac10
    [ 149.227212] RBP: 0000000000000000 R08: 00000000000405c0 R09: 0000000000000017
    [ 149.227213] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000cc0
    [ 149.227213] R13: ffff898a12bfac08 R14: 0000000000000001 R15: 0000000000000001
    [ 149.227216] shmem_add_to_page_cache+0x150/0x2b0
    [ 149.227221] shmem_get_folio_gfp+0x27f/0x710
    [ 149.227223] shmem_fallocate+0x39b/0x520
    [ 149.227227] vfs_fallocate+0x13c/0x360
    [ 149.227230] __x64_sys_fallocate+0x44/0x70
    [ 149.227232] do_syscall_64+0x60/0x90
    [ 149.227234] ? do_futex+0xc6/0x190
    [ 149.227237] ? exit_to_user_mode_prepare+0x142/0x1f0
    [ 149.227239] ? syscall_exit_to_user_mode+0x2b/0x40
    [ 149.227241] ? do_syscall_64+0x6c/0x90
    [ 149.227242] ? do_syscall_64+0x6c/0x90
    [ 149.227243] ? syscall_exit_to_user_mode+0x2b/0x40
    [ 149.227245] ? do_syscall_64+0x6c/0x90
    [ 149.227246] ? exc_page_fault+0x71/0x160
    [ 149.227248] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    [ 149.227250] RIP: 0033:0x7f919ef161bd
    [ 149.227300] Code: 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 33 4c 0d 00 f7 d8 64 89 01 48
    [ 149.227302] RSP: 002b:00007ffe5aefd408 EFLAGS: 00000246 ORIG_RAX: 000000000000011d
    [ 149.227306] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f919ef161bd
    [ 149.227307] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000004d
    [ 149.227308] RBP: 0000000000000000 R08: 0000000000000000 R09: 000055dd0778e5e0
    [ 149.227309] R10: 0000000100000000 R11: 0000000000000246 R12: 000055dd08a06010
    [ 149.227310] R13: 0000000100000000 R14: 0000000100000000 R15: 000000000000004d
    [ 149.227313] </TASK>
    [ 149.227314] ---[ end trace 0000000000000000 ]---
    [ 150.115024] e1000 0000:00:03.0 enp0s3: entered promiscuous mode
    [ 150.115686] bridge-enp0s3: enabled promiscuous mode

Any suggestions on how to investigate further? Thanks!

mkubecek commented 10 months ago

and installed VMware 17.5.0 with the related modules from this version

What exactly does this mean? Unless you mean the current head of the workstation-17.5.0 branch in this repository (i.e. commit 4c2a103fd2d7), you have to report that issue somewhere else.

Please understand that I do not work for VMware and have no special relation or contract with them. Thus I have no more influence on what they ship in their products than any other customer who paid for a single Workstation license (in other words: none). All I'm trying to do is my best to help their users (myself included) work around some deficiencies in their development process.

It may be interesting to investigate further why exactly the pte_offset_kernel() hack results in these warnings, why it only appeared after switching from 6.5 to 6.6, or what makes the openSUSE kernel different from other 6.6-based distribution kernels (different config options?) such that other users with 6.6 kernels are not affected (yet?). But I'm not really an expert on RCU or memory management internals, and the amount of time I can (and want to) devote to this work is limited, so knowing that the get_user_pages approach does not suffer from it is enough for me at the moment.

bassman56 commented 10 months ago

Hi Michal, sorry, my bad. I originally used the vmmon and vmnet which came with VMware 17.5.0. Now I compiled yours (head of the workstation-17.5.0 branch) and installed these inside my VirtualBox. What I can report is that I can start the vmware service and I can start vmplayer, but when I start a very simple VM inside vmplayer (without any operating system) I cannot even reach the BIOS of this VM. The logfile is here: vmware.log. I also tried to gdb the vmmcores file, but gdb complains "file format not recognized". Any suggestion how to proceed?

mkubecek commented 10 months ago

I'm sorry, this is a problem in a closed source userspace application, i.e. something I cannot possibly help you with.

bassman56 commented 10 months ago

Hi JoeSalmeri, mkubecek et al. I can report that I seem to have fixed the above issue for my system. The fix has most likely been a BIOS update. My system runs on an ASUS Z170A, which previously had BIOS version 1602 from 2016. I have now updated to BIOS version 3802 from March 15, 2018 - seemingly the latest available for this board. After updating Tumbleweed (everything but the kernel), I released the kernel lock and updated the kernel to 6.6.2-1-default. Then, using vmware-host-modules workstation-17.5.0 (make clean, make, make install, systemctl start vmware), everything seems fine now! I can start my two VMware virtual machines (one Linux, one Windows) without any visible issues. Good luck to Joe now!

ja-jaa-org-uk commented 9 months ago

2023_12_05-13.10

Just to confirm Linux host lockups on Fedora 39 with 6.6.x kernels:

    [root@meon:/boot]$ ltr|grep vml
    -rwxr-xr-x. 1 root root 14560456 Nov  8 00:00 vmlinuz-6.5.11-300.fc39.x86_64
    -rwxr-xr-x. 1 root root 14540552 Nov 20 00:00 vmlinuz-6.5.12-300.fc39.x86_64
    -rwxr-xr-x. 1 root root 14661960 Nov 22 00:00 vmlinuz-6.6.2-201.fc39.x86_64
    -rwxr-xr-x. 1 root root 14662792 Nov 28 00:00 vmlinuz-6.6.3-200.fc39.x86_64

Currently running this kernel, which works fine:

    [root@meon:/boot]$ uname -a
    Linux meon.jaa.org.uk 6.5.12-300.fc39.x86_64

Failing kernels: vmlinuz-6.6.2-201.fc39.x86_64 and vmlinuz-6.6.3-200.fc39.x86_64.

VMware runs OK, but as soon as an attempt is made to start a virtual machine, a physical power-off is required to restore control of the host machine.

Both Windows 11 and Ubuntu 22.04 guests cause the crash.

John

rakotomandimby commented 9 months ago

@ja-jaa-org-uk, I recommend also reporting the problem here: https://communities.vmware.com/t5/VMware-Workstation-Pro/Ubuntu-22-04-freezes-randomly-on-VMWare-Professional-17/td-p/2942773/page/3

sonarpm commented 9 months ago

Hi,

Just FYI: I had this same issue; I updated the BIOS and used these modules, and it's working fine.

Pop!_OS, kernel 6.6.6, Workstation 17.5.0.

OldManRising commented 9 months ago

Hi everybody,

Same situation over here: RCU CPU stalls every time I powered up a VM with Workstation 17.5.0. Anyway, moving to kernel 6.6.6-1, applying the latest BIOS update for my ASUS Z170 PRO GAMING, and compiling the modules again seems to have solved the issue. Running for 24 hours now without trouble.

ja-jaa-org-uk commented 9 months ago

Thanks for the heads up. Beelink GTR7 Pro, Ryzen 9 7940HS. Very quick test on Fedora 39, kernel 6.6.7; seems OK for Ubuntu 22 and Windows 11 guests.

    ja@meon GitHub 2$ uname -a
    Linux meon.jaa.org.uk 6.6.7-200.fc39.x86_64

    /global/db/sw/VMware_17/mkubeck_17.5.0

John

dragnev-dev commented 8 months ago

Experiencing the same issue: Fedora 39, Windows 10 guest, kernel 6.6.9, BIOS up to date.

Journals:

    Jan 19 00:01:01 kernel: /dev/vmmon[7461]: PTSC: initialized at 2688000000 Hz using TSC, TSCs are synchronized.
    Jan 19 00:01:01 kernel: /dev/vmmon[7461]: Monitor IPI vector: 0
    Jan 19 00:01:01 kernel: /dev/vmmon[7461]: HV IPI vector: 0
    Jan 19 00:01:01 kernel: ------------[ cut here ]------------
    Jan 19 00:01:01 kernel: WARNING: CPU: 4 PID: 7461 at kernel/rcu/tree_plugin.h:734 rcu_sched_clock_irq+0xb7e/0x1130
    Jan 19 00:01:01 kernel: Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device nvidia_drm(POE) nvidia_modeset(POE) nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 >
    Jan 19 00:01:01 kernel: snd_compress snd_hda_codec_realtek ac97_bus intel_pmc_bxt snd_hda_scodec_cs35l41_spi libarc4 snd_hda_codec_generic mei_hdcp mei_pxp iTCO_vendor_support snd_pcm_dmaengine uvcvideo kvm snd_hda_codec_hdmi b>
    Jan 19 00:01:01 kernel: dm_crypt nvme nvme_core nvme_common i915 i2c_algo_bit rtsx_pci_sdmmc crct10dif_pclmul drm_buddy crc32_pclmul ttm crc32c_intel polyval_clmulni mmc_core hid_asus polyval_generic drm_display_helper asus_wmi>
    Jan 19 00:01:01 kernel: CPU: 4 PID: 7461 Comm: vmware-vmx Tainted: P D OE 6.6.11-200.fc39.x86_64 #1
    Jan 19 00:01:01 kernel: Hardware name: ASUSTeK COMPUTER INC. ROG G16 GU603ZU_GU603ZU/GU603ZU, BIOS GU603ZU.313 06/27/2023
    Jan 19 00:01:01 kernel: RIP: 0010:rcu_sched_clock_irq+0xb7e/0x1130
    Jan 19 00:01:01 kernel: Code: 38 08 00 00 85 c0 0f 84 ab f5 ff ff e9 b7 fc ff ff c6 87 39 08 00 00 01 e9 9a f5 ff ff 48 89 ef e8 27 a7 f3 ff e9 10 ff ff ff <0f> 0b e9 2e f5 ff ff be 03 00 00 00 e8 f1 d6 63 00 e9 fa fe ff ff
    Jan 19 00:01:01 kernel: RSP: 0018:ffffc900002acdf0 EFLAGS: 00010082
    Jan 19 00:01:01 kernel: RAX: ffff888175da2900 RBX: 0000000000000000 RCX: 000000000c550a58
    Jan 19 00:01:01 kernel: RDX: 00000000ffffffc2 RSI: ffffffffbc889732 RDI: ffff888175da2900
    Jan 19 00:01:01 kernel: RBP: ffff888c80322280 R08: 0000000000000000 R09: 0000000000000000
    Jan 19 00:01:01 kernel: R10: 0000000000000000 R11: ffffc900002acff8 R12: ffff888c80324d00
    Jan 19 00:01:01 kernel: R13: ffffc9003440f8b8 R14: ffff888c80324d10 R15: ffff888c803247c0
    Jan 19 00:01:01 kernel: FS: 00007f3ac2ed3c00(0000) GS:ffff888c80300000(0000) knlGS:0000000000000000
    Jan 19 00:01:01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jan 19 00:01:01 kernel: CR2: 00007f3ab3218000 CR3: 00000002899c8000 CR4: 0000000000f50ee0
    Jan 19 00:01:01 kernel: PKRU: 55555554
    Jan 19 00:01:01 kernel: Call Trace:
    Jan 19 00:01:01 kernel: <IRQ>
    Jan 19 00:01:01 kernel: ? rcu_sched_clock_irq+0xb7e/0x1130
    Jan 19 00:01:01 kernel: ? __warn+0x81/0x130
    Jan 19 00:01:01 kernel: ? rcu_sched_clock_irq+0xb7e/0x1130
    Jan 19 00:01:01 kernel: ? report_bug+0x171/0x1a0
    Jan 19 00:01:01 kernel: ? handle_bug+0x3c/0x80
    Jan 19 00:01:01 kernel: ? exc_invalid_op+0x17/0x70
    Jan 19 00:01:01 kernel: ? asm_exc_invalid_op+0x1a/0x20
    Jan 19 00:01:01 kernel: ? rcu_sched_clock_irq+0xb7e/0x1130
    Jan 19 00:01:01 kernel: ? timekeeping_update+0xdd/0x130
    Jan 19 00:01:01 kernel: ? timekeeping_advance+0x377/0x590
    Jan 19 00:01:01 kernel: update_process_times+0x74/0xb0
    Jan 19 00:01:01 kernel: tick_sched_handle+0x21/0x60
    Jan 19 00:01:01 kernel: tick_sched_timer+0x6f/0x90
    Jan 19 00:01:01 kernel: ? __pfx_tick_sched_timer+0x10/0x10
    Jan 19 00:01:01 kernel: __hrtimer_run_queues+0x10f/0x2b0
    Jan 19 00:01:01 kernel: hrtimer_interrupt+0xf8/0x230
    Jan 19 00:01:01 kernel: __sysvec_apic_timer_interrupt+0x4d/0x140
    Jan 19 00:01:01 kernel: sysvec_apic_timer_interrupt+0x6d/0x90
    Jan 19 00:01:01 kernel: </IRQ>
    Jan 19 00:01:01 kernel: <TASK>
    Jan 19 00:01:01 kernel: asm_sysvec_apic_timer_interrupt+0x1a/0x20
    Jan 19 00:01:01 kernel: RIP: 0010:__folio_throttle_swaprate+0x4/0xe0
    Jan 19 00:01:01 kernel: Code: b0 ff ff 83 f8 f4 74 dd 31 c0 5b c3 cc cc cc cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 <0f> 1f 44 00 00 83 e6 40 55 53 48 8b 1f 75 07 5b 5d c3 cc cc cc cc
    Jan 19 00:01:01 kernel: RSP: 0018:ffffc9003440f960 EFLAGS: 00000246
    Jan 19 00:01:01 kernel: RAX: 0000000000000000 RBX: ffffc9003440f9c8 RCX: 0000000000003543
    Jan 19 00:01:01 kernel: RDX: ffff888175da2900 RSI: 0000000000000cc0 RDI: ffffea0009066380
    Jan 19 00:01:01 kernel: RBP: 0000000000000001 R08: ffff888c8033a4c0 R09: ffff888c8033a4e8
    Jan 19 00:01:01 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88817b6d31f8
    Jan 19 00:01:01 kernel: R13: ffffea0009066380 R14: 0000000000000000 R15: ffff888118bb8750
    Jan 19 00:01:01 kernel: do_anonymous_page+0xc2/0x3b0
    Jan 19 00:01:01 kernel: __handle_mm_fault+0xbe6/0xd90
    Jan 19 00:01:01 kernel: handle_mm_fault+0x17f/0x360
    Jan 19 00:01:01 kernel: do_user_addr_fault+0x1ed/0x660
    Jan 19 00:01:01 kernel: exc_page_fault+0x7f/0x180
    Jan 19 00:01:01 kernel: asm_exc_page_fault+0x26/0x30
    Jan 19 00:01:01 kernel: RIP: 0010:copyout+0x1b/0x30
    Jan 19 00:01:01 kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 d0 48 89 d1 31 d2 48 01 f8 0f 92 c2 48 85 c0 78 10 48 85 d2 75 0b 0f 01 cb a4 0f 1f 00 0f 01 ca 89 c8 c3 cc cc cc cc 66 0f 1f 44 00 00 90
    Jan 19 00:01:01 kernel: RSP: 0018:ffffc9003440fbc8 EFLAGS: 00050246
    Jan 19 00:01:01 kernel: RAX: 00007f3ab3218010 RBX: 0000000000001000 RCX: 0000000000000010
    Jan 19 00:01:01 kernel: RDX: 0000000000000000 RSI: ffff88820462eff0 RDI: 00007f3ab3218000
    Jan 19 00:01:01 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000132b000
    Jan 19 00:01:01 kernel: R10: 000000000000000f R11: 000000000132b000 R12: ffffc9003440fda0
    Jan 19 00:01:01 kernel: R13: 0000000000001000 R14: ffff88820462e000 R15: 0000000000000000
    Jan 19 00:01:01 kernel: _copy_to_iter+0x5e/0x4a0
    Jan 19 00:01:01 kernel: ? get_page_from_freelist+0x15ee/0x1760
    Jan 19 00:01:01 kernel: copy_page_to_iter+0x8b/0x140
    Jan 19 00:01:01 kernel: filemap_read+0x1cd/0x350
    Jan 19 00:01:01 kernel: vfs_read+0x1fe/0x350
    Jan 19 00:01:01 kernel: ksys_read+0x6f/0xf0
    Jan 19 00:01:01 kernel: do_syscall_64+0x5d/0x90
    Jan 19 00:01:01 kernel: ? __count_memcg_events+0x42/0x90
    Jan 19 00:01:01 kernel: ? __fget_light+0x99/0x100
    Jan 19 00:01:01 kernel: ? ksys_lseek+0x89/0xb0
    Jan 19 00:01:01 kernel: ? syscall_exit_to_user_mode+0x2b/0x40
    Jan 19 00:01:01 kernel: ? do_syscall_64+0x6c/0x90
    Jan 19 00:01:01 kernel: ? exc_page_fault+0x7f/0x180
    Jan 19 00:01:01 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
    Jan 19 00:01:01 kernel: RIP: 0033:0x7f3ac2b2619a
    Jan 19 00:01:01 kernel: Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 88 28 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 e2 28 f8 ff 48 8b
    Jan 19 00:01:01 kernel: RSP: 002b:00007fff268b6670 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    Jan 19 00:01:01 kernel: RAX: ffffffffffffffda RBX: 0000000000553f88 RCX: 00007f3ac2b2619a
    Jan 19 00:01:01 kernel: RDX: 0000000000553f88 RSI: 00007f3ab320c010 RDI: 000000000000004f
    Jan 19 00:01:01 kernel: RBP: 00007fff268b6690 R08: 0000000000000000 R09: 0000000000000000
    Jan 19 00:01:01 kernel: R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000553f88
    Jan 19 00:01:01 kernel: R13: 0000000000000027 R14: 00007f3ab320c010 R15: 0000000000000001
    Jan 19 00:01:01 kernel: </TASK>
    Jan 19 00:01:01 kernel: ---[ end trace 0000000000000000 ]---
    Jan 19 00:01:07 abrt-dump-journal-oops[1980]: abrt-dump-journal-oops: Found oopses: 1
    Jan 19 00:01:07 abrt-dump-journal-oops[1980]: abrt-dump-journal-oops: Creating problem directories
    Jan 19 00:01:07 abrt-server[7859]: Can't find a meaningful backtrace for hashing in '.'
    Jan 19 00:01:07 abrt-server[7859]: Deleting non-reportable oops '.' because DropNotReportableOopses is set to 'yes'
    Jan 19 00:01:07 abrt-server[7859]: 'post-create' on '/var/spool/abrt/oops-2024-01-19-10:22:07-1980-0' exited with 1
    Jan 19 00:01:07 abrt-server[7859]: Deleting problem directory '/var/spool/abrt/oops-2024-01-19-10:22:07-1980-0'
    Jan 19 00:01:07 abrt-server[7859]: Lock file '.lock' was locked by process 7875, but it crashed?
    Jan 19 00:01:08 abrt-dump-journal-oops[1980]: Reported 1 kernel oopses to Abrt
    Jan 19 00:01:32 kernel: x86/split lock detection: #AC: vmx-vcpu-3/7500 took a split_lock trap at address: 0x5563a5b0fa6a
    Jan 19 00:01:37 kernel: x86/split lock detection: #AC: vmx-vcpu-2/7499 took a split_lock trap at address: 0x5563a5b0fa6a
    Jan 19 00:01:38 kernel: x86/split lock detection: #AC: vmx-vcpu-1/7498 took a split_lock trap at address: 0x5563a5b0fa6a
    Jan 19 00:01:39 kernel: x86/split lock detection: #AC: vmx-vcpu-0/7497 took a split_lock trap at address: 0x5563a5b0fa6a
    Jan 19 00:01:51 systemd[1]: systemd-timedated.service: start operation timed out. Terminating.
    Jan 19 00:02:07 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 5-...D } 61172 jiffies s: 1909 root: 0x1/.
    Jan 19 00:02:07 kernel: rcu: blocking rcu_node structures (internal RCU debug): l=1:0-9:0x20/.
    Jan 19 00:02:07 kernel: Sending NMI from CPU 1 to CPUs 5:
    Jan 19 00:02:07 kernel: NMI backtrace for cpu 5 skipped: idling at intel_idle+0x62/0xb0

rakotomandimby commented 8 months ago

The fix for missing prototypes in https://github.com/mkubecek/vmware-host-modules/commit/2c6d66f3f1947384038b765c897b102ecdb18298 seems to have solved several issues. I recommend everyone upgrade.

JoeSalmeri commented 8 months ago

Updated information from when I originally reported this.

I am now running TW 20231228 and using kernel 6.6.7-1.

I just downloaded the latest 17.5.0 modules with the fixes discussed above, recompiled and signed them, and tested VMware.

The service starts fine and the VM comes up and appears to work fine, but on shutdown vmplayer coredumps. It does not crash Linux or seem to cause any other issues; however, the resulting journal errors are different now.

Jan 24 14:04:55 Server systemd-coredump[27317]: Process 27315 (vmplayer) of user 1000 dumped core.

Module libcds.so without build-id.
Stack trace of thread 27315:
#0  0x00007fe93c7161bd syscall (libc.so.6 + 0x1161bd)
#1  0x00007fe93884d723 n/a (libvmwarebase.so + 0x24d723)
#2  0x00007fe93884da59 n/a (libvmwarebase.so + 0x24da59)
#3  0x00007fe938753993 Panic_Panic (libvmwarebase.so + 0x153993)
#4  0x00007fe938753a2c Panic (libvmwarebase.so + 0x153a2c)
#5  0x00007fe93884c510 n/a (libvmwarebase.so + 0x24c510)
#6  0x00007fe93884d337 n/a (libvmwarebase.so + 0x24d337)
#7  0x00007fe93c63f190 __restore_rt (libc.so.6 + 0x3f190)
#8  0x00007fe937660fa0 _ZNK3cui3MKS22GetGuestTopologyLimitsERjS1_S1_S1_Rc (libvmwareui.so + 0x1060fa0)
#9  0x00007fe93758c67d _ZN3cui19IsTopologySupportedERKNS_2VMERKSt6vectorINS_4RectESaIS4_EERbS9_ (libvmwareui.so + 0xf8c67d)
#10 0x00007fe93734ef63 _ZNK3cui13FullscreenMgr18CompatibleTopologyEPKNS_2VMERKSt6vectorIjSaIjEERbS9_ (libvmwareui.so + 0xd4ef63)
#11 0x00007fe93734ff88 _ZN3cui13FullscreenMgr11CanMultiMonEPNS_2VMEPSt6vectorIN3utf6stringESaIS5_EEb (libvmwareui.so + 0xd4ff88)
#12 0x00007fe937a0cc4a _ZN3lui13FullscreenMgr11CanMultiMonEPN3cui2VMEPSt6vectorIN3utf6stringESaIS6_EEb (libvmwareui.so + 0x140cc4a)
#13 0x00007fe93bb9fb1b _ZN6player6Window16UpdateAddMonitorEv (libvmplayer.so + 0x11eb1b)
#14 0x00007fe9375a971d _ZNK3cui10Capability8EvaluateEv (libvmwareui.so + 0xfa971d)
#15 0x00007fe9375a9947 _ZN3cui10Capability18OnTestDisconnectedEPv (libvmwareui.so + 0xfa9947)
#16 0x00007fe937330e11 _ZN3cui16ClearConnectionsISt4listIN4sigc10connectionESaIS3_EEEEvRT_ (libvmwareui.so + 0xd30e11)
#17 0x00007fe9374a6ef7 _ZN3cui14VMCapabilities10ConnectMKSEv (libvmwareui.so + 0xea6ef7)
#18 0x00007fe937473aaa _ZN3cui2VM8UnsetMKSEPNS_3MKSE (libvmwareui.so + 0xe73aaa)
#19 0x00007fe937a5f0dd _ZN3lui2VM8UnsetMKSEPN3cui3MKSE (libvmwareui.so + 0x145f0dd)
#20 0x00007fe93bb679ed _ZN6player6Player7CloseVMEN4sigc4slotIvbRKN3cui5ErrorENS1_3nilES7_S7_S7_S7_EENS2_IvS7_S7_S7_S7_S7_S7_S7_EE (libvmplayer.so + 0xe69ed)
#21 0x00007fe93bb6f4a2 _ZN4sigc8internal10slot_call2INS_18bound_mem_functor2IvN6player6PlayerENS_4slotIvbRKN3cui5ErrorENS_3nilESA_SA_SA_SA_EENS5_IvSA_SA_SA_SA_SA_SA_SA_EEEEvSB_SC_E7call_itEPNS0_8slot_repERKSB_RKSC_ (libvmplayer.so + 0xee4a2)
#22 0x00007fe9376f3faa _ZN3cui15LoggedSlotChain11SlotWrapperEN4sigc4slotIvbRKNS_5ErrorENS1_3nilES6_S6_S6_S6_EENS2_IvS6_S6_S6_S6_S6_S6_S6_EERKN3utf6stringENS2_IvS7_S8_S6_S6_S6_S6_S6_EE (libvmwareui.so + 0x10f3faa)
#23 0x00007fe9376f4b4b _ZN4sigc8internal10slot_call2INS_12bind_functorILin1ENS_18bound_mem_functor4IvN3cui15LoggedSlotChainENS_4slotIvbRKNS4_5ErrorENS_3nilESA_SA_SA_SA_EENS6_IvSA_SA_SA_SA_SA_SA_SA_EERKN3utf6stringENS6_IvSB_SC_SA_SA_SA_SA_SA_EEEESE_SH_SA_SA_SA_SA_SA_EEvSB_SC_E7call_itEPNS0_8slot_repERKSB_RKSC_ (libvmwareui.so + 0x10f4b4b)
#24 0x00007fe9375b34b8 _ZN3cui9SlotChain8NextSlotEj (libvmwareui.so + 0xfb34b8)
#25 0x00007fe93bb6c19a _ZN4sigc8internal10slot_call0INS_19bind_return_functorIbNS_4slotIvNS_3nilES4_S4_S4_S4_S4_S4_EEEEbE7call_itEPNS0_8slot_repE (libvmplayer.so + 0xeb19a)
#26 0x00007fe93b64a93d n/a (libglibmm-2.4.so.1 + 0x4a93d)
#27 0x00007fe93b997924 n/a (libglib-2.0.so.0 + 0x5e924)
#28 0x00007fe93b994f30 n/a (libglib-2.0.so.0 + 0x5bf30)
#29 0x00007fe93b996b58 n/a (libglib-2.0.so.0 + 0x5db58)
#30 0x00007fe93b99742f g_main_loop_run (libglib-2.0.so.0 + 0x5e42f)
#31 0x00007fe93a7f6a9d gtk_main (libgtk-3.so.0 + 0x1f6a9d)
#32 0x00007fe93bb31fea main (libvmplayer.so + 0xb0fea)
#33 0x0000564a3808ba50 n/a (appLoader + 0x1ca50)
#34 0x0000564a38087ba0 n/a (appLoader + 0x18ba0)
#35 0x00007fe93c6281b0 __libc_start_call_main (libc.so.6 + 0x281b0)
#36 0x00007fe93c628279 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x28279)
#37 0x0000564a38088045 n/a (appLoader + 0x19045)
ELF object binary architecture: AMD x86-64

So to summarize:

Using the 6.6.7.1 kernel with the 17.5.0 modules compiled from what I downloaded from here back on 11/19/2023, and using the 6.6.7.1 kernel with the 17.5.0 modules compiled from what I downloaded here today, both produce the above coredump when the VM is shut down.

So I still get a coredump at shutdown, but the coredump I now get with the 6.6.7.1 kernel is different from the one I originally reported with the 6.6.1-1.1 kernel.

Since the latest 17.5.0 modules seem to work, with the only issue being the coredump at shutdown, I will leave them installed and see if any other issues occur.

Hope that is helpful

Joe

mkubecek commented 8 months ago

This is a userspace application crash; I cannot really help you with that. It looks very similar to what was discussed above on Nov 19-22, as far as I can say.

JoeSalmeri commented 8 months ago

Yeah, when I saw libvmwarebase.so I knew it was in the closed-source part, but I figured I'd post an update for everyone who has been part of this thread.

I'm curious, what distro and kernel do you use?

mkubecek commented 8 months ago

I'm using Leap 15.5 but with a newer kernel. At the moment, it's 6.7, essentially the same as Tumbleweed (or what TW is going to get soon, I'm not sure). But I plan to test 6.8-rc1 later this evening or tomorrow morning.

JoeSalmeri commented 8 months ago

So you see the same issues discussed here on Leap, right? (Since the bug is in the VMware closed source.)

mkubecek commented 8 months ago

No, I haven't seen those yet. But I have VMware Player (17.5.0) on this machine; I only have Workstation on another, which I'm using remotely most of the time.

raintonr commented 8 months ago

I'm on VMware Player 17.5.0 and kernel 6.6.9-200 (fc39). Currently vmplayer causes CPU hangs. It does actually work, but other processes start to misbehave and I end up having to hard-reset the host machine because it won't even shut down cleanly afterwards.

The service starts fine and the VM comes up and appears to work fine, but on shutdown vmplayer coredumps. It does not crash Linux or seem to cause any other issues...

So that would be preferable to what's happening here!

mkubecek commented 8 months ago

I'm on VMware Player 17.5.0 and kernel 6.6.9-200 (fc39). Currently vmplayer causes CPU hangs. It does actually work, but other processes start to misbehave and I end up having to hard-reset the host machine because it won't even shut down cleanly afterwards.

Does this happen with modules built from (up to date) source from this repository or with unpatched modules from VMware?

raintonr commented 8 months ago

Does this happen with modules built from (up to date) source from this repository or with unpatched modules from VMware?

With the unpatched stock modules. I can't find any branch or tag relating to p17.5.0 here or I would try it. Or did I miss something?

mkubecek commented 8 months ago

The modules have been exactly the same in Workstation and Player for years, so starting with 17.0.0 I no longer maintain two branches with identical content. Just use the head of the workstation-17.5.0 branch for Player as well. (It is mentioned in the INSTALL file.)

raintonr commented 8 months ago

Just use the head of the workstation-17.5.0 branch for Player as well.

That's much better. Got the coredump on guest OS (Windows 10) shutdown, but no more CPU hangs.

Many thanks for your help :smiley:

JoeSalmeri commented 8 months ago

I'm on VMware Player 17.5.0 and kernel 6.6.9-200 (fc39). Currently vmplayer causes CPU hangs. It does actually work, but other processes start to misbehave and I end up having to hard-reset the host machine because it won't even shut down cleanly afterwards.

The service starts fine and the VM comes up and appears to work fine, but on shutdown vmplayer coredumps. It does not crash Linux or seem to cause any other issues...

So that would be preferable to what's happening here!

Sounds like you are using the vmmon/vmnet modules provided with VMware 17.5.0, as that is the behavior I also saw.

Installing the vmmon/vmnet modules from here fixed that, and I only have the coredump-at-shutdown issue now. I am using kernel 6.6.7-1 right now, but will be updating to a newer TW build with kernel 6.7.1-2.1 shortly after the start of next month.

willzyx-hub commented 8 months ago

The modules have been exactly the same in Workstation and Player for years, so starting with 17.0.0 I no longer maintain two branches with identical content. Just use the head of the workstation-17.5.0 branch for Player as well. (It is mentioned in the INSTALL file.)

I have the same issue. I used your VMware modules from your latest branch (17.5.0), but after starting the guest the host kernel seems to act weird (the TTY doesn't work, the host can't shut down, etc.). Here's the dmesg:

kernelerror.log

JoeSalmeri commented 8 months ago

@mkubecek

No, I haven't seen those yet. But I have VMware Player (17.5.0) on this machine; I only have Workstation on another, which I'm using remotely most of the time.

Interesting. I wonder why you are not seeing the issue on Leap 15.5 with the newer 6.7 kernel?

I am working on a different, unrelated issue, and SUSE support created a special 6.7.2 kernel for me to test it. Since I had that installed, I tried VMware with it too: it also coredumps when you shut down the VM, but as with the other kernels, the VM runs fine until shutdown.

mkubecek commented 8 months ago

Interesting. I wonder why you are not seeing the issue on Leap 15.5 with the newer 6.7 kernel?

As I said before, this rather looks like a userspace problem, and Leap 15.5 userspace is quite different from Tumbleweed's.

oscarfv commented 7 months ago

Just for the record, I've experienced the problems described here on Debian Testing with kernel 6.6.15 / motherboard ASUS Z170A BIOS 3802 / VMware Workstation 17.5 and up-to-date modules from this repo (thanks @mkubecek!). The system becomes increasingly unusable after closing a VM, eventually requiring a hard reset.

IIUC, with up-to-date modules I should just get a coredump at shutdown, but that's not what I'm experiencing.

I'll look into downgrading to kernel 6.5 for now.

ja-jaa-org-uk commented 7 months ago

Just for info: Fedora 39, 6.7.6-200.fc39.x86_64. Updated to 17.5.1 hoping that things had been fixed. Ran a Windows 11 VM without the mkubecek improvements - the host locked up, but I think only at shutdown of the VM. See errors attached. Ran the mkubecek modules with 17.5.1 and things appeared to be OK - thanks again!

Errors_Raw_17.5.1_6.7.6-200.fc39.zip

mkubecek commented 7 months ago

Updated to 17.5.1 hoping that things had been fixed.

Unfortunately not. There was no update of the module source; 17.5.1 has exactly the same modules as 17.5.0.

MarkTr commented 6 months ago

Today kernel 6.8.1 arrived on my Tumbleweed, and it does solve this issue for me. With 6.7.x kernels, running a local VM always broke the host system in various places (e.g. Firefox, sudo) and didn't let me shut down the host completely, no matter whether I used the original kernel modules or the ones from this repository (on 6.8 I have to use the latter).

JoeSalmeri commented 4 months ago

As the person who originally opened this issue, I thought I would post an update that others struggling with these issues may find helpful.

I have used VMware products since around VMware 2.0 or 3.0... so quite a long time.

I have worked as a system administrator (mostly Windows, but also some Linux and Solaris), a database administrator (Oracle and MS SQL Server), and a network administrator, with customers all over the US, for 25+ years.

Overall I was pretty happy with VMware; however, up until about 3 years ago I was always using a Windows host and Linux guests.

At that point, I decided to make the switch to Linux (openSUSE Tumbleweed, for those that care).

I had grown tired of issues not getting resolved on Windows, and after testing out my workflow on Linux I made the switch.

In order to make the move easier, I stuck with VMware, since I could easily move my VMs over to the Linux environment, and the only new issue was the need to compile the kernel modules when new TW builds came out.

That worked OK, but there were always various issues, like the shutdown one discussed here, as well as others over time.

Then Broadcom acquired VMware and started making changes to how licensing worked; having past experience with another company they acquired, that was concerning to me.

So I decided to spend some time looking into KVM, since the modules are part of the kernel and it was also supposed to perform better.

That was 2 months ago.

My initial tests went well AND it was pretty easy to migrate ALL my VMs over.

There are tools that can help with this, but instead I did the following:

1) Copy the VM's *.vmdk file to the storage location for my KVM disk images
2) Create a new KVM VM and point it at the disk image

There are a few little nuances (like Arch Linux not supporting Secure Boot, which took me a little while to figure out was the issue), but overall it was really that simple.

After running that way for a little while, I used qemu-img to convert the copied vmdk files to qcow2 files, because additional features are available when you use qcow2 files.
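
The conversion itself is a one-liner (file names here are hypothetical):

    # Convert a copied VMware disk to qcow2; -p shows progress.
    qemu-img convert -p -f vmdk -O qcow2 win10.vmdk win10.qcow2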

After getting everything up and running, I decided to see how far I could push this new computer (i7-14700K, 64 GB memory).

I started up 12 KVM VMs (10 different Linux distros + 1 Windows 10 and 1 Windows 11) and started distro/Windows updates on all of them at the same time.

At the same time, 2 users were remoted into my PC doing various tasks, I was remoted into another server, and there was a 60 GB file transfer occurring on the network.

No one noticed ANY performance degradation, or even that all of that was going on.

To push the system further, I installed and ran the s-tui tool in stress test mode so all 28 cores were pegged at 100%.

I was quite shocked to see that even under that load all the VMs ran smoothly and continued their updates, and no users noticed any performance issues.

I don't think I'll be going back to VMware :-)

I DEFINITELY appreciate all of @mkubecek's efforts to provide these modules for Linux to address issues that VMware/Broadcom hasn't fixed yet, but if you are using a Linux host, I would seriously consider looking into KVM.

NOTE: You cannot run multiple hypervisors at the same time, BUT you can have BOTH VMware and KVM installed at the same time.

Just shut down the VMware services and modprobe -r the vmmon and vmnet modules; then you can use KVM, and reverse the process if you need to switch back to VMware.
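
As a sketch (service name as used earlier in this thread):

    sudo systemctl stop vmware      # or: /etc/init.d/vmware stop
    sudo modprobe -r vmmon vmnet    # unload the VMware modules
    # ... run KVM guests ...
    sudo systemctl start vmware     # loads vmmon/vmnet again for VMware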

Hope this is helpful to others!

Joe