vmware / open-vm-tools

Official repository of VMware open-vm-tools project
http://sourceforge.net/projects/open-vm-tools/

nvme I/O timeout, aborting warning messages in journal and short I/O hang on Fedora guest VM Windows host #579

Open hermidalc opened 2 years ago

hermidalc commented 2 years ago

Describe the bug

I quite often see warning messages like "nvme nvme0: I/O 34 QID 13 timeout, aborting" in the journal, and they correlate with I/O appearing to hang briefly. Why is this happening?

$ sudo journalctl --boot=0 | grep nvme
Mar 23 09:54:51 dubbi kernel: nvme nvme0: pci function 0000:13:00.0
Mar 23 09:54:51 dubbi kernel: nvme nvme0: 15/0/0 default/read/poll queues
Mar 23 09:54:51 dubbi kernel:  nvme0n1: p1 p2 p3
Mar 23 09:54:51 dubbi systemd-fsck[541]: /dev/nvme0n1p3: clean, 1005458/32669696 files, 63842953/130655744 blocks
Mar 23 09:54:51 dubbi kernel: EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Mar 23 09:54:52 dubbi kernel: EXT4-fs (nvme0n1p3): re-mounted. Opts: (null). Quota mode: none.
Mar 23 09:54:53 dubbi systemd-fsck[755]: /dev/nvme0n1p2: clean, 43/65536 files, 73555/262144 blocks
Mar 23 09:54:53 dubbi kernel: EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Mar 23 09:54:53 dubbi systemd-fsck[756]: /dev/nvme0n1p1: 12 files, 1555/153296 clusters
Mar 23 09:59:36 dubbi kernel: nvme nvme0: I/O 128 QID 15 timeout, aborting
Mar 23 09:59:36 dubbi kernel: nvme nvme0: Abort status: 0x0
Mar 23 10:00:36 dubbi kernel: nvme nvme0: I/O 130 QID 15 timeout, aborting
Mar 23 10:00:36 dubbi kernel: nvme nvme0: Abort status: 0x0
Mar 23 10:01:33 dubbi kernel: nvme nvme0: I/O 132 QID 15 timeout, aborting
Mar 23 10:01:33 dubbi kernel: nvme nvme0: Abort status: 0x0
Mar 23 10:02:18 dubbi kernel: nvme nvme0: I/O 224 QID 8 timeout, aborting
Mar 23 10:02:18 dubbi kernel: nvme nvme0: Abort status: 0x0
Mar 23 10:02:50 dubbi kernel: nvme nvme0: I/O 226 QID 8 timeout, aborting
Mar 23 10:02:50 dubbi kernel: nvme nvme0: Abort status: 0x0
Mar 23 10:03:31 dubbi kernel: nvme nvme0: I/O 32 QID 5 timeout, aborting
Mar 23 10:03:31 dubbi kernel: nvme nvme0: Abort status: 0x0
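To see how often the timeouts occur and which queues they hit, the journal output above can be summarized with a short script. This is only a sketch: the function names `parse_timeouts` and `summarize` are made up for illustration, the regular expression assumes the short journalctl line format shown above, and the placeholder year is an assumption because that format carries no year.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
#   Mar 23 09:59:36 dubbi kernel: nvme nvme0: I/O 128 QID 15 timeout, aborting
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"nvme (?P<dev>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, aborting"
)

# Assumption: the short journal format has no year, so a placeholder is used.
PLACEHOLDER_YEAR = "2000"


def parse_timeouts(lines):
    """Return (timestamp, device, command tag, queue id) for each timeout line."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m is None:
            continue
        ts = datetime.strptime(
            f"{PLACEHOLDER_YEAR} {m.group('ts')}", "%Y %b %d %H:%M:%S"
        )
        events.append((ts, m.group("dev"), int(m.group("tag")), int(m.group("qid"))))
    return events


def summarize(events):
    """Count timeouts per queue and list the gaps (seconds) between consecutive ones."""
    per_queue = Counter(qid for _, _, _, qid in events)
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]
    return per_queue, gaps


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): journalctl --boot=0 | python3 nvme_timeouts.py
    per_queue, gaps = summarize(parse_timeouts(sys.stdin))
    print("timeouts per queue:", dict(per_queue))
    print("seconds between timeouts:", gaps)
```

For the six timeout lines in the log above, this gives three timeouts on queue 15, two on queue 8 and one on queue 5, spaced between 32 and 60 seconds apart.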

My Fedora guest VM is stored on an NVMe disk and runs under the latest VMware Workstation, 16.2.3, on a Windows 10 host. I am using the latest firmware and drivers for the NVMe disk.

Reproduction steps

I can reproduce it on my setup by running an rsync of very large files from one filesystem location to another on the Fedora guest VM.

Expected behavior

No nvme timeout warning messages

Additional context

No response

PaTHml commented 2 years ago

NVMe storage issues are not related to open-vm-tools; Please open a service/support request with VMware support to diagnose the issue further.

hermidalc commented 2 years ago

> NVMe storage issues are not related to open-vm-tools; Please open a service/support request with VMware support to diagnose the issue further.

VMware support is generally a big waste of time for general software issues, because they will think it’s a problem with your specific computer or setup and not a general issue with the software.

They also don’t offer expedited support to power users who know it’s not a problem with their computer or setup. I didn’t have this issue historically on the same computer, same NVMe drive, and same Windows install. I’ve seen this issue posted by many other users online. That’s why I don’t bother with VMware Support: I don’t want to spend cycles proving to them that it’s an issue on their side, not mine.

PaTHml commented 2 years ago

We have informed the product team responsible for this issue.

As this is not related to open-vm-tools, the recommendation is to engage with the VMware Workstation community (https://communities.vmware.com/t5/VMware-Workstation/ct-p/3019-home) or the VMware support service for further updates on this issue.

General guess, based on the messages seen: I/O is slow or stuck, or an interrupt was missed. The abort succeeds and appears to clear the condition for a short time. All of this points to storage issues.
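For context, the Linux NVMe driver logs "timeout, aborting" when a command has not completed within its I/O timeout. That timeout is normally exposed as the `io_timeout` parameter of the `nvme_core` module, in seconds. The sketch below just reads the current value; the sysfs path is an assumption about a typical Linux guest and the helper name `read_io_timeout` is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the nvme_core "io_timeout" module parameter (seconds).
IO_TIMEOUT_PATH = Path("/sys/module/nvme_core/parameters/io_timeout")


def read_io_timeout(path=IO_TIMEOUT_PATH):
    """Return the timeout in seconds, or None if the file is missing or unparsable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    timeout = read_io_timeout()
    if timeout is None:
        print("nvme_core io_timeout is not readable on this system")
    else:
        print(f"nvme_core io_timeout = {timeout} s")
```

Knowing this value helps when reading the log: a command has to be outstanding for that long before the warning is printed, so each message reflects a stall of at least that duration for the affected request.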

As for the information likely to be needed by support:

Guest OS:
Host OS:
VMware Workstation:

PaTHml commented 2 years ago

As an aside, the VMTN for Workstation has/had a similar issue thread, might be of help to you: https://communities.vmware.com/t5/VMware-Workstation-Pro/Workstation-Pro-16-NVMe-controller/m-p/2822786#M168257

hermidalc commented 2 years ago

> As an aside, the VMTN for Workstation has/had a similar issue thread, might be of help to you: https://communities.vmware.com/t5/VMware-Workstation-Pro/Workstation-Pro-16-NVMe-controller/m-p/2822786#M168257

Exactly, I already saw this and multiple other threads online from users having the same problem. That's why I don't want to deal with VMware support: they will make me waste time and cycles by first assuming it's specific to my box and setup, when clearly it's not. It's a general VMware issue.

Summary from the issue thread above - something is generally going wrong with the VMware Workstation 16 virtual NVMe adapter and it needs to be fixed.

RevAngel7 commented 2 years ago

Hello. I am not using VMware, but I have this issue on Ubuntu with the latest mainline kernels.

I am on the 5.19 series and have had the issue from 5.19.1 through 5.19.11, the most recent Ubuntu mainline build, which I currently run on Ubuntu 22.04.

nvme nvme0: Abort status: 0x0
nvme nvme0: I/O 14 QID 2 timeout, aborting
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O 62 QID 2 timeout, aborting
and so on...

I have had these issues since I changed from an AMD 2400G on an AM4 board (X370 chipset) with an AGESA older than 1.0.0.6 to an AMD 5600G on an AM4 board (A520 chipset) with AGESA 1.2.0.7. The NVMe drive is the same.

I also get an error message from the NVMe drive at bootup: "Device: /dev/nvme0, number of Error Log entries increased from 203 to 206". This counter rises by 1 at every power-off (the jump from 203 to 206 comes from an image backup I did with 3 manual power-offs, so it seems every power-off counts as one error).
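To track how fast that counter grows across boots, the message can be parsed for its old and new values. This is a minimal sketch that assumes the message wording quoted above (it looks like the format smartd logs); the helper name `error_log_delta` is made up for illustration.

```python
import re

# Matches the message quoted above, e.g.
#   Device: /dev/nvme0, number of Error Log entries increased from 203 to 206
ERRLOG_RE = re.compile(
    r"Device: (?P<dev>\S+), number of Error Log entries increased "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)


def error_log_delta(message):
    """Return (device, number of new entries), or None if the message does not match."""
    m = ERRLOG_RE.search(message)
    if m is None:
        return None
    return m.group("dev"), int(m.group("new")) - int(m.group("old"))


if __name__ == "__main__":
    msg = "Device: /dev/nvme0, number of Error Log entries increased from 203 to 206"
    print(error_log_delta(msg))
```

Comparing the delta against the number of power-offs since the previous message would confirm or refute the "one entry per power-off" observation.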

The NVME drive never breached the high temp count and shows zero errors and very little wear on SMART tests, since I use this drive for a simple "daily use" multimedia system.

When the "nvme nvme0: Abort status: 0x0" / "nvme nvme0: I/O 14 QID 2 timeout, aborting" errors occur, the system hangs for a while, no I/O operations get processed for up to 30 seconds.

I found an entry from 2020, and it still seems to be an issue in late 2022. Is there a general fix or workaround for users, or are kernel changes planned? ( https://github.com/clearlinux/distribution/issues/2121 )

Thank you for any reply and/or ideas.

johnwvmw commented 2 years ago

@RevAngel7 From your statement

> I am not using VMware, but I have this issue on ubuntu with the latest mainline kernels.

it suggests that the problem(s) are not Workstation or open-vm-tools issues but something tied directly to the Linux kernel release and/or the NVME driver. Thanks for the 2020 bug reference.

Has the problem been raised with the Linux vendor(s)?

RevAngel7 commented 2 years ago

I totally get why my report is out of place. I really do. And since I consider myself more of a user than a tech-savvy person, I also understand the reluctance to consider my comment a real issue.

This bug is still open, if I am reading it right.

The same issue on https://github.com/vmware/open-vm-tools/issues/579 , also unsolved.

The same issue on https://github.com/clearlinux/distribution/issues/2121 , also unsolved.

And there is my issue on ubuntu.

Three different kernels and Linux distributions, same issue. I thought bringing together the people who actually have the tech knowledge to get behind this issue might be helpful (but that's just me).

hermidalc commented 2 years ago

FYI, this bug still occurs on a Fedora 36 guest with Linux kernel 5.19, so I'm not sure if it's a kernel bug.

RevAngel7 commented 2 years ago

I don't even know where to report this for Ubuntu, to be totally honest. Like I said, I am a user who just stumbled over his Ubuntu logs and googled them, and found the entries here and at the Clear Linux distro. Do you have any suggestion for what to do to help solve this riddle?

Edit: Found the Ubuntu Launchpad for reporting a bug; trying to fill out a (hopefully not completely incompetent) bug report there.

RevAngel7 commented 2 years ago

Posted it on https://bugs.launchpad.net/launchpad/+bug/1991291 fyi

johnwvmw commented 2 years ago

Launchpad is the bug tracking system used by Ubuntu. Also a search of "Ubuntu how to file a problem report" led me to https://help.ubuntu.com/community/ReportingBugs. This provides some additional info about bug reporting tools that make it easy to capture crash dumps and system dumps for upload, if needed.

hermidalc commented 2 years ago

Actually, I'm probably wrong. I've searched for this issue on Google and see it mentioned for multiple Linux distributions without any mention of VMware. So yes, it could be a Linux kernel issue. Others have also mentioned that it is still present in 5.19.

SIMULATAN commented 1 year ago

Sorry for necroposting, I personally experience this issue on stock Arch on physical hardware, also since ~5.19, so +1 for a linux kernel bug from me. (it just happened to me on 6.0.8, the issue seems to still be present nowadays)

UPDATE 28.11.: I upgraded to 6.0.9 10 days ago and I haven't encountered the issue once since then

UPDATE 6.2.2023: this is still an issue and occurs regularly (like once a week)

RevAngel7 commented 1 year ago

No problem, I really guess the issue source is unknown or sporadic. You mentioned 6.0.8, what distro are you using? I am on Ubuntu mainline, and the issue left with the first 6.0 kernel I installed, 6.0.3. And it stayed away until now, on 6.0.8.

So we are using the same kernel version, my issue gone, yours still there, hmm. Weird.

OSS542 commented 1 year ago

This is still an issue in kernel 6.1 - see https://bugzilla.kernel.org/show_bug.cgi?id=216809