Open secabeen opened 1 week ago
Same here, on 4 separate machines.
It happens when there is a large spike in network traffic.
I had the same problems. I have a Supermicro server and various other PCs (Shuttle), and the problem is the same everywhere, so it is not Supermicro related. However, I also have problems with all three brands of SAS controller I have tried. They are reset every few days: sometimes after a month, sometimes after 24 hours. I have been in contact with Supermicro and have updated the Supermicro BIOS, the SAS adapter BIOS and the Intel E810 BIOS (which was quite complicated). What I am doing for now is running a script that listens on the journal and reboots the machine if "HMC Error" occurs. For the SAS issue I have no solution yet (maybe the Supermicro enclosure is faulty). I am also using the driver from Intel (instead of, or in addition to, the in-kernel one). Fortunately the "HMC Error" no longer occurs for me, so there is a good chance it will also go away in your scenario if you are using the latest 6.8 kernel (standard Proxmox), the Intel driver, or have applied the Supermicro or Intel BIOS updates.
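For anyone wanting to try the same stop-gap, here is a rough sketch of such a journal watcher in Python. It is not the script mentioned above. The function names are made up for illustration, and the exact wording of the log message is an assumption, so the match is case-insensitive. It only assumes that `journalctl` and `systemctl` are available, as they are on a standard Proxmox/Debian install.

```python
import subprocess
from typing import Callable, Iterable, Iterator

# Assumption: the kernel log line contains this text (taken from the comment above).
PATTERN = "HMC Error"


def watch(lines: Iterable[str], on_match: Callable[[str], None],
          pattern: str = PATTERN) -> bool:
    """Call on_match for the first line containing pattern (case-insensitive).

    Returns True if a matching line was found, False if the input ended first.
    """
    needle = pattern.lower()
    for line in lines:
        if needle in line.lower():
            on_match(line)
            return True
    return False


def follow_kernel_journal() -> Iterator[str]:
    """Yield new kernel messages as they arrive (no backlog)."""
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-n", "0", "-o", "cat"],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    yield from proc.stdout


def reboot(line: str) -> None:
    """Hypothetical action: log the matching line, then reboot via systemd."""
    print(f"matched: {line.strip()}; rebooting")
    subprocess.run(["systemctl", "reboot"], check=False)


if __name__ == "__main__":
    watch(follow_kernel_journal(), reboot)
```

Run it as root (for example from a systemd unit). Rebooting is a blunt workaround that hides the reset rather than fixing it, so it is worth swapping `reboot` for something that only logs while you are still collecting data.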
Same problem here. Card: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02); driver: ice; Debian 12.5, Proxmox 8.1.4.
From the attached log I can see that the irdma driver is requesting a reset. Do you run RDMA traffic? If not, you could try unloading (and blacklisting) the irdma driver.
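A small sketch of how one might check for the module and prepare the blacklist entry, in Python. The helper names and the file name `blacklist-irdma.conf` are illustrative only (any `*.conf` file under `/etc/modprobe.d/` is read by modprobe). Note that a `blacklist` line only stops automatic loading by alias, so depending on the setup the initramfs may also need to be regenerated.

```python
from pathlib import Path


def is_module_loaded(name: str, proc_modules: str) -> bool:
    """Return True if `name` appears in /proc/modules-style text.

    /proc/modules lists one module per line; the first field is the name.
    """
    return any(
        line.split()[0] == name
        for line in proc_modules.splitlines()
        if line.strip()
    )


def blacklist_conf(name: str) -> str:
    """Content for a modprobe.d file that blacklists `name`."""
    return f"blacklist {name}\n"


if __name__ == "__main__":
    loaded = is_module_loaded("irdma", Path("/proc/modules").read_text())
    print("irdma loaded:", loaded)
    # Hypothetical file name; write it yourself as root if you want the blacklist.
    print("suggested /etc/modprobe.d/blacklist-irdma.conf:")
    print(blacklist_conf("irdma"), end="")
    print("to unload now (as root): modprobe -r irdma")
```

The script only prints what it would do and does not change the system, so it is safe to run on a production host before deciding whether to unload the module.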
I asked internal validation teams if they observed similar issues.
Thank you for raising the issue!
I have three SYS-111C-NR servers with AOC-A25G-i2SM NICs, which use the Intel E810-XXVAM2 controller, running Proxmox PVE with the ice driver. Under load, the driver regularly resets the NIC with an OICR / HMC error.
Issue Description These issues are intermittent, but regular, and occur on all three systems.
Driver Version ice-1.14.11, but the issue also occurred on older versions, including the driver included with Linux Kernel 6.8.12.
Custom Code No
Reproduction Steps
Expected Behavior Card does not reset.
Actual Behavior Card is reset by driver and interrupts established connections.
Additional Information Other users report similar issues on other OEM hardware based around the E810-XXVAM2, and also report that the issue does not appear on other Hypervisors such as ESXi. One report shows improvement after BIOS update and card replacement. We have three servers, which reduces likelihood of individual card failure. BIOS updates through May 2024 were installed in an attempt to resolve the problem, without success.
Details of other report: https://forum.proxmox.com/threads/network-crash-after-3-or-4-hours.122328/
Relevant Log Output