intel / ethernet-linux-ice

GNU General Public License v2.0

Recurring card resets after ICE OICR event notification: oicr = 0x06000803 / HMC Error on 3 separate machines #12

Open secabeen opened 1 week ago

secabeen commented 1 week ago

I have three SYS-111C-NR servers with AOC-A25G-i2SM NICs, which are built on the Intel E810-XXVAM2 controller. They run Proxmox PVE with the ice driver, and under load the NIC regularly resets with an OICR / HMC error.

Issue Description

The resets are intermittent but recurring, and occur on all three systems.

Driver Version

ice-1.14.11. The problem also occurred with older versions, including the driver included with Linux kernel 6.8.12.

Custom Code

No

Reproduction Steps

1. Run network-intensive VMs on the latest Proxmox PVE.
2. Wait some hours to days.
3. Observe NIC reset.

Expected Behavior

The card does not reset.

Actual Behavior

The driver resets the card, which interrupts established connections.

Additional Information

Other users report similar issues on other OEM hardware based on the E810-XXVAM2, and also report that the issue does not appear on other hypervisors such as ESXi. One report shows improvement after a BIOS update and card replacement. We see the problem on three servers, which makes an individual card failure unlikely. BIOS updates through May 2024 were installed in an attempt to resolve the problem, without success.

Details of other report: https://forum.proxmox.com/threads/network-crash-after-3-or-4-hours.122328/

Relevant Log Output

[Thu Oct 17 09:38:49 2024] ice 0000:c7:00.0 irdma1: ICE OICR event notification: oicr = 0x06000803
[Thu Oct 17 09:38:49 2024] ice 0000:c7:00.0 irdma1: HMC Error
[Thu Oct 17 09:38:49 2024] ice 0000:c7:00.0 irdma1: Requesting a reset
[Thu Oct 17 09:38:49 2024] infiniband irdma1: ib_query_port failed (-19)
[Thu Oct 17 09:38:49 2024] irdma_dbg_pf_exit: removing debugfs entries
[Thu Oct 17 09:38:49 2024] dmar_fault: 311 callbacks suppressed
[Thu Oct 17 09:38:49 2024] DMAR: DRHD: handling fault status reg 2
[Thu Oct 17 09:38:49 2024] DMAR: [DMA Read NO_PASID] Request device [c7:00.0] fault addr 0x844b7000 [fault reason 0x71] SM: Present bit in first-level paging entry is clear
[Thu Oct 17 09:38:49 2024] DMAR: DRHD: handling fault status reg 2
[Thu Oct 17 09:38:49 2024] DMAR: [DMA Read NO_PASID] Request device [c7:00.0] fault addr 0x844b7000 [fault reason 0x71] SM: Present bit in first-level paging entry is clear
[Thu Oct 17 09:38:49 2024] DMAR: DRHD: handling fault status reg 2
[Thu Oct 17 09:38:49 2024] DMAR: [DMA Read NO_PASID] Request device [c7:00.0] fault addr 0x844b7000 [fault reason 0x71] SM: Present bit in first-level paging entry is clear
[Thu Oct 17 09:38:49 2024] DMAR: DRHD: handling fault status reg 2
[Thu Oct 17 09:38:49 2024] vmbr0: port 1(enp199s0f0np0) entered disabled state
[Thu Oct 17 09:38:49 2024] ice 0000:c7:00.1: Failed to  disable iWARP filtering
[Thu Oct 17 09:38:49 2024] enp199s0f1np1 speed is unknown, defaulting to 1000
[Thu Oct 17 09:38:49 2024] infiniband irdma0: ib_query_port failed (-19)
[Thu Oct 17 09:38:49 2024] irdma_dbg_pf_exit: removing debugfs entries
[Thu Oct 17 09:38:49 2024] ice 0000:c7:00.1 irdma0: WS: LAN free_res for rdma qset failed.
[Thu Oct 17 09:38:51 2024] ice 0000:c7:00.1: The DDP package was successfully loaded: ICE OS Default Package version 1.3.36.0
[Thu Oct 17 09:38:51 2024] ice 0000:c7:00.1: PTP reset successful
[Thu Oct 17 09:38:51 2024] ice 0000:c7:00.0: DDP package already present on device: ICE OS Default Package version 1.3.36.0
[Thu Oct 17 09:38:51 2024] ice 0000:c7:00.0: PTP reset successful
[Thu Oct 17 09:38:52 2024] ice 0000:c7:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[Thu Oct 17 09:38:52 2024] ice 0000:c7:00.1: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
[Thu Oct 17 09:38:52 2024] probe: cdev_info=00000000ab54a062, cdev_info->dev.aux_dev.bus->number=199, cdev_info->rdma_active_port=0xff netdev=enp199s0f1np1
[Thu Oct 17 09:38:52 2024] ice 0000:c7:00.1: irdma_fill_device_info: iwdev->lag_mode = 0
[Thu Oct 17 09:38:55 2024] ice 0000:c7:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[Thu Oct 17 09:38:55 2024] vmbr0: port 1(enp199s0f0np0) entered blocking state
[Thu Oct 17 09:38:55 2024] vmbr0: port 1(enp199s0f0np0) entered forwarding state
[Thu Oct 17 09:38:55 2024] ice 0000:c7:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
[Thu Oct 17 09:38:55 2024] probe: cdev_info=000000004ce78438, cdev_info->dev.aux_dev.bus->number=199, cdev_info->rdma_active_port=0xff netdev=enp199s0f0np0
[Thu Oct 17 09:38:55 2024] ice 0000:c7:00.0: irdma_fill_device_info: iwdev->lag_mode = 0
[Thu Oct 17 09:38:55 2024] ice 0000:c7:00.0 enp199s0f0np0: NIC Link is Down
[Thu Oct 17 09:38:56 2024] ice 0000:c7:00.0 enp199s0f0np0: NIC Link is up 10 Gbps Full Duplex, Requested FEC: NONE, Negotiated FEC: NONE, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: None

firth commented 1 week ago

Same here, on 4 separate machines.

It happens when there is a large spike in network traffic.

noppel commented 1 week ago

I had the same problem. I have a Supermicro server and various other PCs (Shuttle), and the problem is the same everywhere, so it is not Supermicro-related. However, I also have problems with every SAS controller brand I have tried (three so far): they reset every few days, sometimes after a month, sometimes after 24 hours. I have been in contact with Supermicro and have updated the Supermicro BIOS, the SAS adapter BIOS, and the Intel E810 BIOS (which was quite complicated). For now, I have a script that listens on the journal and reboots the machine if "HMC Error" occurs. For the SAS issue I have no solution yet (maybe the Supermicro enclosure is faulty). I am also using the kernel driver from Intel (instead of, or in addition to, the in-kernel one). Fortunately, the "HMC Error" no longer occurs for me, so there is a good chance it will also go away for you if you use the latest 6.8 kernel (standard Proxmox) or the Intel driver, or apply the Supermicro or Intel BIOS updates.
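
The journal-watching workaround described above could look roughly like the following. This is a minimal Python sketch, not the script actually used in this comment: the names `is_hmc_error`, `first_hmc_error` and `watch_kernel_log` are hypothetical, and it assumes systemd's `journalctl` is available with its standard `-k` (kernel messages), `-f` (follow), `-n 0` (no backlog) and `-o cat` (message text only) options. What to do on a match (a reboot, in the comment above) is left to the caller.

```python
import subprocess
from typing import Callable, Iterable, Optional

# Substring taken from the log lines in this issue, e.g.
# "ice 0000:c7:00.0 irdma1: HMC Error"
HMC_MARKER = "HMC Error"


def is_hmc_error(line: str) -> bool:
    """Return True if a kernel log line reports the HMC error."""
    return HMC_MARKER in line


def first_hmc_error(lines: Iterable[str]) -> Optional[str]:
    """Return the first matching line (stripped), or None if there is none."""
    for line in lines:
        if is_hmc_error(line):
            return line.strip()
    return None


def watch_kernel_log(on_match: Callable[[str], None]) -> None:
    """Follow the kernel journal and call on_match for the first HMC error.

    Assumption: journalctl is on PATH. This call blocks until a match
    is seen or journalctl exits.
    """
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-n", "0", "-o", "cat"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        hit = first_hmc_error(proc.stdout)
        if hit is not None:
            on_match(hit)
    finally:
        proc.terminate()
```

Calling `watch_kernel_log(print)` would simply print the first matching line; a function that triggers a reboot or sends an alert could be passed instead.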

igor-martynov commented 1 week ago

Same problem here. Card: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02); driver: ice; Debian 12.5, Proxmox 8.1.4.

lczapnik commented 21 hours ago

From the attached log I can see that the irdma driver is requesting the reset. Do you run RDMA traffic? If not, you could try unloading (and blacklisting) the irdma driver.
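
As a rough illustration of the suggestion above, the sketch below checks whether `irdma` is currently loaded and prints the usual manual steps for unloading and blacklisting it. The helper names and the file name `/etc/modprobe.d/blacklist-irdma.conf` are illustrative choices, not something prescribed by the ice driver documentation. The script only reads `/proc/modules` and changes nothing on the system.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set:
    """Parse /proc/modules content; the module name is the first field."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def irdma_advice(proc_modules_text: str) -> list:
    """Return the manual steps to unload and blacklist irdma, if it is loaded."""
    if "irdma" not in loaded_modules(proc_modules_text):
        return []
    return [
        # Unload the module from the running kernel.
        "modprobe -r irdma",
        # Hypothetical file name; any *.conf under /etc/modprobe.d works.
        "echo 'blacklist irdma' > /etc/modprobe.d/blacklist-irdma.conf",
    ]


def main() -> None:
    steps = irdma_advice(Path("/proc/modules").read_text())
    if not steps:
        print("irdma is not loaded")
    for step in steps:
        print(step)
```

Running `main()` as root is not required, since it only prints the commands. On Debian-based systems such as Proxmox, regenerating the initramfs (`update-initramfs -u`) may also be needed for the blacklist to apply at boot.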

I have asked our internal validation teams whether they have observed similar issues.

Thank you for raising the issue!