realganfan / r8125-esxi

Realtek RTL8125 driver for ESXi 6.7
GNU General Public License v3.0
drops off network #13

Open Rolzzz opened 2 years ago

Rolzzz commented 2 years ago

Hello there, thank you for providing the drivers. Unfortunately, when I do large file transfers (e.g. a 40 GB vmdk file), my ESXi host drops off the network. I have to unplug the network cable and plug it back in for connectivity to resume. I've tried other cables and switches, but the issue persists.

esxcli network nic get -n vmnic0

   Advertised Auto Negotiation: true
   Advertised Link Modes: 10BaseT/Half, 10BaseT/Full, 100BaseT/Half, 100BaseT/Full, 1000BaseT/Full, 2500BaseX/Full
   Auto Negotiation: true
   Cable Type: Twisted Pair
   Current Message Level: 51
   Driver Info:
         Bus Info: 0000:02:00.0
         Driver: r8125
         Firmware Version:
         Version: 9.007.01-NAPI
   Link Detected: true
   Link Status: Up
   Name: vmnic0
   PHYAddress: 0
   Pause Autonegotiate: true
   Pause RX: true
   Pause TX: true
   Supported Ports: TP
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: true
   Transceiver: internal
   Virtual Address: 00:50:56:5a:e2:95
   Wakeon: MagicPacket(tm)
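For anyone comparing setups in this thread, the driver version and link state can be pulled out of that output with a small helper. This is only a sketch, assuming the ESXi shell provides `awk` and that `esxcli network nic get` prints one `Key: value` pair per line; `nic_field` is a hypothetical helper name, not part of esxcli.

```shell
#!/bin/sh
# nic_field KEY: read `esxcli network nic get` output on stdin and print
# the value of the first line whose key is exactly KEY.
# (Hypothetical helper for illustration; not an esxcli feature.)
nic_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)      # drop leading indentation
            idx = index(line, ": ")       # split at the first ": "
            if (idx > 0 && substr(line, 1, idx - 1) == key) {
                print substr(line, idx + 2)
                exit
            }
        }'
}

# Usage on a host (vmnic0 is the NIC name from the post above):
#   esxcli network nic get -n vmnic0 | nic_field "Version"
#   esxcli network nic get -n vmnic0 | nic_field "Link Status"
```

Matching the whole key rather than a substring keeps `Version` from being confused with `Firmware Version`, which is empty in the output above.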

KajLehtinen commented 2 years ago

Hi!

I have the same issue. What machine are you running ESXi on? I've tried drivers up to 9.009.01 on an ASUS PN51. I've read somewhere that a Realtek-based 2.5 GbE network card, although a USB one, had heat issues and started dropping connections and throttling down as it heated up. That might be the culprit, since for me it only happens during heavy transfers.

/Kaj

Rolzzz commented 2 years ago

Yeah, I decided to move away from the onboard NIC in my ASUS PN50-E1, so I went with an external USB NIC with the ASIX AX88179 chipset (https://flings.vmware.com/usb-network-native-driver-for-esxi). Not ideal, but I wanted a stable ESXi box. Funnily enough, even though it's USB, this one (https://www.amazon.com.au/gp/product/B00AQM8586) actually had better throughput than the native NIC with this driver, before the NIC would drop off, of course. If the driver gets updated, I'd be all too happy to test again.

KajLehtinen commented 2 years ago

Have you tried the version located here: https://github.com/lengfwang/r8125-esxi6.7 - it seems to be the newest one someone has compiled and put up.

Rolzzz commented 2 years ago

I have not seen this one and will have to try it 👍

Rolzzz commented 2 years ago

@KajLehtinen sadly still same issue

Haxiboy commented 2 years ago

I have the same issue, is it really overheating?

Rolzzz commented 2 years ago

I'd be surprised if it were an overheating hardware issue; we'd hear more from regular Windows users if that were the case.

Haxiboy commented 2 years ago

I thought my issue had gone away with lengfwang's fork, but it happened again today. It could be heat: I noticed it only happens when I put a heavy workload on the NIC, and after I took off my rack's side panel it took 2 weeks for the issue to happen again (I had turned off the climate control in the room next to the rack). I'll borrow a thermal camera and monitor what's happening around the NIC and the controller; maybe a small heat sink will solve the problem.

Rolzzz commented 2 years ago

I can get it to crash every time I send an 80 GB vmdk file over via WinSCP. Then I have to unplug the NIC from the switch, wait, and plug it back in, and it starts working again, until after x minutes my resumed WinSCP transfer makes it fall over again.

I'd be interested to hear if you can replicate that.

Here is where mine sits in my study; I don't think it's heat related. [image]

Haxiboy commented 2 years ago

Mine is in a standard 4U rack with a ton of Noctua fans. I had an issue with an overheating Intel NIC before. I have a dual gigabit NIC lying around, so I'll try that too. Or maybe some load balancing would work. The strange thing is that we watch movies all day and torrents download 24/7, but I only get the issue when downloading via Sonarr. It could be a coincidence, though.

Sushifix commented 2 years ago

I have the same issue when using this driver under ESXi 6.7 on an ASUSTOR AS6702t. Because of it, ESXi is not usable on the device. Under Windows on the same device this does not happen, so I suspect a driver issue or a configuration issue within this driver. Looking forward to any solutions that are found.

jakubsuchybio commented 2 years ago

I have the same problem. My setup: pfSense runs inside my ESXi host, with one Intel NIC passed through to pfSense and a second (onboard) Realtek NIC managed by the ESXi host. Same behaviour: under heavy load (downloading tens of GB from Steam) the connection drops randomly. The only fix is to disconnect and reconnect the network cable. I'm not going to wait for a fix on the driver side; I'll buy a new PCIe NIC with an Intel chip and do it that way.

mcr-ksh commented 1 year ago

Unfortunately, the same issue here. It drops every time there is load. Disabling and re-enabling the interface on the switch reconnects it. I've built my own custom driver from the latest 9.011.00 and the issue persists.

Rolzzz commented 1 year ago

I gave up on the onboard NIC... POS for ESXi. Got a USB-C one and it's been rock solid ever since.

chrisp250 commented 1 year ago

Can you get a USB NIC without the CPU penalty? I read somewhere that USB-based NICs don't have access to DMA and therefore load the CPU.

Rolzzz commented 1 year ago

On my home system, I haven't noticed any extra unexplained CPU load under normal use. I see the CPU go up when I'm downloading big (high-seed) torrent files, but I also saw that on physical boxes before I virtualised my torrenting machine.

mcr-ksh commented 1 year ago

In the release there are a few scripts mentioned which I cannot find anywhere, nor do I know how to properly turn these settings on or off. Has anyone tried or found them?

/opt/r8125/temp.sh: Show NIC chipset temperature.
/opt/r8125/tx-off.sh: Turn off Tx offloading, when you cannot open guest openwrt web page, or lagging Windows network neighbor file copy.
/opt/r8125/tx-on.sh: Turn on Tx offloading, default.
/opt/r8125/tso-off.sh: Turn off TSO, default.
/opt/r8125/tso-on.sh: Turn on TSO, try this when you have a nice host PC.

chrisp250 commented 1 year ago

No worries. I switched to Proxmox and haven't had an issue since. The driver seems to be a lot more stable in Debian.

Rolzzz commented 1 year ago

No, I haven't seen those scripts either.

mcr-ksh commented 1 year ago

I think I just found the issue. I'm currently doing a full rewrite of the driver, and I was able to narrow it down to DAC.

image

[http://gauss.ececs.uc.edu/Courses/c4029/lectures/dma.pdf]

TSO doesn't work with DAC, and maybe ESXi 6.7 doesn't properly support it. Until I release my version, it can be tested via:

vmkload_mod r8125 enable_tso=1 enable_tx_csum=1 eee_enable=0 hwoptimize=1 tx_no_close_enable=1 enable_double_vlan=1 use_dac=0 autoneg_mode=1
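Options passed on the `vmkload_mod` command line only last until the module is next loaded. To keep the same options across reboots, they can be stored with `esxcli system module parameters set`. The sketch below assumes the parameter names from the comment above exist in the installed r8125 build (they are copied verbatim and not verified against other builds); `run` and `persist_r8125_params` are hypothetical helper names, and `DRY_RUN=1` only prints the commands.

```shell
#!/bin/sh
# Parameter names and values are copied from the comment above; they are
# an assumption for any r8125 build other than that author's.
R8125_PARAMS="enable_tso=1 enable_tx_csum=1 eee_enable=0 hwoptimize=1 tx_no_close_enable=1 enable_double_vlan=1 use_dac=0 autoneg_mode=1"

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Store the options so they are applied each time the module loads,
# then list the module's parameters to confirm what was stored.
persist_r8125_params() {
    run esxcli system module parameters set -m r8125 -p "$R8125_PARAMS"
    run esxcli system module parameters list -m r8125
}

# Usage:
#   DRY_RUN=1 persist_r8125_params   # show what would be executed
#   persist_r8125_params             # apply; takes effect on next module load
```

The stored string replaces whatever options were set for the module before, so the full list is given each time rather than only `use_dac=0`.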

mcr-ksh commented 1 year ago

https://github.com/mcr-ksh/r8125-esxi/releases/tag/net-r8125-9.011.00

rustiferch commented 1 month ago

Hi Team, Did this network drop-off issue ever get solved? Is it something that can be worked around?

Rolzzz commented 1 month ago

after the Broadcom acquisition, I moved my home VM lab over to Proxmox. RIP VMware

mcr-ksh commented 1 month ago

Hi, I pretty much got it under control after disabling IPv6. I found that the crash dumps I had screenshotted were mainly related to MLD, so I disabled IPv6 and the crashes stopped or became very rare. On top of that, with the settings of my own driver implementation, I'm quite stable now.
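For reference, IPv6 can be turned off host-wide from the ESXi shell. This is a minimal sketch of that workaround, not a confirmed fix: the thread does not say how IPv6 was disabled, so using `esxcli network ip set` here is an assumption, and `run` and `disable_ipv6` are hypothetical helper names. The change only takes effect after a host reboot.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable the host's IPv6 stack, then print the current IP settings so
# the stored value can be checked before rebooting.
disable_ipv6() {
    run esxcli network ip set --ipv6-enabled=false
    run esxcli network ip get
}

# Usage:
#   DRY_RUN=1 disable_ipv6   # show what would be executed
#   disable_ipv6             # apply, then reboot the host
```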