nhorman / dropwatch

user space utility to interface to kernel dropwatch facility
GNU General Public License v2.0

dwcap appears to be missing from the centos 7 rpm #78

Closed hagfelsh closed 1 year ago

hagfelsh commented 1 year ago

As I understand it, dwcap captures packets that would or will be dropped so they can be examined retrospectively with wireshark.

Examining the dropwatch-1.4-9.el7.x86_64.rpm shows no such file, though the source rpm shows src/dwcap.c.

Am I misunderstanding how to find this particular tool?

bin
└── dropwatch
share
├── doc
│   └── dropwatch-1.4
│       ├── COPYING
│       └── README
└── man
    └── man1
        └── dropwatch.1.gz
nhorman commented 1 year ago

I think you're thinking of the dwdump utility, not dwcap, but no matter, you have the purpose of the utility correct. What you're missing is the fact that dwdump didn't get added to the dropwatch package until release 1.5.2, so it's certainly not going to be available in RHEL7's 1.4 package. Even if you build the latest dropwatch, the kernel support for using dwdump isn't going to be present in the RHEL7 kernel, so you're out of luck there, unless you want to write a very large check to IBM :)

hagfelsh commented 1 year ago

Oh how about that lol thanks for the quick reply!

hagfelsh commented 1 year ago

That prompts another question on the side; your tool is the standard for capturing dropped traffic in Linux. What else exists in the world that does anything like this? As you might have guessed, I'm trying to understand what's being dropped at the driver and I've not yet found any way to determine what it is.

nhorman commented 1 year ago

At the driver level you're generally left with 2 choices: 1) a custom bpf or systemtap program you write to monitor specific code paths, or 2) some custom driver-level interface debug tool

(2) isn't going to exist for any open source driver, but some proprietary drivers may have something for you

systemtap is usually a pretty good way to drill down on what you're looking for, but a better step 0 is to take a look at the data you have that is suggesting that you are dropping packets and brainstorm causes. What data do you have that is suggesting dropped packets in the driver?

hagfelsh commented 1 year ago

Yikes I'm at the edge of the world!

The only thing I have to support that it's the driver is that the drop increments are being reported in /sys/devices//net//statistics/rx_dropped, which I think I remember reading is provided by the driver. I can't find the kernel.org txt page that says that now, of course...
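For anyone following along, here is a minimal Python sketch of how that counter can be sampled to see how quickly it grows. It assumes the interface is also reachable under /sys/class/net (the usual symlink to the device path above); the function names and the `root` parameter are made up for illustration and are not part of dropwatch.

```python
import time
from pathlib import Path


def read_counter(iface, name="rx_dropped", root="/sys/class/net"):
    """Return the current value of one per-interface statistics counter."""
    # Assumption: <root>/<iface>/statistics/<name> holds a single decimal integer.
    path = Path(root) / iface / "statistics" / name
    return int(path.read_text().strip())


def watch(iface, interval=1.0, samples=5, root="/sys/class/net"):
    """Yield the increase in rx_dropped over each sampling interval."""
    previous = read_counter(iface, root=root)
    for _ in range(samples):
        time.sleep(interval)
        current = read_counter(iface, root=root)
        yield current - previous
        previous = current
```

Something like `for delta in watch("eth0"): print(delta)` would print the per-second increase, where "eth0" is a placeholder interface name. This only shows that the counter is moving, not why.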

nhorman commented 1 year ago

What driver?

hagfelsh commented 1 year ago

i40e for the X710.

nhorman commented 1 year ago

i40e grabs those stats from the hardware (mapping the software rx_dropped stat to its hardware rx_discards counter(s)). You can check the function i40e_stats_update_rx_discards to see how it works. It calls i40e_stat_update64, which pulls hardware stats from the chip and updates the software counter structures.

So you're kinda out of luck searching for a software drop in the driver, because there isn't any. The drops are occurring in the hardware prior to the driver ever receiving them. You can use ethtool -S to get more detailed stats, as the i40e driver I think breaks out drop stats to something a little more granular that might give you an idea of why this is happening. That said, usually the cause for something like this is an overrun - i.e. the data coming in on the hardware is getting hashed to a receive queue that the corresponding CPU can't keep up with, and so the hardware drops frames because the CPU isn't draining the queue fast enough. Suggest using ethtool to check queue lengths and hash destinations using the ntuple settings. It won't help you with a root cause, but you also might try enabling pause frames to prevent drops of this nature
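As a rough illustration of the ethtool -S step, the Python sketch below parses the "name: value" lines that ethtool -S prints and keeps only the non-zero counters whose names look drop-related. The helper names and the substring list in `DROP_HINTS` are assumptions chosen for the example; the actual counter names depend on the driver and version, so check the raw output before trusting the filter.

```python
import subprocess

# Assumption: substrings that commonly appear in drop-related counter names.
DROP_HINTS = ("drop", "discard", "miss")


def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        name, value = name.strip(), value.strip()
        # Skip the header line and anything that is not "name: integer".
        if sep and name and value.isdigit():
            stats[name] = int(value)
    return stats


def drop_counters(stats, hints=DROP_HINTS):
    """Return the non-zero counters whose names contain one of the hints."""
    return {
        name: value
        for name, value in stats.items()
        if value and any(hint in name.lower() for hint in hints)
    }


def read_ethtool_stats(iface):
    """Run `ethtool -S <iface>` and parse its output (requires ethtool)."""
    result = subprocess.run(
        ["ethtool", "-S", iface], check=True, capture_output=True, text=True
    )
    return parse_ethtool_stats(result.stdout)
```

Calling `drop_counters(read_ethtool_stats("eth0"))` twice a few seconds apart and comparing the results would show which specific counters are incrementing; "eth0" is again a placeholder.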

hagfelsh commented 1 year ago

This is marvelous advice, thank you so much!