nbd168 / bridger

bridger is not reliable (fails to register traffic) #3

Closed nicefile closed 7 months ago

nicefile commented 1 year ago

Some time after starting, bridger stops registering new connections, and /sys/kernel/debug/ppe0/bind stays empty for both new and existing traffic. After a while it starts working again. Running /etc/init.d/bridger restart fixes this for current traffic instantly.

Link to the forum thread where others confirm this: https://forum.openwrt.org/t/mt76-wireless-driver-debugging/154514/147?u=nicefile

build from 21-04-2023 @ cudy wr3000 mt7981

nicefile commented 1 year ago

r22967-f18cb0ba63 on the freshly supported wr3000 still doesn't register some of the connections. Running /etc/init.d/bridger restart fixes this for current traffic. To reproduce the issue, just run an iperf3 test between a wired and a wireless host. I see none of the CPU hogging that plagued the previous bridger build.
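To make the "bind stays empty" symptom easier to check while iperf3 is running, here is a minimal sketch. The helper name count_bound_flows is hypothetical, and it assumes that each non-empty line in /sys/kernel/debug/ppe0/bind corresponds to one bound flow; the exact file format is not described in this thread.

```shell
#!/bin/sh
# Hypothetical helper: count non-empty lines in the PPE bind table.
# Assumption: one non-empty line == one bound (offloaded) flow.
count_bound_flows() {
    bind_file="${1:-/sys/kernel/debug/ppe0/bind}"
    # Report a missing or unreadable file as zero flows.
    [ -r "$bind_file" ] || { echo 0; return 0; }
    # grep -c prints the count; it exits non-zero on 0 matches, hence "|| true".
    grep -c . "$bind_file" || true
}
```

Calling count_bound_flows with no argument reads the default path. If it keeps printing 0 while traffic is flowing between the wired and wireless hosts, that matches the behaviour reported here.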

Fail-Safe commented 11 months ago

I can confirm I am seeing the same as well. I posted details here: https://forum.openwrt.org/t/mt76-wireless-driver-debugging/154514/177?u=_failsafe

Running on RT3200 build r24615-25e215c14e. However, I am no longer seeing WED crashing as noted here: https://github.com/openwrt/mt76/issues/754#issue-1601750830

rany2 commented 11 months ago

This seems to solve the issue for me, but I'm not sure why it does:

diff --git a/flow.c b/flow.c
index 61564c0..c7a599a 100644
--- a/flow.c
+++ b/flow.c
@@ -160,7 +160,6 @@ bridger_flow_update_cb(struct uloop_timeout *timeout)
    avl_for_each_element_safe(&flows, flow, node, tmp) {
        avl_delete(&sorted_flows, &flow->sort_node);
        bridger_bpf_flow_update(flow);
-       bridger_nl_flow_offload_update(flow);
        avl_insert(&sorted_flows, &flow->sort_node);

        flow_debug_msg(flow, "Update");

I do not know what's wrong with https://github.com/nbd168/bridger/blob/3159bbe0a2ebcea9f209bbca88dcd5ac86f7a7f1/nl.c#L734-L739 but I don't think it's an issue in handle_filter(). I made handle_filter() a noop and there was no change, only not sending that RTM_GETTFILTER command to cmd_sock by not calling bridger_nl_flow_offload_update fixed it.

Of course, I'm sharing this only in the hope that it helps find the source of the issue, not for you to use the patch, though it does seem to work fine.

Fail-Safe commented 11 months ago

Very interesting find! I rebuilt the firmware for my three RT3200s including the change you made in flow.c and sure enough, I'm still seeing flows in /sys/kernel/debug/ppe0/bind even after about 20 minutes of uptime. Longest I've ever seen it keep working.


Update: This is wild! It is still working nearly 12 hours later!

I know @nbd168 has to be pulled in a million other directions, but hopefully he can give this a look and get some updates into bridger. 😃

rany2 commented 11 months ago

It's weird that RTM_GETTFILTER causes this issue because as far as I know, it shouldn't cause any changes. I can even trigger the issue again with while :; do tc -s filter show dev eth0 ingress >/dev/null; sleep 1; done and bridger_nl_flow_offload_update commented out like above.
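The trigger loop above can also be written as a bounded helper, which is handier when scripting a reproduction. This is only a sketch: poll_filters and the TC override variable are hypothetical names, and the only real command used is the tc invocation quoted in the comment above.

```shell
#!/bin/sh
# Sketch: dump the ingress filters of a device a fixed number of times.
# TC is a hypothetical override for the command to run; it defaults to tc.
poll_filters() {
    dev="$1"            # interface name, e.g. eth0
    rounds="$2"         # number of dumps to issue
    delay="${3:-1}"     # seconds to sleep between dumps
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Read-only filter dump with statistics, same command as in the loop above.
        ${TC:-tc} -s filter show dev "$dev" ingress
        i=$((i + 1))
        sleep "$delay"
    done
}
```

For example, poll_filters eth0 60 >/dev/null issues one dump per second for a minute, after which /sys/kernel/debug/ppe0/bind can be inspected to see whether new flows still appear.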

I think it could be a kernel bug, but I'm not sure.

imwhocodes commented 9 months ago

I'm still seeing this issue with "OpenWrt SNAPSHOT r25136-6497cdba09" and "bridger 2023-05-12-d0f79a16". Is there any update, or is it still better to keep WED disabled on a dumb AP?

Fail-Safe commented 9 months ago

The WED crash issue seems to be fixed. See details toward the end of: https://github.com/openwrt/mt76/issues/754#issue-1601750830

I'm using @rany2's patch from here and it has kept the WED offloading working for me.

imwhocodes commented 9 months ago

> The WED crash issue seems to be fixed. See details toward the end of: openwrt/mt76#754 (comment)
>
> I'm using @rany2's patch from here and it has kept the WED offloading working for me.

Thanks. So there is no pre-packaged build of it, and I need to build it myself?

Fail-Safe commented 9 months ago

Correct, at this point you'd have to build and patch yourself.

skramstad commented 8 months ago

I'm testing my bpi-r3 as a dumb AP on the latest snapshot. I have also tested kernel 6.6 with bridger, with the same result: bridger stops getting new flows after a minute or so...

But now, I've been testing bridger by removing this line. https://github.com/nbd168/bridger/issues/3#issuecomment-1865342049 And I can see new flows again.

-       bridger_nl_flow_offload_update(flow);

Thanks @rany2 👍

nicefile commented 8 months ago

@rany2 I took the liberty of creating a PR with your proposed workaround. Maybe this will catch @nbd168's attention.

nicefile commented 8 months ago

bridger with rany2 patch for OpenWrt 23.05.3 on my gdrive

gssjshark commented 8 months ago

how do we apply this patch? sorry, I am relatively new to openwrt. thanks for your help!

nicefile commented 8 months ago

@gssjshark Let's assume you're in the OpenWrt source folder:

mkdir -p package/network/services/bridger/patches
wget -O package/network/services/bridger/patches/10-fix-issue-3.patch "https://github.com/nbd168/bridger/pull/5/commits/c73bf1f80999db1fe5dbf5c082a9e77862b35d58.patch"

Then build your package or the whole firmware.

Alternatively, you can install the package for OpenWrt 23.05.3 from https://github.com/nbd168/bridger/issues/3#issuecomment-2016912962
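Before starting a long build, it may be worth checking that the download actually produced the expected patch. The sketch below assumes the PR commit contains the same one-line removal as the diff posted earlier in this thread; patch_removes_call is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical pre-build check: succeed only if the patch file is non-empty
# and contains a removed line calling bridger_nl_flow_offload_update(flow).
patch_removes_call() {
    patch_file="$1"
    # Fail early on a missing or empty download.
    [ -s "$patch_file" ] || return 1
    # In a unified diff, removed lines start with "-".
    grep -q '^-.*bridger_nl_flow_offload_update(flow);' "$patch_file"
}
```

A possible use is patch_removes_call package/network/services/bridger/patches/10-fix-issue-3.patch && make package/network/services/bridger/compile, which rebuilds only the bridger package rather than the whole firmware.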

Fail-Safe commented 8 months ago

@nbd168 Hey Felix, do you have any feedback around the findings from @rany2 in post https://github.com/nbd168/bridger/issues/3#issuecomment-1865342049?

nbd168 commented 7 months ago

Please try the latest version

rany2 commented 7 months ago

I'll test it out tomorrow, thanks as always for your efforts. Hopefully you could find a tester that can respond earlier.

Fail-Safe commented 7 months ago

@nbd168 Updated my build to run with https://github.com/nbd168/bridger/commit/c77a7a1ff74d9d4065270239240366c1e6bd9986. So far, so good.

After 50 minutes of uptime, I am still seeing flows when watching /sys/kernel/debug/ppe0/bind. I typically would have seen the flows "disappear" within a handful of minutes (often less than 5 mins). I'll give another update after I let this cook overnight and see how things look.

Thank you, @nbd168!

Fail-Safe commented 7 months ago

@nbd168 Still seeing flows 12+ hours later. Commit https://github.com/nbd168/bridger/commit/c77a7a1ff74d9d4065270239240366c1e6bd9986 seems golden, IMHO. Many thanks!

rany2 commented 7 months ago

I think this issue could be closed, seems solved for me.

nbd168 commented 7 months ago

thanks for testing!