p4lang / behavioral-model

The reference P4 software switch
Apache License 2.0

Bringing a bmv2 (simple_switch) interface down freezes the switch program in Ubuntu 22.04 #1215

Open edgar-costa opened 10 months ago

edgar-costa commented 10 months ago

There seems to be a problem with bmv2 (tested with simple_switch) when one of the switch interfaces is brought down. I ran exactly the same tests on three different VMs running Ubuntu 18.04, 20.04, and 22.04, and the problem I describe here occurs only on Ubuntu 22.04. I therefore suspect an interaction between veth and bmv2 on Ubuntu 22.04.

These are my findings:

How to replicate

System settings: Ubuntu 22.04, latest version of p4c and bmv2.

jafingerhut commented 10 months ago

It sounds very much like the code BMv2 uses to read packets from interfaces blocks indefinitely during the sequence of steps you describe, until a new packet is sent to the interface that was taken down (after it has been brought back up). Unfortunately, I do not know where that code is in the BMv2 implementation. If you want to track it down, I suggest starting from the call to input_buffer->pop_back(&packet); here: https://github.com/p4lang/behavioral-model/blob/main/targets/simple_switch/simple_switch.cpp#L483

and working your way back to the call that actually gets packets from the veth interfaces.

edgar-costa commented 10 months ago

Hi @jafingerhut, thanks for the reply!

Thanks for the pointer. I do not know much about the bmv2 implementation, but I can try to dig further from that call in the ingress_thread. However, given that this happens only on Ubuntu 22.04 and that the issue is triggered by an interface-down event, I suspect this is more of a kernel issue than a bug in bmv2, or possibly a combination of both. I will investigate further.

jafingerhut commented 10 months ago

Ah, sorry, I missed the point that only one Ubuntu version exhibits the problem. Any chance you could try an Ubuntu 23.04 system to see whether the problem also exists there, and record the Linux kernel versions of the systems you tested?

edgar-costa commented 10 months ago

I just tried Ubuntu 22.04 with kernel 6.2 (the original was 5.15), and the problem persists.

It might be too early to call this a kernel-only problem. It could be a combination of a change in the kernel's veth driver (drivers/net/veth.c) and something in how bmv2 binds to and interacts with the veth interfaces. Since I don't know the code very well, I have not found anything yet.

antoninbas commented 9 months ago

@edgar-costa this could be an issue with BMI, the libpcap wrapper for bmv2. There is a single thread that runs a loop and reads packets from all the interfaces, using select: https://github.com/p4lang/behavioral-model/blob/d56d5658e34ca68ae9efdd396f8eb54facc67a2a/src/BMI/bmi_port.c#L108
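To make the pattern concrete, here is a minimal sketch of a single-threaded select loop body over several file descriptors. It is not BMI's actual code: the function name poll_ports_once is hypothetical, plain file descriptors stand in for the per-interface libpcap handles, and the timeout argument is an assumption added for illustration. The thread does not say whether BMI's loop uses one.

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of BMI: wait up to timeout_ms for any of
 * the nfds descriptors to become readable, then read one chunk from each
 * ready descriptor. Returns the number of descriptors that yielded data,
 * 0 on timeout, or -1 if select() failed.
 * Assumes every descriptor is smaller than FD_SETSIZE.
 */
int poll_ports_once(const int *fds, int nfds, int timeout_ms)
{
    fd_set readable;
    int max_fd = -1;

    /* select() modifies the set, so it is rebuilt on every call. */
    FD_ZERO(&readable);
    for (int i = 0; i < nfds; i++) {
        FD_SET(fds[i], &readable);
        if (fds[i] > max_fd)
            max_fd = fds[i];
    }

    /* With a NULL timeout, select() would block until some fd is ready. */
    struct timeval timeout;
    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(max_fd + 1, &readable, NULL, NULL, &timeout);
    if (n <= 0)
        return n; /* 0: timed out, -1: error */

    int drained = 0;
    char buf[2048];
    for (int i = 0; i < nfds; i++) {
        if (!FD_ISSET(fds[i], &readable))
            continue;
        /* One read per ready descriptor; a real loop would process buf. */
        ssize_t len = read(fds[i], buf, sizeof buf);
        if (len > 0)
            drained++;
    }
    return drained;
}
```

Calling this in a loop with, say, a 100 ms timeout returns control to the caller periodically even when no descriptor is readable. A backtrace of a thread stuck in such a loop would show whether it is parked in select or in the subsequent read.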

If you have time, it would be helpful to run simple_switch under gdb. bmv2 has to be compiled with debug symbols and without optimization (-O0 -g should work). After you reproduce the "deadlock", dump a backtrace for all threads in gdb (thread apply all bt). I'm hoping that will help.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days