From a quick look at the behavioral-model code, and at when it can cause such an assert failure, one way it could happen is if you ran a P4 program that, during deparsing, attempted to emit headers whose total length exceeded 512 bytes. Is the P4 program you are running attempting to do this in some cases, perhaps?
If yes, this 512-byte limit is a constant value in the behavioral-model source code; it can be changed with a text editor, after which you would need to recompile behavioral-model.
However, if this is the case, I would stress that even though it might be very quick work to change the behavioral-model source code to enable emitting more than 512 bytes of headers in its deparser, other high-performance, low-cost-per-Tbps-throughput P4-programmable devices might have even lower limits than that, and they cannot be so easily modified. If your goal is to eventually run your P4 code on such devices, you might want to consider changes to your P4 code that avoid emitting so many header bytes.
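For concreteness, here is a minimal sketch (not the actual bmv2 code) of the kind of check that the failing assertion in include/bm/bm_sim/packet_buffer.h performs. The names data_size, bytes, and size are taken from the assertion message; the toy class and kHeadroom are assumptions standing in for the real PacketBuffer and the 512-byte constant:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal sketch (NOT the real bmv2 class) of the check behind
// "Assertion `data_size + bytes <= size' failed" in PacketBuffer::push().
// The names data_size, bytes, and size follow the assertion message;
// kHeadroom is an assumed stand-in for the 512-byte constant.
class ToyPacketBuffer {
 public:
  explicit ToyPacketBuffer(std::size_t payload_len)
      : size(payload_len + kHeadroom),
        buffer(size),
        data_size(payload_len) {}

  // Called once per header emitted by the deparser: each call consumes
  // part of the fixed headroom, and the assert fires once it runs out.
  char *push(std::size_t bytes) {
    assert(data_size + bytes <= size);  // the check that failed
    data_size += bytes;
    return buffer.data() + (size - data_size);
  }

 private:
  static constexpr std::size_t kHeadroom = 512;  // assumed value
  std::size_t size;               // total capacity
  std::vector<char> buffer;       // backing storage
  std::size_t data_size;          // bytes currently in use
};
```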
Yes, @jafingerhut is correct.
Note that I don't see any reason why we couldn't add a command-line option to change it at runtime for simple_switch / simple_switch_grpc, something like --max-headers-size. If this is something you are interested in, feel free to submit a PR.
I think I am limiting the header length, but it still goes wrong, so I recompiled bmv2 with the debug flag enabled.
For normal processing, the log looks like this:
[07:03:52.479] [bmv2] [T] [thread 26357] [122.1] [cxt 0] ./src/slot2.p4(901) Condition "hdr.ipv4.totalLen > 500" (node_132) is false
[07:03:52.479] [bmv2] [D] [thread 26357] [122.1] [cxt 0] Pipeline 'egress': end
[07:03:52.479] [bmv2] [D] [thread 26357] [122.1] [cxt 0] Deparser 'deparser': start
[07:03:52.479] [bmv2] [D] [thread 26357] [122.1] [cxt 0] Updating checksum 'cksum'
[07:03:52.479] [bmv2] [D] [thread 26357] [122.1] [cxt 0] Deparsing header 'ethernet'
[07:03:52.479] [bmv2] [D] [thread 26357] [122.1] [cxt 0] Deparsing header 'ipv4'
For the error case, the log shows:
[07:04:44.648] [bmv2] [T] [thread 26472] [45.42] [cxt 0] ./src/slot2.p4(901) Condition "hdr.ipv4.totalLen > 500" (node_132) is false
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Pipeline 'egress': end
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Cloning packet at egress
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Cloning packet to egress port 5
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Deparser 'deparser': start
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Updating checksum 'cksum'
simple_switch: ../../include/bm/bm_sim/packet_buffer.h:90: char* bm::PacketBuffer::push(size_t): Assertion `data_size + bytes <= size' failed.
It seems like it goes wrong before Deparsing header 'ethernet'. Also, how can I get a log of how long my headers will be when deparsed? In P4 I check hdr.ipv4.totalLen > 500 and it is false (whenever I setValid or setInvalid a header, I also modify totalLen).
Since the error happened after a lot of identical packets were processed successfully, I'm not quite sure how to debug this from the logs.
If you are willing and able to publish a complete test case, including the P4 source code, the commands you used to compile it, the command-line options given to simple_switch / simple_switch_grpc when you started it, the control plane operations that added table entries before this asserting packet was processed, and the contents of that packet, such that someone else could reproduce it, perhaps they may discover the root cause.
If I were going to debug this, I would add my own debug print statements in the C++ source code at various places to see what the values of data_size, bytes, and size were just before the assert statement fired, then try working backwards from there to earlier points in the program to see what is going wrong.
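For example, here is a sketch of the kind of temporary print one might add just before the failing assert in include/bm/bm_sim/packet_buffer.h. Only the names data_size, bytes, and size come from the assertion message; the surrounding toy struct is an assumption, not the real bm::PacketBuffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Toy stand-in for bm::PacketBuffer, only to show where a debug print
// could go. In the real code the print would be added in push() in
// include/bm/bm_sim/packet_buffer.h, right before the assert.
struct ToyBuffer {
  std::size_t data_size = 0;
  std::size_t size = 512;

  char *push(std::size_t bytes) {
    // Temporary debug output: the three values the assert compares.
    std::fprintf(stderr, "push: data_size=%zu bytes=%zu size=%zu\n",
                 data_size, bytes, size);
    assert(data_size + bytes <= size);
    data_size += bytes;
    return nullptr;  // the real function returns a pointer into the buffer
  }
};
```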
As another guess, I see in the failing case you are doing an egress clone operation. Are you perhaps doing an egress clone of a packet "in a loop"? For example, packet A arrives to ingress, then goes to egress processing, where it does an egress-to-egress clone operation, creating clone packet C1? Then is C1 during egress processing also itself doing another egress-to-egress clone operation, creating packet C2? And C2 might do another clone operation, etc? I would guess that if you are, and the total number of header bytes emit'd in the deparser for all egress passes of the same original packet exceeds 512, that might also cause this assertion. That is only a guess on my part, though.
Oh, notice this in your log messages:
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Pipeline 'egress': end
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Cloning packet at egress
[07:04:44.648] [bmv2] [D] [thread 26472] [45.42] [cxt 0] Cloning packet to egress port 5
See the [45.42] part? The 45 is a number that increments for every packet that arrives at simple_switch during the run. The .42 after it is a number that increments every time you make another multicast or clone copy of the original packet, i.e. the log messages for the original packet should be labeled [45.0]; if you create a clone copy of that, the messages for processing that clone copy will have [45.1]; if you create a clone of the clone, those will have [45.2], etc.
That lends some evidence that what I asked about in my previous comment may actually be happening: you are doing a clone, of a clone, of a clone, etc, in a loop that repeats at least 42 times, and it is failing on the 42nd clone copy. There might also be multicast involved in ingress on the first time the packet arrived that could increase the number of copies, too.
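Here is a toy illustration of the arithmetic behind that guess. Both numbers are assumptions: 512 bytes of headroom in total, and an arbitrary 12 bytes of headers emitted per egress pass, chosen only so that the toy fails on copy 42; the real per-pass header size depends on your P4 program, and whether the passes really share one buffer is exactly the part that is a guess:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

int main() {
  // Assumed numbers: 512 bytes of headroom in total, 12 bytes of
  // headers emitted per egress pass of the same original packet.
  const std::size_t headroom = 512;
  const std::size_t bytes_per_pass = 12;

  std::size_t data_size = 0;
  for (int copy = 0; copy <= 45; ++copy) {
    std::printf("copy %d: data_size=%zu before emitting\n", copy, data_size);
    // Same shape as the failing check in PacketBuffer::push();
    // with these assumed numbers it fails on copy 42.
    assert(data_size + bytes_per_pass <= headroom);
    data_size += bytes_per_pass;
  }
}
```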
Thank you for your reply! I think I have found the cause.
Here's one more question: where can I find documentation about the log format? I didn't know what [45.42] meant until your reply, so I would also like to know the meaning of fields like [cxt 0] or [thread xxxxx], which may help me solve later questions by myself.
And considering that the cloned packets will exceed the header limitation, could digest be a workaround for this problem?
When I run Mininet and send multiple packets continuously, say around 50 packets, Mininet dies. In the switch log, there is an error (the PacketBuffer assertion failure):
My bmv2 version is commit b447ac4c0cfd83e5e72a3cc6120251c1e91128ab.
How can I know the limit of PacketBuffer, and what should I be careful of when I want to avoid this error?