Some questions about bmv2 priority queues

walliak commented 2 years ago

I use the nsg-ethz/p4-learning and my bmv2 version is

(HEAD detached at 62a013a) I enabled the priority and recompile the bmv2, but I can not understand how the priority work. 1. s1<->h10 in mininet , I set link bw to about 12 Mbps
- h1(flow_h1, set priority 7, iperf about 8Mbps ,10s)->s1-> h10
- h2(flow_h2, set priority 5, iperf about 8Mbps, 10s)->s1-> h10 and the resoult in h10 are below

PKT_RATIO is the dropped packet's ratio

question: The strict priority seems to drop the lower priority flow's packets. I can not understand why the latency of flow_h1 is high, although lower than flow_h2. In my understanding, if flow_h1 and flow_h2 are in diffrent queues, flow_h1 should be 2~3 ms (below 10ms).

2. s1<->h10 in thrift-cmd.txt set_queue_rate 500 9 //port 9 is to h10 (pps 500 is about 6Mbps) in mininet the s1<-> h10 link bw is about 150Mbps

h1(flow_h1, set priority 7, iperf about 8Mbps ,10s)->s1-> h10
h2(flow_h2, set priority 5, iperf about 8Mbps, 10s)->s1-> h10 question: The method of set_queue_rate is equal to every priority port. The flow_h1 and flow_h2 seems to be in the different queue ,but flow_h1 and flow_h2 behave the the same. So what does the priority mean?

In the physics p4 switch (like tofino sw), does the traffic manager have the same priority scheduler or can the vendor design their own queue manager which is entirely different from the bmv2?

@jafingerhut

jfingerh commented 2 years ago

My understanding from earlier experiments I did a while back (sorry, not detailed results or scripts to share -- I don't think I have them any more), is that veth interfaces can store a FIFO queue of a significant number of packets, and if bmv2 tries to send them packets as fast as your CPU cores allow, it will accept them at that rate. Only when this internal veth buffer gets to its configured capacity will it cause bmv2 to send packets to the veth slower.

If you imagine a physical 12 Mbps link with constant bit rate, at the sending side, you can only transmit at 12 Mbps, with no "burst" faster than that at any time, and no queue that builds up inside of the link.

A veth with a configured max rate of 12 Mbps rate seems to operate more like the link above, with a FIFO queue in front of it, and until that buffer is full, you can enqueue packets in that FIFO queue much faster than 12 Mbps.

If true, then bmv2 is filling up that FIFO queue, and then it keeps adding packets to that FIFO queue only as fast as it drains, i.e. 12 Mbps. So there is a standing queue built up inside of the veth interface in the kernel, and it remains at a nearly constant depth during most of your experiment.

If you wanted your system to behave more like a system of physical cable links with constant bit rate, you would need to prevent bmv2 from sending packets to the veth interfaces faster than the configured 12 Mbps rate, so that no queue would ever build up inside of the veth.

jafingerhut commented 2 years ago

See this issue, which if anyone knows someone who might be interested in implementing that, would make bmv2 more useful for such emulation of low speed physical networks with constant bit rate links.

(edit: actually adding the link I forgot before): https://github.com/p4lang/behavioral-model/issues/1067

antoninbas commented 2 years ago

As @jafingerhut pointed out, the main issue is that the queueing system cannot enforce a rate across all the priority queues for a given port.

In the physics p4 switch (like tofino sw), does the traffic manager have the same priority scheduler or can the vendor design their own queue manager which is entirely different from the bmv2?

The bmv2 implementation for priority queues is not representative of any hardware implementation. But we welcome contributions to bmv2 which provide support for alternative scheduling implementations.

@walliak one idea for you is to reduce the buffer size for your veth interfaces as much as possible. This should give you the "correct" latency. I don't remember if this can be done through the Mininet API, but this can be done with the ip tool.

walliak commented 2 years ago

s1-eth9 is the one of the peer veth from s1 to h10, ifconfig -a s1-eth9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9500 ether 00:01:0a:00:01:0a txqueuelen 1000 (Ethernet) RX packets 2 bytes 180 (180.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

I use the command : sudo ip link set s1-eth9 txqueuelen 10

s1-eth9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9500 ether 00:01:0a:00:01:0a txqueuelen 10 (Ethernet) RX packets 4 bytes 3204 (3.2 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 10196 bytes 15416352 (15.4 MB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

In h10 another one is h10-eth0, I do the same work. But the resoult seems the same to above(1), I don't konw how to use ip tools or get more infomations to decrease the buffer size . @antoninbas

jafingerhut commented 2 years ago

I do not know the behavior and 'rules' for setting the txqueuelen parameter of a veth, but as an idea for a quick experiment to try to see if you get different results, try using the sudo ip link set <name> txqueuelen 10 command on "both ends" of the veth pair?

antoninbas commented 2 years ago

These look like the correct commands to me. Another thing is that there is an extra buffer on output in simple_switch. I have been meaning to remove it for a while now: https://github.com/p4lang/behavioral-model/blob/e1fcd5d54cecf7679f46ac462fdf92e049711e6c/targets/simple_switch/simple_switch.cpp#L214

It can hold 128 packets, which means that for 1500B packets at 12Mbps, it will incur 128ms of latency. As an exercise, you can change the size from 128 to 1.

But I think the bottom line is that some fundamental changes are needed in simple_switch to run that kind of experiment:

some rate limiting should be enforced cumulatively over all egress queues for a given port
the output buffer in simple_switch should probably be removed as it needlessly adds latency. Note that if rate limiting is correctly implemented in bmv2 and correctly configured, this buffer should always be empty even if it is not removed.
I feel like that there should be an option to get a dedicated system thread for each egress port (for egress processing) to avoid head of line blocking across ports. But maybe that's just a nice-to-have. Again if the first bullet point is implemented, it may not really be needed.

walliak commented 2 years ago

First, I do the following task,

set output_buffer to 1 and recompile bmv2
sudo ip link set s1-veth9 txqueuelen 1
in h10 ,sudo ip link set h10-eth0 txqueuelen 1

Then I do the test (1) again, s1<->h10 in mininet , I set link bw to about 12 Mbps h1(flow_h1, set priority 7, iperf about 8Mbps ,10s)->s1-> h10 h2(flow_h2, set priority 5, iperf about 8Mbps, 10s)->s1-> h10 and the resoult in h10 are below

The latency of h1 has reduced about 120ms from 210-220ms to 90-100ms, proving the output_buffer does work. And the small output_buffer cause the bandwidth fluctuation. But what cause the remained latency (90~100ms)?

I am sure my cpu can deal with almost 28Mbps ,so the remained latency is not the processed latency. I think there may be still extra buffer in the vether pair, the sudo ip link set <name> txqueuelen <N> may be not helpful and I am looking for it.

@antoninbas @jafingerhut

jfingerh commented 2 years ago

When I was experimenting with this before to try to find out where the latency was occurring, I used tcpdump to record packets at both end of the veth pair, and then sent packets that I knew had some field that was always incrementing from one packet to the next, e.g. if the IPv4 Identification field is already incrementing with each packet from whatever program you are using to generate packets, that can be used, otherwise you can modify your traffic generator to use some other field in the packet for this purpose. Then make plots of the time that tcpdump captures each packet on both sides of the veth pairs. That can help you see whether there is varying latency between the two ends of the veth pairs, at least.

walliak commented 2 years ago

@jafingerhut @antoninbas s1<->h10 in mininet , I set link bw to about 12 Mbps

h1(flow_h1, set priority 7, iperf about 8Mbps ,10s)->(s1-eth1)s1(s1-eth9)->(h10-eth0) h10 h2(flow_h2, set priority 5, iperf about 8Mbps, 10s)->s1-> h10

I use tcpdump listening on s1-eth1 s1-eth9 h10-eth0, by using iperf to generate traffic. Although the packets have the same fields, the timestamp of the same number of received packets on the three ifaces can cacultate the latency relatively,if the three programs 'tcpdump' start at the same time and packets don't drop and arrive in order.

And I find that the latency between s1-eth9 and h10-eth0 is small but the latency between s1-eth1 and s1-eth9 is large. It seems packets spend much time passing the switch, so I see the p4s.s1.log.

The log is about 80MBytes, I don't want to upload.I seek for some details from the log. This is one h1's packet whose priority is 7.

[18:40:47.645] [bmv2] [D] [thread 47357] [1953.0] [cxt 0] Looking up key:
* standard_metadata.enq_timestamp           : 011b2bce
* standard_metadata.enq_qdepth              : 000036
* standard_metadata.deq_timedelta           : 000000bb     (line 65800)
* standard_metadata.deq_qdepth              : 000036
* standard_metadata.qid                     : 07
* standard_metadata.egress_global_timestamp : 00000000011b2c88
* standard_metadata.ingress_global_timestamp: 00000000011b2a51

This is the h1's packet next to the above one which can log the queue information too.

[18:40:47.692] [bmv2] [D] [thread 47357] [1956.0] [cxt 0] Looking up key:
* standard_metadata.enq_timestamp           : 011b344c
* standard_metadata.enq_qdepth              : 000034
* standard_metadata.deq_timedelta           : 0000b1ab   (line129822)
* standard_metadata.deq_qdepth              : 000060
* standard_metadata.qid                     : 07
* standard_metadata.egress_global_timestamp : 00000000011be5f4
* standard_metadata.ingress_global_timestamp: 00000000011b3342

between the line 65800 and line129822 , there are many h1's packets but they don't log the queue infomations, why? I get the queue information by the following way.

From the log, latency is about 47ms(692-645=47ms)and deq_timedelta(0xb1ab=45483us) shows the same. It is strange that the deq_timedelta varies from 0xbb(187us) to 0xb1ab(45483us).I think many h1's packets flood into ingress but the TM doesn't deal with them,.

jafingerhut commented 2 years ago

You say "if the three programs 'tcpdump' start at the same time and packets don't drop and arrive in order".

If you have a field that uniquely identifies each packet, you can use that to detect whether packets are dropping, or arriving out of order, without assuming it.

Also, tcpdump should record absolute system times for each packet in the file, not relative times. Display programs often display relative times, because that is their default setting, and behind the scenes they are calculating differences between these absolute times to show you relative times. You don't need to start them near the same time in order to read the absolute times from the pcap files recorded.

jafingerhut commented 2 years ago

You ask: "between the line 65800 and line129822 , there are many h1's packets but they don't log the queue infomations, why?"

I would recommend finding a place to upload the 80 MByte file if you want others to take a look at it. That isn't terribly large for publishing in a public Github repo, for example, or on a free Google Drive account. Do not attach the log to a Github issue, only a link to it stored elsewhere.

Remember that a packet's ingress processing and egress processing can happen far apart in time, because of queueing latency. It is also possible that many packets print their logs doing ingress processing, between two packets doing egress processing. That might be why you see two egress logs very far apart in the log file with many other packets in between -- all of those packets in between might be ingress only processing logs for many packets.

walliak commented 2 years ago

Thanks for your detailed answers!

Actually, my main question is why h1's deq_timedelta is about 45ms (line 129822 ). 'between the line 65800 and line129822 , there are many h1's packets but they don't log the queue infomations' shows that some h1's packets are in the queues, I think. How to explain the high priority packets have about 45ms's queue latency?
I used Iperf as my traffic tool before and got some tcpdump logs. Maybe you can have a look, though not accurate. Then, I will use the scapy to add ip field to indentify the packet and get some logs.
the all log link : https://drive.google.com/drive/folders/1hgJJ_eEPXlD3l75vQNKqaVXcbL78pU9C?usp=sharing

jafingerhut commented 2 years ago

Note that even with using iperf, it is common for the IPv4 code in the Linux kernel to use consecutive (i.e. incrementing by 1 from 1 packet to the next) values for the IPv4 identification field. If you log that value in your P4 code with a statement like log_msg("packet_id {}", hdr.ipv4.id); (or whatever that field is named in your P4 code), you can see all of those values for every packet in the log. Or you can add that field as a key in your existing 'debug tables' if you prefer to keep that info together in one place. You can log the value once somewhere in your ingress code, and separately in your egress code, so see what the time differences are.

jafingerhut commented 2 years ago

Also first try getting the latency of packets when they are sent at a very low bit rate, e.g. something like 1 to 10 packets per second. That isn't guaranteed to give you the minimum latency, because I don't know whether veth interface reading of packets might be "saving up" packets before processing them, or have some kind of interrupt delaying technique after the first packet arrives in a while. However, it will at least give you a point of comparison for the high packet rate latencies you are measuring.

walliak commented 2 years ago

As you can see, there is an identification in IP fileds. And we can get some infomation to support earlier conclusion. I choose the same IP Identification: 0x9014 (36884) and the absolute time in the three vethes are : s1-eth1, 18:40:53.283238 s1-eth9, 18:40:53.377349 h10- eth0, 18:40:53.377353 So, the remained latency (90~100ms) comes from the time going through the switch.

antoninbas commented 2 years ago

I don't know why the queue grows that large for the highest priority flow (which seems to incur the extra latency), but I don't think this is an experiment you should be running with logging enabled. Performance goes down significantly with logging enabled, and you should definitely re-build bmv2 with --disable-logging-macros. See https://github.com/p4lang/behavioral-model/blob/main/docs/performance.md

walliak commented 2 years ago

I am sorry to forget it. I have definitely rebuild bmv2 with disabled logging ,but it seems to be the same result.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

xhy199904 commented 1 year ago

the same question, why the delay of high priority queue is so high

xhy199904 commented 1 year ago

@walliak do you solve this problem ？

p4lang / behavioral-model

Some questions about bmv2 priority queues #1066