tohojo / sqm-scripts

SQM scripts traffic shaper
http://www.bufferbloat.net/projects/cerowrt/wiki/Smart_Queue_Management

ipq806x / Netgear R7800: difference in download speeds with cake and simple/fq_codel (possibly due to HTB ) #48

Closed hnyman closed 7 years ago

hnyman commented 7 years ago

I have a new router, a Netgear R7800, which is a 1.7 GHz dual-core IPQ8065 device. I have been building and testing new LEDE builds on it and have stumbled upon strange behaviour with SQM using simple/fq_codel:

Data with wired LAN connection:

It is almost as if simple/fq_codel on ipq806x produces a realised download speed of only some 75% of the set limit. With the old trusty ar71xx/WNDR3800 the speed gets higher, up to the hardware limits.

This almost looks like there is some kind of calculation bug in the fq_codel code when compiled for IPQ806x (the code base is arm_cortex-a15_neon-vfpv4).

Any clues as to what could cause this? How should I debug it?

tohojo commented 7 years ago

The absolute speed difference between what you configure and what Flent measures is not too worrisome. There's quite a bit of overhead in-between there (Flent measures at the application level, and doesn't include the overhead of the latency measurement flows).

However, the difference between Cake and HTB is odd. I'd start by looking at the output of tc -s for both setups. Best if you do a restart of SQM and post the output after a test run; then we also get drop stats.

Also, actually looking at the CPU usage would be useful. Flent can capture it if you add '--test-parameter cpu_stats_hosts=root@router' to the invocation (and have a suitable private key login established).
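
For reference, a full invocation might look something like this (the netserver host and router address are placeholders, and this assumes passwordless root SSH to the router is set up):

flent rrul -l 60 -H netperf.example.org \
      --test-parameter cpu_stats_hosts=root@192.168.1.1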

-Toke

hnyman commented 7 years ago

Well, the CPU stats option seems to produce lots of "sleep: invalid number '0.20'" errors, and I haven't gotten it to produce data for Flent.

But using LuCI statistics, I see both cores at something like 4-5% utilisation during a Flent run. And the CPU frequency scales from 384 MHz up to about 1.3 GHz, so there is still plenty of CPU headroom.

tc output for simple:

root@lede:~# tc -s qdisc show
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 1: dev eth0 root refcnt 2 r2q 10 default 12 direct_packets_stat 0 direct_qlen 1000
 Sent 148659567 bytes 556934 pkt (dropped 4351, overlimits 779702 requeues 10)
 backlog 0b 0p requeues 10
qdisc fq_codel 110: dev eth0 parent 1:11 limit 1001p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 2067 bytes 22 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 181 drop_overlimit 0 new_flow_count 22 ecn_mark 0
  new_flows_len 1 old_flows_len 0
qdisc fq_codel 120: dev eth0 parent 1:12 limit 1001p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 148657500 bytes 556912 pkt (dropped 4351, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 28766 drop_overlimit 0 new_flow_count 134590 ecn_mark 0
  new_flows_len 1 old_flows_len 5
qdisc fq_codel 130: dev eth0 parent 1:13 limit 1001p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc ingress ffff: dev eth0 parent ffff:fff1 ----------------
 Sent 1350036951 bytes 995725 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth1 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 7255991125 bytes 5268555 pkt (dropped 0, overlimits 0 requeues 341)
 backlog 0b 0p requeues 341
  maxpacket 27252 drop_overlimit 0 new_flow_count 192945 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan1 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 1: dev ifb4eth0 root refcnt 2 r2q 10 default 10 direct_packets_stat 0 direct_qlen 32
 Sent 1378833367 bytes 994594 pkt (dropped 602, overlimits 705085 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 110: dev ifb4eth0 parent 1:10 limit 1001p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1378833367 bytes 994594 pkt (dropped 602, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 36336 drop_overlimit 0 new_flow_count 91349 ecn_mark 0
  new_flows_len 1 old_flows_len 4
root@lede:~#

cake:

root@lede:~# tc -s qdisc show
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 800d: dev eth0 root refcnt 2 bandwidth 8Mbit besteffort flows rtt 100.0ms raw
 Sent 149129205 bytes 713102 pkt (dropped 6356, overlimits 1059366 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 229684b of 4Mb
 capacity estimate: 8Mbit
                 Tin 0
  thresh         8Mbit
  target         5.0ms
  interval     100.0ms
  pk_delay      18.7ms
  av_delay       3.4ms
  sp_delay         1us
  pkts          719458
  bytes      158750721
  way_inds           0
  way_miss          57
  way_cols           0
  drops           6356
  marks              0
  sp_flows           1
  bk_flows           1
  un_flows           0
  max_len        15140

qdisc ingress ffff: dev eth0 parent ffff:fff1 ----------------
 Sent 1758959542 bytes 1286134 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth1 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 9056653992 bytes 6554425 pkt (dropped 0, overlimits 0 requeues 343)
 backlog 0b 0p requeues 343
  maxpacket 27252 drop_overlimit 0 new_flow_count 208389 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan1 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 800e: dev ifb4eth0 root refcnt 2 bandwidth 100Mbit besteffort flows rtt 100.0ms raw
 Sent 1801028976 bytes 1285401 pkt (dropped 733, overlimits 823485 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 355692b of 5000000b
 capacity estimate: 100Mbit
                 Tin 0
  thresh       100Mbit
  target         5.0ms
  interval     100.0ms
  pk_delay       133us
  av_delay         8us
  sp_delay         2us
  pkts         1286134
  bytes     1802138738
  way_inds           0
  way_miss          58
  way_cols           0
  drops            733
  marks              0
  sp_flows           1
  bk_flows           1
  un_flows           0
  max_len        37850

root@lede:~#
tohojo commented 7 years ago

Ah, right, the Flent CPU measurement doesn't actually work on *wrt. Sorry, forgot about that. But if luci shows a couple of percent of CPU usage, that's probably fine.

The HTB output doesn't show any rate information, so hard to see what is going on. @moeller0 any ideas?

Only thing that comes to mind is that you may have enabled overhead compensation (for ATM or just a per-packet overhead) for HTB but not for cake?

moeller0 commented 7 years ago

Well, tc -d -s class show dev eth0 ; tc -d -s class show dev ifb4eth0 should show the "missing" rates for HTB as well. Hnyman is quite experienced (more so than I), so I assume he monitored CPU usage well; that said, what I typically do if pressed for time is ssh into the router, run "top -d 1" on the command line, and simply watch the second line, e.g.:

CPU: 0% usr 1% sys 0% nic 97% idle 0% io 0% irq 0% sirq

with idle and sirq being the most important values: once idle approaches 0 all bets are off, and if sirq gets too high HTB will temporarily choke. Unfortunately busybox top will not allow refresh rates higher than 1 second. I also seem to remember that HTB and cake behave differently under CPU starvation: HTB takes a much bigger bandwidth hit than cake (while cake is prone to allow a bit more latency under continuous CPU starvation); maybe there are small cyclic CPU "stalls" that cake simply averages over while HTB misses a few TX opportunities...
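
A minimal sketch for logging that CPU line once a second during a test run (assuming the router's busybox top supports batch mode; otherwise just watch "top -d 1" interactively as described):

# print only the CPU summary line every second; stop with Ctrl-C
while true; do
    top -b -n 1 | grep '^CPU:'
    sleep 1
done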

Best Regards

moeller0 commented 7 years ago

@hnyman I believe busybox sleep will only accept integers as inputs with seconds being the smallest unit:

LEDE r2155 (your build, thanks):

root@nacktmulle:~# sleep 0.2
sleep: invalid number '0.2'
root@nacktmulle:~# sleep 2
root@nacktmulle:~#

Suse Leap 42.N:

moeller@happy-horse:~> sleep 0.2s
moeller@happy-horse:~>

Maybe installing GNU coreutils can give you a more complete sleep binary on the router?
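
On LEDE that would presumably be something along these lines (package name assumed to be the usual split coreutils package; adjust if the feed names it differently):

opkg update
opkg install coreutils-sleep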

Best Regards

hnyman commented 7 years ago

I tested again. This time I used "htop" to monitor CPU utilisation on screen, and cake seems to cause higher utilisation...

Similarly "top" shows mostly idle cpu:

Mem: 146584K used, 332312K free, 1596K shrd, 5576K buff, 19532K cached
CPU:   0% usr   3% sys   0% nic  92% idle   0% io   0% irq   4% sirq
Load average: 0.01 0.03 0.00 1/95 23093

Below are the tc outputs again, this time with the refined command from @moeller0.

CAKE

root@lede:~# tc -s qdisc show
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8010: dev eth0 root refcnt 2 bandwidth 8Mbit besteffort flows rtt 100.0ms raw
 Sent 149387045 bytes 657814 pkt (dropped 6694, overlimits 959057 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 134892b of 4Mb
 capacity estimate: 8Mbit
                 Tin 0
  thresh         8Mbit
  target         5.0ms
  interval     100.0ms
  pk_delay       2.1ms
  av_delay       1.1ms
  sp_delay         6us
  pkts          664508
  bytes      159521761
  way_inds           0
  way_miss          75
  way_cols           0
  drops           6694
  marks              0
  sp_flows           1
  bk_flows           1
  un_flows           0
  max_len        15140

qdisc ingress ffff: dev eth0 parent ffff:fff1 ----------------
 Sent 1586305031 bytes 1173001 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth1 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 10685655661 bytes 7740651 pkt (dropped 0, overlimits 0 requeues 424)
 backlog 0b 0p requeues 424
  maxpacket 27252 drop_overlimit 0 new_flow_count 296056 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan1 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc cake 8011: dev ifb4eth0 root refcnt 2 bandwidth 90Mbit besteffort flows rtt 100.0ms raw
 Sent 1620233487 bytes 1172114 pkt (dropped 887, overlimits 783643 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 271220b of 4500000b
 capacity estimate: 90Mbit
                 Tin 0
  thresh        90Mbit
  target         5.0ms
  interval     100.0ms
  pk_delay        31us
  av_delay         3us
  sp_delay         2us
  pkts         1173001
  bytes     1621576405
  way_inds           0
  way_miss          77
  way_cols           0
  drops            887
  marks              0
  sp_flows           1
  bk_flows           1
  un_flows           0
  max_len        36336

root@lede:~# tc -d -s class show dev eth0 ; tc -d -s class show dev ifb4eth0
class cake 8010:2ee parent 8010:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 162 count 9 lastcount 0 ldelay 0us
class cake 8010:315 parent 8010:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 139 count 0 lastcount 0 ldelay 0us
class cake 8011:272 parent 8011:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit -60 count 0 lastcount 0 ldelay 0us
class cake 8011:315 parent 8011:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 1322 count 0 lastcount 0 ldelay 0us

SIMPLE / FQ_CODEL

root@lede:~# tc -s qdisc show
qdisc noqueue 0: dev lo root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 1: dev eth0 root refcnt 2 r2q 10 default 12 direct_packets_stat 0 direct_qlen 1000
 Sent 148811573 bytes 537433 pkt (dropped 5082, overlimits 758668 requeues 7)
 backlog 0b 0p requeues 7
qdisc fq_codel 110: dev eth0 parent 1:11 limit 1001p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 2807 bytes 31 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 114 drop_overlimit 0 new_flow_count 31 ecn_mark 0
  new_flows_len 1 old_flows_len 0
qdisc fq_codel 120: dev eth0 parent 1:12 limit 1001p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 148808766 bytes 537402 pkt (dropped 5082, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 36336 drop_overlimit 0 new_flow_count 139173 ecn_mark 0
  new_flows_len 1 old_flows_len 1
qdisc fq_codel 130: dev eth0 parent 1:13 limit 1001p flows 1024 quantum 300 target 5.0ms interval 100.0ms ecn
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 0 drop_overlimit 0 new_flow_count 0 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc ingress ffff: dev eth0 parent ffff:fff1 ----------------
 Sent 1292312571 bytes 958573 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 0: dev eth1 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 12005056876 bytes 8698425 pkt (dropped 0, overlimits 0 requeues 502)
 backlog 0b 0p requeues 502
  maxpacket 27252 drop_overlimit 0 new_flow_count 340656 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc noqueue 0: dev br-lan root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc noqueue 0: dev wlan1 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc htb 1: dev ifb4eth0 root refcnt 2 r2q 10 default 10 direct_packets_stat 0 direct_qlen 32
 Sent 1319696889 bytes 957337 pkt (dropped 647, overlimits 703953 requeues 0)
 backlog 0b 0p requeues 0
qdisc fq_codel 110: dev ifb4eth0 parent 1:10 limit 1001p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1319696889 bytes 957337 pkt (dropped 647, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 28766 drop_overlimit 0 new_flow_count 91434 ecn_mark 0
  new_flows_len 1 old_flows_len 5
root@lede:~# tc -d -s class show dev eth0 ; tc -d -s class show dev ifb4eth0
class htb 1:11 parent 1:1 leaf 110: prio 1 quantum 1500 rate 128Kbit ceil 2666Kbit linklayer ethernet burst 1600b/1 mpu 0b overhead 0b cburst 1599b/1 mpu 0b overhead 0b level 0
 Sent 3367 bytes 33 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 33 borrowed: 0 giants: 0
 tokens: 1087890 ctokens: 52228

class htb 1:1 root rate 8Mbit ceil 8Mbit linklayer ethernet burst 1600b/1 mpu 0b overhead 0b cburst 1600b/1 mpu 0b overhead 0b level 7
 Sent 148815420 bytes 537457 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 374409 borrowed: 0 giants: 0
 tokens: 17406 ctokens: 17406

class htb 1:10 parent 1:1 prio 0 quantum 1500 rate 8Mbit ceil 8Mbit linklayer ethernet burst 1600b/1 mpu 0b overhead 0b cburst 1600b/1 mpu 0b overhead 0b level 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 25000 ctokens: 25000

class htb 1:13 parent 1:1 leaf 130: prio 3 quantum 1500 rate 1333Kbit ceil 7984Kbit linklayer ethernet burst 1599b/1 mpu 0b overhead 0b cburst 1598b/1 mpu 0b overhead 0b level 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 150031 ctokens: 25046

class htb 1:12 parent 1:1 leaf 120: prio 2 quantum 1500 rate 1333Kbit ceil 7984Kbit linklayer ethernet burst 1599b/1 mpu 0b overhead 0b cburst 1598b/1 mpu 0b overhead 0b level 0
 Sent 148812053 bytes 537424 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 129295 borrowed: 374409 giants: 0
 tokens: 141966 ctokens: 23699

class fq_codel 110:184 parent 110:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit -186 count 0 lastcount 0 ldelay 7us
class fq_codel 120:245 parent 120:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 214 count 0 lastcount 0 ldelay 8us
class fq_codel 120:31f parent 120:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 110 count 0 lastcount 0 ldelay 6us
class fq_codel 120:332 parent 120:
 (dropped 1, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 182 count 1 lastcount 1 ldelay 9us
class htb 1:10 parent 1:1 leaf 110: prio 0 quantum 10500 rate 90Mbit ceil 90Mbit linklayer ethernet burst 1586b/1 mpu 0b overhead 0b cburst 1586b/1 mpu 0b overhead 0b level 0
 Sent 1319703827 bytes 957362 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 562061 borrowed: 0 giants: 0
 tokens: 1581 ctokens: 1581

class htb 1:1 root rate 90Mbit ceil 90Mbit linklayer ethernet burst 1586b/1 mpu 0b overhead 0b cburst 1586b/1 mpu 0b overhead 0b level 7
 Sent 1319703827 bytes 957362 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 1581 ctokens: 1581

class fq_codel 110:14d parent 110:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 1056 count 0 lastcount 0 ldelay 10us
class fq_codel 110:2cf parent 110:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 453 count 0 lastcount 0 ldelay 6us
class fq_codel 110:38f parent 110:
 (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  deficit 1396 count 0 lastcount 0 ldelay 11us
root@lede:~#
tohojo commented 7 years ago

Okay, a couple of things, not sure which ones are significant:

hnyman commented 7 years ago

I got similar results with layer_cake and tripleisolated_llt_cake, so that is likely not the reason. I just wanted to keep things simple and made the last test with piece_cake.

I will test that offload setting.

But as a side note, the same thing has been noticed for the R7800 in the dd-wrt forum. Discussion e.g. at http://www.dd-wrt.com/phpBB2/viewtopic.php?p=1050453#1050453
So this does not sound like a problem isolated to my setup.

tohojo commented 7 years ago

Well, in that case it sounds like it has something to do with the way HTB runs on that CPU. You could check whether TBF shows the same behaviour: enable SQM, then issue the following commands to replace the configured qdiscs with a TBF-based setup (provided TBF is in LEDE; not sure if it is):

tc qdisc del dev eth0 root
tc qdisc del dev ifb4eth0 root
tc qdisc add dev eth0 handle 1: root tbf rate 8Mbit burst 15140 latency 100ms
tc qdisc add dev ifb4eth0 handle 1: root tbf rate 90Mbit burst 15140 latency 100ms
tc qdisc add dev eth0 handle 2: parent 1:1 fq_codel
tc qdisc add dev ifb4eth0 handle 2: parent 1:1 fq_codel
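
To get back to the configured SQM setup afterwards, restarting the service should be enough to reinstall the original qdiscs (a sketch, assuming the standard sqm init script):

/etc/init.d/sqm stop
/etc/init.d/sqm start
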
hnyman commented 7 years ago

Great suggestion!!!!

TBF is part of kmod-sched (like HTB), so I have it in my build. I did a quick and dirty Ookla speedtest, and the resulting change is impressive:

So, an immediate jump of 8 Mbit/s in download speed, and the throughput is now much closer to the set limit of 90 / 8.

tohojo commented 7 years ago

Aha! Well, I guess the culprit is something in the convoluted mess that is HTB. No idea what.

Unfortunately, I think we need some of the features that HTB adds on top of TBF for simple.qos; but we can maybe use TBF for simplest.qos.

Just to double-check, what happens if you just run a vanilla simplest.qos (i.e. HTB with only one priority tier)?

hnyman commented 7 years ago

Applying "simplest" and running the speedtest gives 77 / 7.5 again, so HTB's performance hit is visible there too.

I will test the new strategies (simplest and your commands) again with flent to get a better picture, but HTB's weak performance on this platform might be the reason.

tohojo commented 7 years ago

Right. Note that the TBF commands above also use a different 'burst' parameter from what is used for HTB, which may also explain the performance difference. You can check this by using the same burst values for TBF as you see in the HTB output...
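
Concretely, that would mean something along these lines, with the burst values copied from the HTB class output above (then re-attach the fq_codel leaves as in the earlier commands):

tc qdisc replace dev eth0 root handle 1: tbf rate 8Mbit burst 1600 latency 100ms
tc qdisc replace dev ifb4eth0 root handle 1: tbf rate 90Mbit burst 1586 latency 100ms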

tohojo commented 7 years ago

@moeller0 do you have any opinion on whether it would be feasible to get rid of HTB in favour of TBF in simplest.qos? Not sure if the stab and linklayer stuff is supported by TBF?

hnyman commented 7 years ago

I just looked at the HTB code on the kernel.org site and this commit popped up: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/net/sched/sch_htb.c?id=a9efad8b24bd22616f6c749a6c029957dc76542b

net_sched: avoid too many hrtimer_start() calls

I found a serious performance bug in packet schedulers using hrtimers. sch_htb and sch_fq are definitely impacted by this problem. ... This issue is particularly visible when multiple cpus can queue/dequeue packets on the same qdisc, as hrtimer code has to lock a remote base.

The commit is in the stable kernel 4.8 but not in kernel 4.4 that LEDE uses. And my R7800 is a dual-core device...

tohojo commented 7 years ago

Nice find. Looks like it could be related. Looks fairly straight-forward to backport, so if you're feeling adventurous, maybe stick it in your LEDE build and see if it helps? :)
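
One way to do that in a LEDE buildroot would be roughly the following (the patch file name/number and URL form are illustrative; check that the patch still applies cleanly against 4.4):

wget -O target/linux/generic/patches-4.4/030-net_sched-avoid-too-many-hrtimer_start-calls.patch \
  "https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=a9efad8b24bd22616f6c749a6c029957dc76542b"
make target/linux/{clean,prepare} V=s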

hnyman commented 7 years ago

I applied that HTB patch from Linux upstream and it seems to bring some 1.5-2.0 Mbit/s improvement, so it helps somewhat in closing the gap.

There seem to be other patches for HTB upstream as well, but I have not yet looked at them more closely.

moeller0 commented 7 years ago

"@moeller0 do you have any opinion on whether it would be feasible to get rid of HTB in favour of TBF in simplest.qos? Not sue if the stab and linklayer stuff is supported by TBF?"

Hi Toke, sorry it took me a while to see this. stab and linklayer should work with TBF (but I have not tested that), and for testing, having a simplest_TBF.qos might be a good idea. I would not really go changing the bowels of simplest.qos or simple.qos at this point in time (except for fixing real bugs), as that would invalidate all/most reference tests so far...
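
A minimal sketch of what such a TBF-based shaper could look like with the stab/linklayer options attached (untested; overhead and linklayer values purely illustrative):

tc qdisc add dev eth0 root handle 1: stab overhead 44 linklayer atm tbf rate 8Mbit burst 3000 latency 100ms
tc qdisc add dev eth0 parent 1:1 handle 110: fq_codel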

Edit: spelling fixes

tohojo commented 7 years ago

Right. I pushed a simplest_tbf.qos script that tries this. Testing welcome - in particular, @hnyman could you check if this gives the same performance as your previous test?

hnyman commented 7 years ago

I built SQM from https://github.com/tohojo/sqm-scripts/commit/699f5d014b23790a3ac4e8280a14ac740cd5c3e4 and tested briefly.

With the Ookla speedtest, simplest_tbf performed OK and caught up with cake pretty well:

Ookla speedtest
LIMITS:          90   / 8

simple           80.0 / 7.8
simplest         78.5 / 7.9
simplest_tbf     85.3 / 7.4
piece_cake       85.5 / 7.5
layer_cake       85.5 / 7.3
triple_llt_cake  85.4 / 7.5
default_cake     85.5 / 7.6

(default_cake is cake with as much current cake defaults as possible)

HTB/fq_codel also performs much better in the speedtest; the difference to cake is not that big any more. The HTB patch discussed above has been integrated into LEDE and has improved HTB performance somewhat. The build also includes several other scheduling performance backports from @nbd that are in his staging tree as https://git.lede-project.org/?p=lede/nbd/staging.git;a=commit;h=ded9d3dc303da8752eef0a4ae910f8a325dd21ec

(One factor might also be the musl-1.1.16 C library, which this current build is based on.)

With flent I tested simple, simplest_tbf and triple_cake. There is still a clear performance hit for simple, much more pronounced than in the Ookla speedtest:

flent
LIMITS:          90 / 8 / latency

simple           70 / 7 / 20
simplest_tbf     84 / 6 / 18
triple_llt_cake  85 / 6 / 19

Apparently something still causes the throttling to be calculated wrong when the traffic is more complex, as it is with flent. But simplest_tbf performs well.

Just to highlight that this phenomenon is not about the router's CPU power or line conditions, I also tested with the download limit raised to 110 Mbit/s and decreased to 60 Mbit/s:

flent 
simple with 110/8 :    79 / 6 / 20
simple with 90/8 :     70 / 7 / 20
simple with 60/8 :     53 / 7 / 20

Increasing the download speed limit by 20 Mbit/s increased the actual tested speed by 9 Mbit/s, and decreasing it by 30 Mbit/s decreased the tested speed by 17 Mbit/s. It almost looks like the system calculates the "true speed limit" as (LIMIT/2)+25 for the HTB-based strategies. Strange. Hopefully things improve with newer kernels.
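
For what it's worth, that rough rule matches the three measurements above quite closely:

(110 / 2) + 25 = 80   (measured 79)
( 90 / 2) + 25 = 70   (measured 70)
( 60 / 2) + 25 = 55   (measured 53)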

tohojo commented 7 years ago

Awesome, thanks for testing! So it seems we are winning with this update, then... Though the HTB behaviour does seem buggy as you say.

I'll push an update to the sqm-scripts package to get this some wider testing :)

EricLuehrsen commented 7 years ago

@hnyman did you also try the UCI option shaper_burst 1? This allows HTB to hop over interrupt gaps with 1 [ms] worth of data. HTB is rather strict about metering, and this gives it an overflow bucket.
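
For reference, enabling it on a running system would presumably look something like this (option name as above; the queue index is just an example):

uci set sqm.@queue[0].shaper_burst='1'
uci commit sqm
/etc/init.d/sqm restart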

@tohojo I have discovered a minor design limitation in simple/simplest. The top-class shaper has 'rate' and 'ceil' equal. No closed-loop controller likes that. If we intend to control to a target, then we must allow margins on both sides of the target. I suggest that for any shaper, the hard limit on the top class be 2% more than the target DL or UL rate (examples: hfsc_lite, hfsc_litest, and simple_home).

tohojo commented 7 years ago

(Think this discussion belongs here, so repeating my last post to the pull request)

As far as the ceil/rate settings go, ceil seems to govern both the max burst rate and the borrowing from other classes. But it is possible to set burst without setting ceil, which would render the former meaningless? And there's also cburst. So it's not at all clear to me what the right thing to do is... :/
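
For the record, these are the four knobs on a single HTB class; the example roughly follows the 2% suggestion above, with purely illustrative numbers:

# rate   - guaranteed rate for the class
# ceil   - maximum rate the class may borrow up to
# burst  - bytes that may be sent at ceil speed in excess of rate
# cburst - bytes that may be sent at line rate in excess of ceil
tc class change dev ifb4eth0 parent 1: classid 1:1 htb rate 90Mbit ceil 92Mbit burst 11k cburst 11k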

hnyman commented 7 years ago

did you also try UCI option shaper_burst 1.

I just got back from vacation and tested this option. It makes a clear improvement, at least with the Ookla speedtest. I took the changes by @EricLuehrsen from https://github.com/tohojo/sqm-scripts/commit/5bc27cc758407ba42fb565f23a0cb144b465127d and applied them to a live LEDE system.

That change had a huge impact on simple.qos performance in my R7800. With speed limits 85/10, the test results jumped from 73.6/9.7 to 80.0/9.6 bringing simple.qos much closer to the cake performance.

Using the same 90/8 limit that I used three weeks ago, I now get 84.2/7.8 instead of 80.0/7.8, so it looks good.

I will test with flent, but so far it looks like @EricLuehrsen spotted a really good change.

EricLuehrsen commented 7 years ago

Yeah, I have been digging into it more. The current proposal is just empirical and conservative... and some of it is actually in the man page. Engineers never read the directions. Looking further at how HTB actually works under the hood, it seems we can tweak things to do better. I'll try to put something together.

As a controls design, I like cake/hfsc in that they are time based. HTB/TBF are a problem in that they are period based, so you need to know and assume a constant period.

dtaht commented 7 years ago

@hnyman - the core question to me is not the improved bandwidth but the induced latency at that bandwidth?

hnyman commented 7 years ago

@dtaht The change has no visible effect on latency. Latency was stable before and remained stable after this change.

For me the core question is the apparent throttling miscalculation in HTB on my dual-core system, which renders the usual advice of "set the limit to 90% of your bandwidth" worthless for the default "simple" script.

hnyman commented 7 years ago

I tested the change from 5bc27cc758407b with flent, and things were improved according to flent as well:

settings:               download / upload / latency:

htbburst1 simple 90/8 :   78 / 6.5 / 20
htbburst1 simple 110/8:   96 / 6.5 / 21

simple 90/8 :             70 / 6.5 / 20
simple 110/8 :            78 / 6.5 / 20

Latency is roughly similar in all four cases.

With the new htb_burst 1 option, the measured download speed is much closer to the set limit. Still not at cake's level, but much closer.

And with htb burst 1, increasing the speed limit by 20 Mbit/s increased the measured speed by 16 Mbit/s, which is much more logical than the 8 Mbit/s change without that option.

so far the htb_burst 1 option looks good :-)

tohojo commented 7 years ago

Right, pushed an update to the sqm-scripts package for LEDE (including for 17.01). Should be in the next nightly rebuild...

dtaht commented 7 years ago

Groovy. Now tohojo needs to redo two papers with this setting.

/me hides

hnyman commented 7 years ago

Right, pushed an update to the sqm-scripts package for LEDE (including for 17.01). Should be in the next nightly rebuild...

I think that your commit https://github.com/openwrt/packages/commit/a84d421b18ff3811a97a2eec33075cb04e2bdd34 will break SQM for OpenWrt, as kmod-sched-cake has not been added to OpenWrt. Cake exists only in the LEDE repo. Also, "tc" is unpatched in OpenWrt... The packages feed is common to both.

The LEDE/OpenWrt separation is a headache for package maintenance :-(

tohojo commented 7 years ago

Well, better get kmod-sched-cake into openwrt as well, then? ;)

hnyman commented 7 years ago

Well, better get kmod-sched-cake into openwrt as well, then? ;)

That is a hint for @kdarbyshirebryant, right? (as he is the maintainer of kmod-sched-cake in the LEDE repo)

OpenWrt development is pretty much dead, which is bad in itself. But we have still tried to keep the LuCI and packages feed repos compatible with both LEDE and OpenWrt.

tohojo commented 7 years ago

Well, as far as I'm concerned, if the project is dead, bitrot is expected behaviour and not something I'm inclined to go out of my way to avoid...

EricLuehrsen commented 7 years ago

HTB needs, I mean needs, the burst parameter configured to ceil applied over a standard timer interrupt interval (1 ms). It's in the man page, for an i386 with 10 ms time slices and 10BASE-T Ethernet. Otherwise it cannot borrow against RATE-derived tokens (cash-only sales) and struggles to achieve ceil. ceil has cburst up to line-rate/infinity, which should be at least an MTU and maybe 1/2 burst, just so that again token-debt calculations don't round down to underachieving. So the most you buffer is one regular OS cycle, which is the minimum possible delay anyway.

This is where I say this is a defect in HTB for modern times. No one (should) write closed-loop controls that presume a fixed period; that was a thing of the long, long past. Although a buffer should be a fixed size to prevent delay (not grow on an arbitrarily long interrupt), HTB should ask the OS or take long-term averages. burst/cburst are really fixed/necessary values.

hfsc/cake calculate their rates in real-ish time, not over a presumed period. Also, their delay parameters are in units of time, so they can back-calculate the amount of feed-forward packet delivery for credit operations. However, it appears both hfsc/cake allow for arbitrary interrupts to estimate the burst, which in turn causes delay/bloat on weak hardware trying to keep pace with ISP upgrades.

EricLuehrsen commented 7 years ago

Yes. That means the current burst calculation I put in a while ago is suboptimal. But it's conservative, and like I said, engineers don't read directions... :-)

moeller0 commented 7 years ago

Hi Eric,


But it is this borrowing against ceil that introduces unwanted delay under load increase. I believe that HTB actually uses default burrts and cburst parameters close to a full MTU if not specified explicitly, THe big question is why is nominally faster hardware struggling hard with our minimal burts, while much older hardware like the wndr3700v2/3800 work pretty well. I am not disputing that adding a bigger busts might be a good thing if it allows better performance for more hardware at a bounded small latency under load increase, all I want to point out is that it does not seem to be as clear cut as you seem to argue at the moment. Or put differently, why do you believe this issue did not materialize as noticeably with the older generation of routers?

Best Regards Sebastian


tohojo commented 7 years ago

Higher base bandwidth? An MTU-sized burst at 100 Mbps is not much...

-Toke

EricLuehrsen commented 7 years ago

Yes, that. 12 Mbps can move only 1500 B (one MTU) in 1 ms, and this burst works until the time to process event interrupts is more than BURST/RATE. At 120 Mbps you would need to feed the interface every 100 [us]... good luck keeping up with that.
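
Spelling out the arithmetic, with the burst sized to cover a 1 [ms] gap (burst_bytes = rate / 8 / 1000):

 12 Mbit/s  ->  1500 bytes (about one MTU)
 90 Mbit/s  -> 11250 bytes
120 Mbit/s  -> 15000 bytes

and conversely, at 120 Mbit/s a single 1500-byte packet covers only 1500 * 8 / 120e6 = 100 [us] of transmit time.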

My TL WDR3600 was happy at 30Mbps or about 0.4 [ms], but it fell on its face when trying to keep up at 50+Mbps or 0.2 [ms].

moeller0 commented 7 years ago

Hi Eric,

This matches data from my wndr3700v2 that I was looking at just now (not 100%, but qualitatively). As I understand it, the time required to transfer burst bytes is the time by which latency under load will increase. So maybe exposing the burst duration in the GUI would work in an intuitive way, so users can make an informed decision about how much added latency they are willing to trade for more available bandwidth?

Best Regards


EricLuehrsen commented 7 years ago

That isn't likely necessary. The regular timer interrupt period is 1 ms, so just calculate based on that. If the hardware cannot keep up, then it cannot, period. According to the US Securities and Exchange Commission (SEC), when an exchange proposed a 350 [us] speed bump to deal with inappropriate high-frequency trading behaviours, 1 [ms] was considered trivial... and that's for real money.

dtaht commented 7 years ago

HTB has gotten a lot of fixes to its architecture since Linux 3.3 that are not reflected in the man page. If you do a git log net/sched/sch_htb.c you will see them.

See for example commit: a9efad8b24bd22616f6c749a6c029957dc76542b

1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
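
For reference, in a local clone of the stable tree that history (and the commit above) can be inspected with:

git log --oneline v3.3..v4.8 -- net/sched/sch_htb.c
git show a9efad8b24bd22616f6c749a6c029957dc76542b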

Certainly it has not had 1 ms resolution for a good long time; I think we improved that way back in the 3.4 or 3.6 era. It was also badly borked for GRO around that time. HTB as-is is heavily used for traffic management inside of Google/Facebook/etc., so it has seen a lot of love.

As for the ceil/burst thing - damned if I know the right thing, everybody's showing an improvement, I agree with your theoretical explanation and the physical evaluation - ship it! move on.

(and yes, I prefer cake's algorithm. token buckets made sense when done in hardware, not in software. token bucket is so 1998, as is the man page)

dtaht commented 7 years ago

as for the "strangeness" (is that still an issue?), I would be curious what clock sources exist on that platform, and what happens if you turn off cpu clock management (cpu_freq performance mode).

dtaht commented 7 years ago

@hnyman Are you golden? Can I close this?