lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0
281 stars 24 forks source link

Multi WAN setup #161

Closed Quba1 closed 1 year ago

Quba1 commented 1 year ago

I am opening this issue to share my results of using cake-autorate with mwan3, as according to @moeller0 it is not yet well tested. And to maybe look for improvements.

Unfortunately I cannot generate graphs following instructions from README because maintain_log_file_pid file does not exist.

root@OpenWrt:~# kill -USR1 $(cat /var/run/cake-autorate/*/maintain_log_file_pid)
cat: can't open '/var/run/cake-autorate/*/maintain_log_file_pid': No such file or directory
ash: you need to specify whom to kill
rany2 commented 1 year ago

Also are you sure you're using the latest version from master (stable doesn't support instances).

Edit: never mind, you're the same user as https://github.com/lynxthecat/cake-autorate/issues/156... so I presume you did

rany2 commented 1 year ago

@lynxthecat you can't do that anymore with proc-man

rany2 commented 1 year ago

The equivalent command to the one you provided would now be: kill -USR1 $(cat /var/run/cake-autorate/*/proc_state | grep -E '^maintain_log_file=' | cut -d= -f2)

rany2 commented 1 year ago

@lynxthecat test the command I provided, if it works update the README to that...

lynxthecat commented 1 year ago

Ah! Thanks @rany2.

rany2 commented 1 year ago

Most welcome @lynxthecat

rany2 commented 1 year ago

@lynxthecat P.S. if you're going to update the README, please use kill -USR1 $(awk -F= '/^maintain_log_file=/ {print $2}' /var/run/cake-autorate/*/proc_state)

I hate unnecessarily piped commands, above is more concise... sorry about the above grep pipe mess; very ugly :')

moeller0 commented 1 year ago

How about writing this beauty into a script called export_log_for_this_instance.sh so users simply need to call /var/run/cake-autorate/primary/export_log_for_this_instance.sh?

rany2 commented 1 year ago

@moeller0 sure, I'll make a PR for that. Makes sense UX wise.

lynxthecat commented 1 year ago

I also thought a script would be a good idea but I didn't think of placing inside run directory for each instance. Really nice idea!

moeller0 commented 1 year ago

One thing that would be nice, is if the script would "block" until the file was written. At the moment (with version 1.1) I use:

ssh root@192.168.42.1 'kill -USR1 $( cat /var/run/cake-autorate/*/maintain_log_file_pid )'  && sleep 30 && scp  root@192.168.42.1:/var/log/cake-autorate*.log.gz ./SCRATCH/ && ssh root@192.168.42.1 'rm /var/log/cake-autorate*.log.gz'

where the sleep 30 empirically gives enugh time for the log to be written. If export_log_for_this_instance.sh would only terminate once the file was written, I could get rid of that arbitrary sleep.

rany2 commented 1 year ago

@moeller0 I'll make sure to add that

Quba1 commented 1 year ago

Thank you guys, for your suggestions. With your help I was able to extract log files and generate plots. But the issue now is that for secondary instance there's much less plots.

octave -qf --eval 'fn_parse_autorate_log("./cake-autorate.primary_2023_02_17_09_33_30.log.gz", "./primary.png")'

primary deltaCDFs primary rawCDFs primary timecourse

octave -qf --eval 'fn_parse_autorate_log("./cake-autorate.secondary_2023_02_17_09_33_30.log.gz", "./secondary.png")'

secondary timecourse sample_0001_to_5356

Looking into extracted log files it seems that secondary instance is not printing DATA statements.

$ head cake-autorate.primary_2023_02_17_09_33_30.log
DATA_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; DL_ACHIEVED_RATE_KBPS; UL_ACHIEVED_RATE_KBPS; DL_LOAD_PERCENT; UL_LOAD_PERCENT; RTT_TIMESTAMP; REFLECTOR; SEQUENCE; DL_OWD_BASELINE; DL_OWD_US; DL_OWD_DELTA_EWMA_US; DL_OWD_DELTA_US; DL_ADJ_DELAY_THR; UL_OWD_BASELINE; UL_OWD_US; UL_OWD_DELTA_EWMA_US; UL_OWD_DELTA_US; UL_ADJ_DELAY_THR; SUM_DL_DELAYS; SUM_UL_DELAYS; DL_LOAD_CONDITION; UL_LOAD_CONDITION; CAKE_DL_RATE_KBPS; CAKE_UL_RATE_KBPS
LOAD_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; DL_ACHIEVED_RATE_KBPS; UL_ACHIEVED_RATE_KBPS; CAKE_DL_RATE_KBPS; CAKE_UL_RATE_KBPS
REFLECTOR_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; REFLECTOR; DL_MIN_BASELINE_US; DL_BASELINE_US; DL_BASELINE_DELTA_US; DL_BASELINE_DELTA_THR_US; DL_MIN_DELTA_EWMA_US; DL_DELTA_EWMA_US; DL_DELTA_EWMA_DELTA_US; DL_DELTA_EWMA_DELTA_THR; UL_MIN_BASELINE_US; UL_BASELINE_US; UL_BASELINE_DELTA_US; UL_BASELINE_DELTA_THR_US; UL_MIN_DELTA_EWMA_US; UL_DELTA_EWMA_US; UL_DELTA_EWMA_DELTA_US; UL_DELTA_EWMA_DELTA_THR
DEBUG; 2023-02-17-09:16:44; 1676625404.591062; Starting: rotate_log_file with PID: 5010
DATA; 2023-02-17-09:16:44; 1676625404.625222; 1676625404.622669; 22; 110; 0; 0; 1676625404.588150; 156.154.70.1; 71367; 25562; 28050; 4199; 2488; 31029; 25562; 28050; 4199; 2488; 31029; 0; 0; dl_idle; ul_idle; 12000; 12000
DEBUG; 2023-02-17-09:16:44; 1676625404.632806; Starting: print_headers with PID: 5010
DATA; 2023-02-17-09:16:44; 1676625404.660458; 1676625404.658500; 22; 110; 0; 0; 1676625404.620380; 8.8.4.4; 71367; 11911; 15900; 4631; 3988; 31029; 11911; 15900; 4631; 3988; 31029; 0; 0; dl_idle; ul_idle; 12000; 12000
LOAD; 2023-02-17-09:16:44; 1676625404.692696; 1676625404.691293; 26; 21; 12000; 12000
DATA; 2023-02-17-09:16:44; 1676625404.711459; 1676625404.708853; 26; 21; 0; 0; 1676625404.670110; 208.67.220.2; 71367; 12155; 15150; 4619; 2994; 31029; 12155; 15150; 4619; 2994; 31029; 0; 0; dl_idle; ul_idle; 12000; 12000
DATA; 2023-02-17-09:16:44; 1676625404.771641; 1676625404.769475; 26; 21; 0; 0; 1676625404.740450; 94.140.14.140; 71367; 17061; 20300; 3839; 3239; 31029; 17061; 20300; 3839; 3239; 31029; 0; 0; dl_idle; ul_idle; 12000; 12000
$ head cake-autorate.secondary_2023_02_17_09_33_30.log
DATA_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; DL_ACHIEVED_RATE_KBPS; UL_ACHIEVED_RATE_KBPS; DL_LOAD_PERCENT; UL_LOAD_PERCENT; RTT_TIMESTAMP; REFLECTOR; SEQUENCE; DL_OWD_BASELINE; DL_OWD_US; DL_OWD_DELTA_EWMA_US; DL_OWD_DELTA_US; DL_ADJ_DELAY_THR; UL_OWD_BASELINE; UL_OWD_US; UL_OWD_DELTA_EWMA_US; UL_OWD_DELTA_US; UL_ADJ_DELAY_THR; SUM_DL_DELAYS; SUM_UL_DELAYS; DL_LOAD_CONDITION; UL_LOAD_CONDITION; CAKE_DL_RATE_KBPS; CAKE_UL_RATE_KBPS
LOAD_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; DL_ACHIEVED_RATE_KBPS; UL_ACHIEVED_RATE_KBPS; CAKE_DL_RATE_KBPS; CAKE_UL_RATE_KBPS
REFLECTOR_HEADER; LOG_DATETIME; LOG_TIMESTAMP; PROC_TIME_US; REFLECTOR; DL_MIN_BASELINE_US; DL_BASELINE_US; DL_BASELINE_DELTA_US; DL_BASELINE_DELTA_THR_US; DL_MIN_DELTA_EWMA_US; DL_DELTA_EWMA_US; DL_DELTA_EWMA_DELTA_US; DL_DELTA_EWMA_DELTA_THR; UL_MIN_BASELINE_US; UL_BASELINE_US; UL_BASELINE_DELTA_US; UL_BASELINE_DELTA_THR_US; UL_MIN_DELTA_EWMA_US; UL_DELTA_EWMA_US; UL_DELTA_EWMA_DELTA_US; UL_DELTA_EWMA_DELTA_THR
DEBUG; 2023-02-17-09:14:35; 1676625275.281440; Starting: rotate_log_file with PID: 4987
DEBUG; 2023-02-17-09:14:35; 1676625275.294303; Starting: print_headers with PID: 4987
LOAD; 2023-02-17-09:14:35; 1676625275.463187; 1676625275.461477; 164; 22; 12000; 12000
LOAD; 2023-02-17-09:14:35; 1676625275.670894; 1676625275.669904; 12; 5; 12000; 12000
LOAD; 2023-02-17-09:14:35; 1676625275.878370; 1676625275.877441; 0; 0; 12000; 12000
LOAD; 2023-02-17-09:14:36; 1676625276.091130; 1676625276.089822; 0; 0; 12000; 12000
LOAD; 2023-02-17-09:14:36; 1676625276.299845; 1676625276.298780; 0; 0; 12000; 12000

But my config files for both instances are identical:

root@OpenWrt:~/cake-autorate# cat cake-autorate_config.primary.sh
#!/bin/bash

# *** STANDARD CONFIGURATION OPTIONS ***

### For multihomed setups, it is the responsibility of the user to ensure that the probes
### sent by this instance of cake-autorate actually travel through these interfaces.
### See ping_extra_args and ping_prefix_string

dl_if=ifb4eth0 # download interface
ul_if=eth0     # upload interface

# Set either of the below to 0 to adjust one direction only
# or alternatively set both to 0 to simply use cake-autorate to monitor a connection
adjust_dl_shaper_rate=1 # enable (1) or disable (0) actually changing the dl shaper rate
adjust_ul_shaper_rate=1 # enable (1) or disable (0) actually changing the ul shaper rate

min_dl_shaper_rate_kbps=3000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=12000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=60000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=3000  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=12000 # steady state bandwidth for upload (KBit/s)
max_ul_shaper_rate_kbps=40000  # maximum bandwidth for upload (Kbit/s)

# *** OVERRIDES ***

### See cake-autorate_defaults.sh for additional configuration options
### that can be set in this configuration file to override the defaults.
### Place any such overrides below this line.

ping_prefix_string="mwan3 use wan exec"

# *** DO NOT EDIT BELOW THIS LINE ***

config_file_check="cake-autorate"
root@OpenWrt:~/cake-autorate# cat cake-autorate_config.secondary.sh
#!/bin/bash

# *** STANDARD CONFIGURATION OPTIONS ***

### For multihomed setups, it is the responsibility of the user to ensure that the probes
### sent by this instance of cake-autorate actually travel through these interfaces.
### See ping_extra_args and ping_prefix_string

dl_if=ifb4eth4 # download interface
ul_if=eth4     # upload interface

# Set either of the below to 0 to adjust one direction only
# or alternatively set both to 0 to simply use cake-autorate to monitor a connection
adjust_dl_shaper_rate=1 # enable (1) or disable (0) actually changing the dl shaper rate
adjust_ul_shaper_rate=1 # enable (1) or disable (0) actually changing the ul shaper rate

min_dl_shaper_rate_kbps=3000  # minimum bandwidth for download (Kbit/s)
base_dl_shaper_rate_kbps=12000 # steady state bandwidth for download (Kbit/s)
max_dl_shaper_rate_kbps=100000  # maximum bandwidth for download (Kbit/s)

min_ul_shaper_rate_kbps=3000  # minimum bandwidth for upload (Kbit/s)
base_ul_shaper_rate_kbps=12000 # steady state bandwidth for upload (KBit/s)
max_ul_shaper_rate_kbps=50000  # maximum bandwidth for upload (Kbit/s)

# *** OVERRIDES ***

### See cake-autorate_defaults.sh for additional configuration options
### that can be set in this configuration file to override the defaults.
### Place any such overrides below this line.

ping_prefix_string="mwan3 use wanb exec"

# *** DO NOT EDIT BELOW THIS LINE ***

config_file_check="cake-autorate"
moeller0 commented 1 year ago

You need to add additional lines from the default configs to your individual config files (not sure that the split in cake-autorate_defaults.sh and overrides in additional config files is as intuitive as we hoped for):

### In multi-homed setups, it is mandatory to use either ping_extra_args
### or ping_prefix_string to direct the pings through $dl_if and $ul_if.
### No universal recommendation exists, because there are multiple
### policy-routing packages available (e.g. vpn-policy-routing and mwan3).
### Typically they either react to a firewall mark set on the pings, or
### provide a convenient wrapper.
###
### In a traditional single-homed setup, there is usually no need for these parameters.
###
### These arguments can also be used for any other purpose - e.g. for setting a
### particular QoS mark.

# extra arguments for ping or fping
# e.g., here is how you can set the correct outgoing interface and
# the firewall mark for ping:
# ping_extra_args="-I wwan0 -m $((0x300))"
# Unfortunately, fping does not offer a command line switch to set
# the firewall mark.
# WARNING: no error checking so use at own risk!
ping_extra_args=""

# a wrapper for ping binary - used as a prefix for the real command
# e.g., when using mwan3, it is recommended to set it like this:
# ping_prefix_string="mwan3 use gpon exec"
# WARNING: the wrapper must exec ping as the final step, not run it as a subprocess.
# Running ping or fping as a subprocess will lead to problems stopping it.
# WARNING: no error checking - so use at own risk!
ping_prefix_string=""

you probably need to set ping_prefix_string appropriately in each of your two configuration files...

Quba1 commented 1 year ago

I have those added:

ping_prefix_string="mwan3 use wan exec"

ping_prefix_string="mwan3 use wanb exec"

Running cake-autorate manually for each of those config files works fine, and DATA statements are printed in each case. Problem with logs is when running cake-autorate as a service.

Do you think that lack of DATA statements indicate there's an issue with running cake-autorate for wanb?

moeller0 commented 1 year ago

I never ran autorate as service, so I am not sure how/why....

Quba1 commented 1 year ago

Looking at the code service file just runs cake-autorate_launcher.sh which starts two instances of cake-autorate

for cake_instance in "${cake_instances[@]}"
do
    /root/cake-autorate/cake-autorate.sh "${cake_instance}" &
    cake_instance_pids+=(${!})
done

So, in theory it should work just fine.

Edit: I have restarted the service and rebooted router and both instances are printing DATA now. I have no idea what has changed...

lynxthecat commented 1 year ago

@Quba1 you are the first person to document your experiences with testing this multiple instance functionality that we built into cake-autorate, so consider yourself a pioneer of sorts!

Is it working OK? Can you please try to stress test it a little by starting and stopping the service and making sure that following 'stop' that the output of 'ps | grep -e bash -e fping' shows no leftover processes, and that both instances show correct log entries in their respective logs following a few such starts and stops?

In your case is only one interface active at a given time or are both interfaces used at the same time?

Does the cake bandwdith seem to be getting properly set for both interfaces in dependence upon load and latency?

Also I am curious about what the CPU load is like with running two cake-autorate instances at once?

Quba1 commented 1 year ago

@lynxthecat That's good to know. I will be able to run stress tests later today/tommorow.

For now I was able to generate plots for both instances. I have mwan3 set to balanced so traffic distribution should be approximately equal between two interfaces. Hopefully, the plots can give you more info, as I am not entirely sure how to interpret them.

One thing to note about those interfaces is that secondary is the "better" LTE: it always had better default speeds and buffer bloats.

Timecourse (top - primary, bottom - secondary):

primary timecourse

secondary timecourse

Delta CDFs (top - primary, bottom - secondary):

primary deltaCDFs

secondary deltaCDFs

Raw CDFs (top - primary, bottom - secondary):

primary rawCDFs

secondary rawCDFs

moeller0 commented 1 year ago

Your raw CDFs from the primary look a bit odd with this "kink" up to ~20%, that might indicate that the delay thresholds are higher than they would need to be...

lynxthecat commented 1 year ago

It is really cool to see mwan3 load balancing and cake with cake-autorate. So is this a dual LTE setup? From the plots your 'primary' connection seems stronger to me. Latency in the 'secondary' connection seems not so well under control. Or am I missing something?

Perhaps try slightly reducing the delay thresholds from the default '30' to '25' for either or both cake-autorate instances?

I suppose mwan3 is not sophisticated enough to balance based on latency? How to balance load across your connections in your dual wan setup now seems to me an interesting and challenging problem. I read a report from an individual who works with two LTE connections, but not using mwan3. He uses one for latency sensitive applications like VOIP/Teams/Zoom and the other for everything else. So he deliberately avoids saturating the first and doesn't care so much about saturating the second. That's one rather crude way of tackling the problem, but I do not think it facilitates getting the most out of both connections.

rany2 commented 1 year ago

I suppose mwan3 is not sophisticated enough to balance based on latency

I use mwan3 but only for emergency LTE fallback. Basically, it has the ability to avoid a WAN connection if the latency is too high or the ping packets are getting lost but it won't optimize for low latency.

lynxthecat commented 1 year ago

Interesting. So can you have cake-autorate on standby on your LTE so that when the fallback kicks in its latency is controlled?

rany2 commented 1 year ago

@lynxthecat Well, I don't have cake-autorate setup on it. I have an XPON connection so I've only had to fallback to LTE once in the past year due to maintenance work

lynxthecat commented 1 year ago

https://openwrt.org/docs/guide-user/network/wan/multiwan/mwan3#architecture_of_mwan3

https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3track

Could a hotplug handling measure be implemented to instantiate cake-autorate on the fallback? So, for example, XPON goes down, mwan3 triggers hotplug handling measure that switches over to LTE and launches cake-autorate, and similarly when XPON goes back up the cake-autorate instance is killed? Or alternatively cake-autorate sits there idling on the LTE in a stall condition, waiting for load or ping responses, and then kicks in and back to stall.

rany2 commented 1 year ago

@lynxthecat Not keen on setting something like this up for the unlikely case where this happens, the LTE connection has a fixed bw limit of 8Mb down and 1Mb up. I'm only going to have a use for it once every year and for an hour at most. So it's a meh situation.

patrakov commented 1 year ago

Could a hotplug handling measure be implemented to instantiate cake-autorate on the fallback?

This is generally not needed, if there is enough CPU. Just have it run always, 24 hours a day. Most of the time, it will sleep and not consume any traffic.

lynxthecat commented 1 year ago

I think this can be closed now. If anyone disagrees I will reopen.