lynxthecat / cake-autorate

Eliminate the excess latency and jitter terrorizing your 4G, 5G, LTE, Starlink or other variable rate connection!
https://www.bufferbloat.net
GNU General Public License v2.0

Using timeprofile to profile cake-autorate #191

Closed lynxthecat closed 10 months ago

lynxthecat commented 1 year ago

I tried @jkool702's nifty timeprofile tool to profile cake-autorate.

I encountered a few issues in doing so - see here https://github.com/jkool702/timeprofile/issues/1, but nevertheless was able to generate this timeprofile file: https://easyupload.io/1tla7w.

@jkool702 does the timeprofile I generated reveal to you any particular aspects of cake-autorate that would benefit from optimisation?

From scrolling through the lines of the timeprofile it looks to me like this line:

https://github.com/lynxthecat/cake-autorate/blob/15de9aa4e0cb8dd7f6f5efadcbc227076f96fb25/cake-autorate.sh#L1953

eats up a significant number of CPU cycles.

Can you see a way to speed that up without switching over to fixed byte reads using read -N?

I know that, as we discussed on reddit, switching over to fixed-length reads might help, but I'm a bit reluctant to go that route for now. In general, deciding on one length for all messages is a little awkward. Whilst some of the messages passed from one process to another are lengthy, there are also much smaller messages, and it is difficult to decide upon an upper limit. Perhaps the message length could be a user-defined variable, and cake-autorate could error out on any truncation associated with, say, the use of large domain names for reflectors.
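To make the idea concrete, here is a minimal sketch of what fixed-length framing with truncation detection could look like. All names here (`MSG_LEN`, `send_msg`, `recv_msg`) are invented for illustration; this is not cake-autorate's actual IPC code:

```shell
#!/usr/bin/env bash

# Hypothetical fixed message length; could be a user-defined variable.
MSG_LEN=256

# Sender: pad the message to exactly MSG_LEN bytes, refusing oversized
# messages (e.g. very long reflector domain names) instead of truncating.
send_msg() {
	local msg=$1
	if (( ${#msg} >= MSG_LEN )); then
		printf 'message exceeds %d bytes\n' "${MSG_LEN}" >&2
		return 1
	fi
	printf "%-${MSG_LEN}s" "${msg}"
}

# Receiver: read exactly MSG_LEN bytes (no per-character delimiter
# scanning), then strip the trailing space padding. Note this assumes
# messages never end in a meaningful trailing space.
recv_msg() {
	local raw
	read -r -N "${MSG_LEN}" raw || return 1
	printf '%s\n' "${raw%"${raw##*[! ]}"}"
}
```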

rany2 commented 1 year ago

I seriously doubt this. It probably only shows that it eats a lot of CPU cycles because it is long lived and regularly called.

lynxthecat commented 1 year ago

I think timeprofile is clever enough to actually list lines of bash code in the order they are called, together with the execution time associated with each. Since bash is single-threaded, doesn't this give an indication of the processing time associated with each line? See e.g. this excerpt from the compressed timeprofile linked above:

(0.001053): 147 -- 'printf '\''%s; %(%F-%H:%M:%S)T; %s; %s\n'\'' "${type}" -1 "${log_timestamp}" "${msg}"'
(0.000468): 2037 -- '(( enable_sleep_function ))'
(0.000305): 2038 -- '[[ ${load_condition[dl]} == *idle* && ${load_condition[ul]} == *idle* ]]'
(0.000283): 2039 -- '((t_sustained_connection_idle_us += (${EPOCHREALTIME/./}-t_end_us) ))'
(0.000234): 2040 -- '((t_sustained_connection_idle_us > sustained_idle_sleep_thr_us))'
(0.000256): 2082 -- 't_end_us="${EPOCHREALTIME/./}"'
(0.000206): 1946 -- 'true'
(0.000251): 1948 -- 'unset command'
(0.040455): 1949 -- 'read -r -u "${main_fd}" -a command'
(0.000525): 1951 -- '[[ -n "${command-}" ]]'
(0.000483): 1954 -- 'case "${command[0]}" in '
(0.001144): 1958 -- 'read -r timestamp reflector seq dl_owd_baseline_us dl_owd_us dl_owd_delta_ewma_us dl_owd_delta_us ul_owd_baseline_us ul_owd_us ul_owd_delta_ewma_us ul_owd_delta_us <<< "${command[@]:1}"'
(0.000440): 1989 -- 'case "${main_state}" in '
(0.000975): 1993 -- '[[ "${command[0]}" == "REFLECTOR_RESPONSE" && -n "${timestamp-}" && -n "${reflector-}" && -n "${seq-}" && -n "${dl_o…'
(0.000464): 1996 -- 't_start_us=${EPOCHREALTIME/./}'
(0.000389): 1998 -- 'reflectors_last_timestamp_us="${timestamp//[.]}"'
(0.000417): 2000 -- '(( (t_start_us - 10#"${reflectors_last_timestamp_us}")>500000 ))'
(0.000326): 2007 -- '(( dl_delays[delays_idx] ))'
rany2 commented 1 year ago

No it doesn't, especially not for the read commands. They are by definition waiting for input, so it's normal for them to take some time.

lynxthecat commented 1 year ago

OK understood regarding read commands, but what about for the rest? And this read command:

 read -r timestamp reflector seq dl_owd_baseline_us dl_owd_us dl_owd_delta_ewma_us dl_owd_delta_us ul_owd_baseline_us ul_owd_us ul_owd_delta_ewma_us ul_owd_delta_us <<< "${command[@]:1}"

is reading from an array, so that data is immediately available. I think I tried manually assigning the elements based on the values from the array, but I don't think that was any faster.
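For concreteness, the two approaches look like this side by side (a toy sketch with made-up values; cake-autorate's real field list is longer):

```shell
#!/usr/bin/env bash

# Hypothetical message, shaped like a REFLECTOR_RESPONSE (values invented).
command=(REFLECTOR_RESPONSE 1686091058.437920 9.9.9.10 4674 15706)

# Variant 1: re-split the array tail via a here-string, as in the line above.
read -r timestamp reflector seq dl_owd_baseline_us <<< "${command[*]:1}"

# Variant 2: skip read entirely and index the array directly.
timestamp2=${command[1]}
reflector2=${command[2]}
seq2=${command[3]}
dl_owd_baseline_us2=${command[4]}
```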

rany2 commented 1 year ago

Something basic you could try is to define an IFS variable that contains only the delimiters you want (in the case of cake-autorate, just a space). Simply defining IFS once at the start of the script will suffice.
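A minimal illustration of the suggestion (the message text is invented):

```shell
#!/usr/bin/env bash

# Restrict IFS once, near the top of the script. The default is $' \t\n';
# cake-autorate's inter-process messages only use spaces as delimiters.
IFS=" "

# read now splits on spaces only; tab is no longer a field delimiter.
read -r kind value <<< "SET_VAR 42"
```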

rany2 commented 1 year ago

I would expect that to be heavily optimized already (i.e. that it doesn't matter), but maybe it does, IDK.

lynxthecat commented 1 year ago

Ah that's a super simple change (setting IFS to space). Would that avoid some unnecessary processing?

rany2 commented 1 year ago

Of course, but as I said.. I expect it to be heavily optimized so it might not matter.

rany2 commented 1 year ago

So just an assignment like IFS=" " (or whatever you are using as the delimiter; if it's tab then IFS=$'\t', etc.) at the start would work OK

lynxthecat commented 1 year ago

Here is just one 60s run for each using `time timeout -s INT 60 ./cake-autorate.sh`.

without IFS=" ":

real    1m 1.12s
user    0m 9.30s
sys     0m 1.75s

with IFS=" ":

real    1m 1.12s
user    0m 9.45s
sys     0m 1.60s
rany2 commented 1 year ago

Somehow it's worse! :D

lynxthecat commented 1 year ago

Presumably just random variation? Would you still recommend including IFS=" "? If so, where would I put it? In cake-autorate.sh at the top somewhere (not the .lib)?

rany2 commented 1 year ago

@lynxthecat If you want to put it I'd recommend cake-autorate.sh, it doesn't seem like it hurts and would help keep delimiting consistent at least.

rany2 commented 1 year ago

Seeing as you control the input, it seems like `read -r` incurs a performance penalty, observed across 10k, 100k, and 1M-iteration runs. See:

user@ubuntu:~$ time for ((i=0;i<1000000;i++)); do read l <<<l; done 

real    0m5.880s
user    0m3.115s
sys 0m2.763s
user@ubuntu:~$ time for ((i=0;i<1000000;i++)); do read -r l <<<l; done 

real    0m6.021s
user    0m3.204s
sys 0m2.816s

I don't think it's bad to stick to the variant without -r for now, should not be an issue for us really..

rany2 commented 1 year ago

> user@ubuntu:~$ time for ((i=0;i<1000000;i++)); do read -r l <<<l; done

@lynxthecat This increases by 700ms if you assign IFS to a space, not worth it IMO

rany2 commented 1 year ago

> Seeing as you control the input, it seems like `read -r` incurs a performance penalty, observed across 10k, 100k, and 1M-iteration runs. See:

Bash is seriously unpredictable. I think I added -r at the time because I learned it from shellcheck long ago, and I know it actually makes bash do less (no escaping of input), so it doesn't make sense to me that doing less = doing more in bash :')

rany2 commented 1 year ago

> Seeing as you control the input, it seems like `read -r` incurs a performance penalty, observed across 10k, 100k, and 1M-iteration runs. See:
>
> user@ubuntu:~$ time for ((i=0;i<1000000;i++)); do read l <<<l; done
>
> real  0m5.880s
> user  0m3.115s
> sys   0m2.763s
> user@ubuntu:~$ time for ((i=0;i<1000000;i++)); do read -r l <<<l; done
>
> real  0m6.021s
> user  0m3.204s
> sys   0m2.816s
>
> I don't think it's bad to stick to the variant without -r for now, should not be an issue for us really..

Of course this is on an i7-11800H, it's probably massively worse on a crappy MIPS processor

lynxthecat commented 1 year ago

Wait so what's the conclusion here? I set IFS=" " with this commit: https://github.com/lynxthecat/cake-autorate/commit/eb94cc252ec220b0a9edbe1db6e7cb9ed31fda81. Keep or discard?

rany2 commented 1 year ago

Disregard IFS and do not use read -r

rany2 commented 1 year ago

@lynxthecat actually, disregard my comment about IFS; the reason it seemed slower to me is that I did `IFS=" " read ...`, which is slower than assigning it once and then using plain read.
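To illustrate the difference, a rough sketch of the two forms (iteration count and timings are illustrative only):

```shell
#!/usr/bin/env bash

line="a b c"

# Slower: the per-invocation prefix makes bash set up and tear down a
# temporary environment for IFS on every single read.
time for ((i = 0; i < 10000; i++)); do IFS=" " read -r x y z <<< "${line}"; done

# Faster: assign IFS once, then call plain read in the loop.
IFS=" "
time for ((i = 0; i < 10000; i++)); do read -r x y z <<< "${line}"; done
```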

lynxthecat commented 1 year ago
root@OpenWrt-1:/# a="DATA; 2023-06-06-23:37:38; 1686091058.437920; 1686091058.436087; 11; 15; 0; 0; 1686091058.417140; 9.9.9.10; 4674; 15706; 20350; 2491; 4644; 40600; 15706; 20350; 2491; 4644; 40600; 0; 0; dl_idle; ul_idle; 20000; 20000"
root@OpenWrt-1:/# time for((i=80000;i--;)); do read -r -a aa <<< "${a}"; done

real    0m34.624s
user    0m19.438s
sys     0m13.696s
root@OpenWrt-1:/# IFS=" "
root@OpenWrt-1:/# time for((i=80000;i--;)); do read -r -a aa <<< "${a}"; done      

real    0m33.728s
user    0m19.128s
sys     0m14.037s
root@OpenWrt-1:/# time for((i=80000;i--;)); do read -a aa <<< "${a}"; done

real    0m34.053s
user    0m18.961s
sys     0m13.976s
root@OpenWrt-1:/# unset IFS
root@OpenWrt-1:/# time for((i=80000;i--;)); do read -a aa <<< "${a}"; done

real    0m33.860s
user    0m19.595s
sys     0m13.419s

root@OpenWrt-1:/# printf -v b "%-256s" "${a}"
root@OpenWrt-1:/# time for((i=80000;i--;)); do read -N 256 bb <<< "${b}"; bb=(${bb}); done

real    0m18.417s
user    0m15.547s
sys     0m2.167s

So IFS=" " at least does not hurt.

The big gain for large messages is moving to fixed length reads, but I've been resisting that so far.

For small messages the same fixed length actually hurts:

root@OpenWrt-1:/# a="hello bye hello"
root@OpenWrt-1:/# time for((i=80000;i--;)); do read -a aa <<< "${a}"; done

real    0m8.886s
user    0m6.076s
sys     0m2.493s
root@OpenWrt-1:/# printf -v b "%-256s" "${a}"
root@OpenWrt-1:/# time for((i=80000;i--;)); do read -N 256 bb <<< "${b}"; bb=(${bb}); done

real    0m18.425s
user    0m15.615s
sys     0m2.165s
rany2 commented 1 year ago

Maybe it's because you used -a?

rany2 commented 1 year ago

I think it's better to run these tests on a more powerful machine, just so you can do more iterations than just 80k