Clock setup code pulls in almost 1K of u64-math builtins

BryanKadzban commented 4 months ago

I'm trying to write a bootloader, meaning the output code needs to be as small as possible. In looking at the big symbols being linked into the output binary, I see that compiler_builtins::int::specialized_div_rem::u64_div_rem::[hash] is almost 800 bytes, and it's only called from a couple builtins (__aeabi_uldivmod calls __udivmoddi4 which calls it). Those builtins are 24 and 22 bytes, for a total of 842. This is uncomfortably close to 1K. There is only one bigger chunk, in all the core::fmt support code, totaling over 4K -- removing that (hopefully defmt is better) is a bigger change.

The only thing in the bootloader that calls __aeabi_uldivmod is the clock setup code. It does it in three places; two which refer to this chunk of source from "cargo objdump --release -- -d --demangle --source":

sysclk = base_clk as u64 * self.plln as u64 / self.pllm as u64

and one that refers to this one:

let f_vco_clock = (f_pll_clock_input as u64 * n as u64 / m as u64) as u32;

Perhaps these divide-by-m options don't need to be u64? Or perhaps it'd be possible to open-code the 64-bit division, or perhaps do them with a shift instead of a divide. (I doubt that, since I think m can be values other than powers-of-2, but perhaps.) Or maybe represent the frequencies in kHz instead of Hz, if we're only a couple bits over the 32-bit u32 range. Or maybe do the division first (still as a u32), to avoid overflow.

I'll be looking into this, but it'd be good to know if others are thinking about it already as well.

BryanKadzban commented 4 months ago

...Or, an even better idea. The maximum numerator of the entire operation fits into 37 bits (...well, requires 38 bits, but values that big can't possibly pass the <=216MHz assert down farther), so we could fit this into a fixed-point calculation with 27 bits after the decimal place. That would let us calculate 1/m and 1/p and 1/q, each shifted left by 27 bits, in a u32. Then multiplying by this and shifting right by 27 bits will accomplish the division, without having to use the software 64-bit-divide implementation. But we keep all the important operations in a u64 space.

This does drop the calls to u64_div_rem, and saves almost 800 bytes total (we offset the savings from u64_div_rem a little, with the fixed-point shifts). Sending a PR.

BryanKadzban commented 4 months ago

OK, that's less good than I was thinking.

Doing fixed point loses too much precision. The code ends up calculating final frequencies that are up to 70Hz off from what they should be (432MHz turns into 432MHz+70Hz... that's a factor of 1.6e-7, or one bit out of 23ish). 432MHz requires 29 bits of fidelity.

Any better ideas? I'll keep thinking too. Maybe there's a way to get >29 bits after the decimal point without losing too many bits before it (and overflowing u64).

BryanKadzban commented 4 months ago

...This might work. The bottom 6 bits of 1_000_000 are all zero, so any integer MHz value can be shifted right by up to 6 bits without losing information. Setting the fractional part of the fixed-point math to 30 bits, and doing a >>4 on the input frequency to get it to fit inside the remaining 34 bits, would work, as long as people use multiples of 250kHz for their frequencies.

Setting the fractional part to 30 bits makes all the tests pass, but I still need to fix the potential overflow issue.

BryanKadzban commented 4 months ago

OK, updating the PR. I think this will work -- at least it passes all the existing tests. I added rounding as well, instead of truncating, when calculating the 1/x values.

stm32-rs / stm32f7xx-hal

Clock setup code pulls in almost 1K of u64-math builtins #211