slothy-optimizer / slothy

Assembly super-optimization via constraint solving
https://slothy-optimizer.github.io/slothy/

add option for `.w` suffix boilerplating of all compressible instructions #83

Open jnk0le opened 3 months ago

jnk0le commented 3 months ago

Related to #61: I have already spotted instructions that are compressible but written with .w in some inputs, seemingly "for no reason".

Certain microarchitectures may suffer performance degradation due to the use of compressed instructions. To avoid this, and the resulting false positives/negatives in benchmarking, all instructions need to be forced into uncompressed form (i.e. boilerplated with the .w suffix). So as not to burden the "naive" writers, this needs to be handled by slothy via a config option.
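To make the request concrete, here is a rough text-level sketch of such a pass. It is not slothy code: `force_wide`, `LEAVE_ALONE` and the regexes are hypothetical names invented for illustration, the skip list is not exhaustive, and the sketch assumes one instruction per line in unified syntax and an assembler that accepts `.w` on the mnemonics it touches.

```python
import re

# Hypothetical, non-exhaustive set of mnemonics this sketch leaves alone:
# 16-bit-only encodings, plus bl, which is never compressible anyway.
LEAVE_ALONE = {"cbz", "cbnz", "bx", "blx", "bl", "svc", "bkpt", "cpsid", "cpsie"}
CONDS = {"eq", "ne", "cs", "hs", "cc", "lo", "mi", "pl", "vs", "vc",
         "hi", "ls", "ge", "lt", "gt", "le", "al"}

# indent, mnemonic, optional dotted suffixes (.w, .n, .f32, ...), rest of line
LINE_RE = re.compile(r"^(\s*)([A-Za-z][A-Za-z0-9]*)((?:\.[A-Za-z0-9]+)*)(\s.*)?$")
IT_RE = re.compile(r"^it[te]{0,3}$")  # it, ite, itt, itte, ...


def force_wide(line):
    """Append .w to a plain mnemonic; return everything else unchanged."""
    m = LINE_RE.match(line)
    if m is None:
        return line  # labels, directives, comments, blank lines
    indent, mnemonic, suffix, rest = m.groups()
    low = mnemonic.lower()
    # Strip a condition code only if what remains is a skipped mnemonic (bxeq -> bx).
    base = low[:-2] if low[-2:] in CONDS and low[:-2] in LEAVE_ALONE else low
    if suffix or base in LEAVE_ALONE or IT_RE.match(low) or low.startswith("v"):
        # Already has a width/type suffix, is skipped, is an IT block header,
        # or is an FP/vector instruction (assumed to be always 32-bit).
        return line
    return f"{indent}{mnemonic}.w{rest or ''}"


print(force_wide("    eor r0, r0, r1"))
print(force_wide("    cbz r0, done"))
```

A label followed by an instruction on the same line is left untouched by this sketch, so real inputs would need a proper parser rather than a regex.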

cortex-m7: For maximum IPC, all instructions need to be uncompressed and one needs to forget about load/store double/multiple (no further penalties after the "normal" stalls). Of course it is possible to compress (and use CISCy instructions) without penalties, but I couldn't figure out the exact pattern, and trial-and-error probing on HW is too much for a superoptimizer.

cortex-m3/4: .w loads need to be aligned (instruction bits) at word boundaries or they will fail to pipeline. (The Schwabe & Stoffelen AES work went for the "all uncompressed" way.)
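The M3/M4 alignment condition can be checked mechanically once instruction sizes are known. A minimal sketch, assuming the input is a list of `(mnemonic, size_in_bytes)` pairs and that "load" means any mnemonic starting with `ldr`; the function name and the input format are made up for illustration:

```python
def misaligned_wide_loads(instrs, base=0):
    """Return (index, offset) for every 32-bit load not starting on a word boundary.

    instrs: list of (mnemonic, size_in_bytes), sizes being 2 or 4.
    base:   byte offset of the first instruction (assumed word-aligned by default).
    """
    offset = base
    flagged = []
    for index, (mnemonic, size) in enumerate(instrs):
        if size == 4 and mnemonic.lower().startswith("ldr") and offset % 4 != 0:
            flagged.append((index, offset))
        offset += size
    return flagged


code = [("adds", 2), ("ldr.w", 4), ("ldr.w", 4), ("nop", 2), ("ldr.w", 4)]
print(misaligned_wide_loads(code))  # [(1, 2), (2, 6)]
```

In the example, the single 16-bit `adds` pushes the next two loads to offsets 2 and 6; the 16-bit `nop` restores word alignment for the last one. With everything uncompressed, every offset is a multiple of 4 and the list is empty.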

dop-amin commented 3 months ago

Hey @jnk0le, thanks for bringing this to our attention! While developing and tuning the model, I also figured that we -- probably -- want .w everywhere. I think there should be an option to enable this.

However, this would require more changes than just in the printing: the scheduling properties of the expanded instruction may be different, and the additional possibilities for register renaming should be taken into account. For example, what is currently modeled as eor_short in the architectural model should be transformed into the 3-operand eor.
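As a purely textual illustration of that transformation (this is not how slothy's architectural model is implemented; `expand_two_operand` and the mnemonic list are hypothetical), the 2-operand form can be rewritten so that destination and first source become separate fields:

```python
import re

# Hypothetical subset of data-processing mnemonics with a 2-operand register form.
TWO_OP_RE = re.compile(
    r"^(\s*)(eors?|ands?|orrs?|adds?|subs?)\s+(\w+)\s*,\s*(\w+)\s*$"
)


def expand_two_operand(line):
    """Rewrite 'op Rd, Rm' as 'op.w Rd, Rd, Rm'; leave other lines unchanged.

    The mnemonic, including any flag-setting 's', is kept as written.
    """
    m = TWO_OP_RE.match(line)
    if m is None:
        return line  # immediates, 3-operand forms, other instructions
    indent, mnemonic, rd, rm = m.groups()
    return f"{indent}{mnemonic}.w {rd}, {rd}, {rm}"


print(expand_two_operand("    eor r0, r1"))  # "    eor.w r0, r0, r1"
```

Once the instruction is in the form `eor.w r0, r0, r1`, the destination is no longer tied to the first source, which is what opens up the extra renaming freedom.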

jnk0le commented 3 months ago

> as the scheduling properties of the expanded instruction may be different,

Didn't spot such behaviour on M4/M7. Only things like the compiler preferring "shifted constant" over encoding T4 (better issuing), or having to choose between uxtb.n r0, r1 and and.w r0, r1, #0xff (better issuing).

That should be a thing on CM33 or CM55 though. (M85 can triple-issue nops and branches, but that's independent of the offending instruction size.)

dop-amin commented 3 months ago

Thanks for your input on that matter!

> Didn't spot such behaviour on M4/M7

I agree; I just did not want to rule out that this case could come up. However, just adding the .w without switching to the 3-operand form of the instruction in slothy is still "a waste", as it limits the register renaming.

On, e.g., M85 this could matter though. From the Software Optimization Guide: "The latency from the shifter source operand is 2, regardless of whether the shift immediate value is non-zero or not." This means that using .w on an instruction whose shift is 0, and which could otherwise be encoded in 16 bits, promotes it to the 32-bit form with the shift immediate set to 0, incurring a latency penalty on the shifted argument. I see you have been running experiments on M85, too; have you been able to observe this?

jnk0le commented 3 months ago

> "The latency from the shifter source operand is 2, regardless of whether the shift immediate value is non-zero or not." This means that using .w on an instruction whose shift is 0, and which could otherwise be encoded in 16 bits, promotes it to the 32-bit form with the shift immediate set to 0, incurring a latency penalty on the shifted argument.

Seems to be the case on chained (3-4+) dependencies; otherwise the stall is somehow folded by the early/late ALU.