slothy-optimizer / slothy

Assembly super-optimization via constraint solving
https://slothy-optimizer.github.io/slothy/

Add Armv7M support for Cortex-M4 and Cortex-M7 #61

Open mkannwischer opened 5 months ago

mkannwischer commented 5 months ago

Continuation of #55

jnk0le commented 4 months ago

Regarding the pipeline models, I have already done some analysis of those. M7 (it was even cited by a paper linked from one of the keccak files): https://github.com/jnk0le/random/tree/master/pipeline%20cycle%20test#cortex-m7 and M4: https://github.com/jnk0le/random/tree/master/pipeline%20cycle%20test#cortex-m3-and-m4

hanno-becker commented 1 month ago

@jnk0le Your document is very interesting and useful, thanks a lot. I've got a question. You write:

e.g. following snippet doesn't stall:
    add.w r0, r2
    eor.w r6, r0, r6, ror #22

I get that the add runs on the "early ALU" so that it can forward to the eor with zero latency. But doesn't the inline shift for the eor also have to run on the early ALU, in the same cycle?

jnk0le commented 1 month ago

r0 is forwarded as a non-shifted operand (i.e. there is no false dependency caused by a skewed operand), so it can be forwarded to the late ALU. Inline-shifted operand2 works similarly to CM85, except that non-shifting instructions are special-cased to not use the shifter.

The inline shift of the second op clobbers the shifter in the "early" stage (aka EX1 in ARM nomenclature), so the first instruction can't be a shifting one.

hanno-becker commented 1 month ago

@jnk0le Thanks -- but doesn't this mean that in EX1 we use the shifter (for the EOR) and the adder (for the ADD) at the same time?

jnk0le commented 1 month ago

Yes, there is one shifter and 2 AGUs, one of which (presumably) can execute add/sub/mov from the older slot.

Regarding "(i.e. there is no false dependency by skewed operand)": an ldr result, however, is subject to this kind of false dependency.

hanno-becker commented 1 month ago

@jnk0le Agreed, it's probably the AGU here. The zero-latency forwarding no longer works with an EOR in place of the ADD, which confirms that.
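
For reference, the forwarding rule discussed in this thread can be written down as a small predicate. This is only a toy sketch of one reading of the comments above, not SLOTHY's API and not a verified Cortex-M7 model. The names `Insn`, `AGU_OPS` and `forwards_without_stall` are hypothetical, and the rule that the forwarded register must not be the inline-shifted operand is inferred from the "non-shifted operand" remark rather than measured.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption from the thread: only these are (presumably) executable on an AGU
# from the older issue slot. Width suffixes such as ".w" are dropped.
AGU_OPS = {"add", "sub", "mov"}


@dataclass(frozen=True)
class Insn:
    """Hypothetical, minimal description of one instruction."""
    mnemonic: str                      # e.g. "add", "eor", "ldr"
    dst: str                           # destination register
    srcs: Tuple[str, ...]              # source registers that bypass the shifter
    shifted_src: Optional[str] = None  # register fed through the inline shift


def forwards_without_stall(producer: Insn, consumer: Insn) -> bool:
    """Guess whether `consumer` can directly follow `producer` without a stall.

    Only models the case from the thread: the consumer has an inline-shifted
    operand2 and reads the producer's result.
    """
    if consumer.shifted_src is None:
        raise ValueError("only consumers with an inline shift are modelled")
    reads = consumer.srcs + (consumer.shifted_src,)
    if producer.dst not in reads:
        raise ValueError("consumer does not read the producer's result")

    # The shifted operand is needed in the early stage (EX1), so a freshly
    # produced value cannot arrive in time (inferred, see lead-in).
    if producer.dst == consumer.shifted_src:
        return False
    # One shifter in EX1: the consumer's inline shift occupies it, so the
    # producer cannot be a shifting instruction.
    if producer.shifted_src is not None:
        return False
    # Presumably only AGU-executable producers forward like this; eor and
    # ldr results were reported in the thread not to.
    return producer.mnemonic in AGU_OPS


# The pair quoted in the thread:
#     add.w r0, r2
#     eor.w r6, r0, r6, ror #22
add = Insn("add", "r0", ("r0", "r2"))
eor = Insn("eor", "r6", ("r0",), shifted_src="r6")
print(forwards_without_stall(add, eor))  # True
```

Keeping the rule as an explicit predicate makes each assumption a separate, easily changed line, which may help when comparing it against measurements.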