linuxppc / issues

Issues repository for linuxppc

CC_OPTIMIZE_FOR_SIZE leads to awful duplication of some 'static inline' functions #352

Closed chleroy closed 3 years ago

chleroy commented 3 years ago

The config provided by the kernel robot in https://lore.kernel.org/lkml/202102271820.WlZCxtzY-lkp@intel.com/T/#u leads to awful duplication of several 'static inline' functions.

arch_local_irq_save is there 44 times

c2095444 <arch_local_irq_save>:
c2095444:   94 21 ff f0     stwu    r1,-16(r1)
c2095448:   7c 60 00 a6     mfmsr   r3
c209544c:   54 69 04 5e     rlwinm  r9,r3,0,17,15
c2095450:   7d 20 01 24     mtmsr   r9
c2095454:   38 21 00 10     addi    r1,r1,16
c2095458:   4e 80 00 20     blr

fls is there 61 times

c2033ee8 <fls>:
c2033ee8:   94 21 ff f0     stwu    r1,-16(r1)
c2033eec:   7c 63 00 34     cntlzw  r3,r3
c2033ef0:   38 21 00 10     addi    r1,r1,16
c2033ef4:   20 63 00 20     subfic  r3,r3,32
c2033ef8:   4e 80 00 20     blr

__ilog2_u32 is there 12 times in one version and 10 times in a second version which calls fls

c20326bc <__ilog2_u32>:
c20326bc:   94 21 ff f0     stwu    r1,-16(r1)
c20326c0:   7c 63 00 34     cntlzw  r3,r3
c20326c4:   38 21 00 10     addi    r1,r1,16
c20326c8:   20 63 00 1f     subfic  r3,r3,31
c20326cc:   4e 80 00 20     blr
c20d42d4 <__ilog2_u32>:
c20d42d4:   94 21 ff f0     stwu    r1,-16(r1)
c20d42d8:   7c 08 02 a6     mflr    r0
c20d42dc:   90 01 00 14     stw     r0,20(r1)
c20d42e0:   4b ff ff e1     bl      c20d42c0 <fls>
c20d42e4:   80 01 00 14     lwz     r0,20(r1)
c20d42e8:   38 63 ff ff     addi    r3,r3,-1
c20d42ec:   38 21 00 10     addi    r1,r1,16
c20d42f0:   7c 08 03 a6     mtlr    r0
c20d42f4:   4e 80 00 20     blr

Others, like arch_set_bit and arch_clear_bit, are found many times as well, in more than one form:

c256d71c <set_bits>:
c256d71c:   94 21 ff f0     stwu    r1,-16(r1)
c256d720:   7d 20 20 28     lwarx   r9,0,r4
c256d724:   7d 29 1b 78     or      r9,r9,r3
c256d728:   7d 20 21 2d     stwcx.  r9,0,r4
c256d72c:   40 a2 ff f4     bne     c256d720 <set_bits+0x4>
c256d730:   38 21 00 10     addi    r1,r1,16
c256d734:   4e 80 00 20     blr
c256d738 <clear_bits>:
c256d738:   94 21 ff f0     stwu    r1,-16(r1)
c256d73c:   7d 20 20 28     lwarx   r9,0,r4
c256d740:   7d 29 18 78     andc    r9,r9,r3
c256d744:   7d 20 21 2d     stwcx.  r9,0,r4
c256d748:   40 a2 ff f4     bne     c256d73c <clear_bits+0x4>
c256d74c:   38 21 00 10     addi    r1,r1,16
c256d750:   4e 80 00 20     blr
c256d754 <arch_set_bit>:
c256d754:   94 21 ff f0     stwu    r1,-16(r1)
c256d758:   39 20 00 01     li      r9,1
c256d75c:   7d 23 18 30     slw     r3,r9,r3
c256d760:   38 21 00 10     addi    r1,r1,16
c256d764:   4b ff ff b8     b       c256d71c <set_bits>
c256d768 <arch_clear_bit>:
c256d768:   94 21 ff f0     stwu    r1,-16(r1)
c256d76c:   39 20 00 01     li      r9,1
c256d770:   7d 23 18 30     slw     r3,r9,r3
c256d774:   38 21 00 10     addi    r1,r1,16
c256d778:   4b ff ff c0     b       c256d738 <clear_bits>
c20ffbc0 <arch_set_bit>:
c20ffbc0:   39 20 00 01     li      r9,1
c20ffbc4:   94 21 ff f0     stwu    r1,-16(r1)
c20ffbc8:   7d 23 18 30     slw     r3,r9,r3
c20ffbcc:   7d 20 20 28     lwarx   r9,0,r4
c20ffbd0:   7d 29 1b 78     or      r9,r9,r3
c20ffbd4:   7d 20 21 2d     stwcx.  r9,0,r4
c20ffbd8:   40 a2 ff f4     bne     c20ffbcc <arch_set_bit+0xc>
c20ffbdc:   38 21 00 10     addi    r1,r1,16
c20ffbe0:   4e 80 00 20     blr
c20ffbe4 <arch_clear_bit>:
c20ffbe4:   39 20 00 01     li      r9,1
c20ffbe8:   94 21 ff f0     stwu    r1,-16(r1)
c20ffbec:   7d 23 18 30     slw     r3,r9,r3
c20ffbf0:   7d 20 20 28     lwarx   r9,0,r4
c20ffbf4:   7d 29 18 78     andc    r9,r9,r3
c20ffbf8:   7d 20 21 2d     stwcx.  r9,0,r4
c20ffbfc:   40 a2 ff f4     bne     c20ffbf0 <arch_clear_bit+0xc>
c20ffc00:   38 21 00 10     addi    r1,r1,16
c20ffc04:   4e 80 00 20     blr

And a lot more, like for instance the I/O accessors in_be32/in_le32, ...

mpe commented 3 years ago

I can reproduce that just by turning on CONFIG_CC_OPTIMIZE_FOR_SIZE.

mpe commented 3 years ago

$ make -s ppc64le_defconfig
$ grep -e LD_DEAD -e OPTIMIZE_FOR .config
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
$ make -s -j $(nproc)
$ objdump -d vmlinux | grep -c "<arch_set_bit>:"
0

$ ./scripts/config -d CC_OPTIMIZE_FOR_PERFORMANCE -e CC_OPTIMIZE_FOR_SIZE
$ make -s olddefconfig
$ grep -e LD_DEAD -e OPTIMIZE_FOR .config
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
$ make -s -j $(nproc)
$ objdump -d vmlinux | grep -c "<arch_set_bit>:"
82

$ ./scripts/config -e EXPERT -e LD_DEAD_CODE_DATA_ELIMINATION
$ make olddefconfig
$ grep -e LD_DEAD -e OPTIMIZE_FOR .config
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION=y
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y
$ make -s -j $(nproc)
$ objdump -d vmlinux | grep -c "<arch_set_bit>:"
82

mpe commented 3 years ago

So CC_OPTIMIZE_FOR_SIZE causes it AFAICS, and LD_DEAD_CODE_DATA_ELIMINATION has no effect.

npiggin commented 3 years ago

Yeah they're not dead, they just get out-of-lined into each file that calls them. The linker can't really fix this (may not have the right relocations or branch information). It would need some kind of link time optimisation or some new annotation like inline_or_library and then you give it a library copy if it decides not to inline.
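
For illustration only, here is a rough sketch of what a "library copy" arrangement can look like using plain C99 inline semantics. This is not something proposed in the thread and not a claim about how the kernel is built; the names fls_sketch and ilog2_sketch and the header/.c split described in the comments are hypothetical, and __builtin_clz assumes GCC or Clang.

```c
/* --- would live in a shared header (hypothetical bitops_sketch.h) --- */

/*
 * C99 'inline' without 'static': the compiler may inline this at a call
 * site. If it decides not to, it calls an external symbol instead of
 * emitting a private copy in the calling file.
 */
inline int fls_sketch(unsigned int x)
{
	/* 1-based index of the highest set bit, 0 when x == 0 */
	return x ? (int)(sizeof(x) * 8) - __builtin_clz(x) : 0;
}

/* --- would live in exactly one .c file (hypothetical lib_sketch.c) --- */

/*
 * Redeclaring with 'extern' makes this translation unit emit the one
 * out-of-line definition that non-inlined calls resolve to.
 */
extern inline int fls_sketch(unsigned int x);

/* --- an ordinary caller, in any file that includes the header --- */

int ilog2_sketch(unsigned int x)
{
	return fls_sketch(x) - 1;
}
```

With 'static inline', by contrast, every file where the compiler declines to inline gets its own local copy, which is the duplication counted above.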

mpe commented 3 years ago

Yeah I was originally thinking that something to do with the dead code elimination might be making it worse, but seems unrelated.

mpe commented 3 years ago

It seems this is working as designed, even if the result is a bit surprising.

And AFAICS CC_OPTIMIZE_FOR_SIZE still "works", it shrinks a ppc64le_defconfig from ~32MB to ~27MB.

mpe commented 3 years ago

Can we close this?

chleroy commented 3 years ago

Ok, let's close it as we have identified it is related to CC_OPTIMIZE_FOR_SIZE.

We will likely handle the most evident ones one-by-one by flagging them "always_inline" when relevant.
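
As a minimal sketch of what that flagging looks like (assuming GCC/Clang attribute syntax; the kernel spells the attribute __always_inline, and every name ending in _sketch, _hint or _forced below is made up for the example):

```c
/* Assumption: GCC/Clang attribute syntax, local macro name is hypothetical. */
#define sketch_always_inline inline __attribute__((__always_inline__))

/*
 * Hint only: with size optimisation the compiler may decline to inline
 * this and emit an out-of-line copy in each file that calls it.
 */
static inline unsigned long bit_mask_hint(unsigned int nr)
{
	return 1UL << nr;
}

/* Forced: the body is expanded at every direct call site. */
static sketch_always_inline unsigned long bit_mask_forced(unsigned int nr)
{
	return 1UL << nr;
}

/* Non-atomic stand-ins for set_bit()/clear_bit()-style helpers. */
void set_bit_sketch(unsigned int nr, unsigned long *word)
{
	*word |= bit_mask_forced(nr);
}

void clear_bit_sketch(unsigned int nr, unsigned long *word)
{
	*word &= ~bit_mask_forced(nr);
}

int test_bit_sketch(unsigned int nr, const unsigned long *word)
{
	return (*word & bit_mask_hint(nr)) != 0;
}
```

Whether a given helper is worth flagging can be checked the same way as earlier in the thread, by counting its symbols in the objdump output before and after the change.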