openhwgroup / corev-binutils-gdb

GNU General Public License v2.0

Force jump table inclusion #44

Closed silabs-kjetil closed 1 year ago

silabs-kjetil commented 1 year ago

Hi,

Is there a way to force the linker to include the Zcmt jump table? I am trying to build some projects using the Zcmt extension but the linker does not seem to be adding the jump table in my specific case. To test this issue I created some other small projects and it seems like the jump table is added in some cases but not in all cases.

For instance if I create a small hello-world project using printf() and linking with newlib libc.a then the jump table is added and used, however if I link the same application with newlib-nano libc_nano.a then the jump table is not included. What is triggering the inclusion of the jump table in some cases and not in others?

simonpcook commented 1 year ago

My understanding from testing is that the jump table is only added when it is profitable to do so (i.e. the size saving from having the entire table outweighs the size of the table itself). Since there are 64 entries (4 bytes each, therefore a 256-byte table), the saving from using these instructions would need to be more than 256 bytes. In the case where exactly 256 bytes (break-even) would be saved, the table is still not generated; there must be a net saving.
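
As a rough sketch of that break-even rule (illustrative only; the helper and the names below are not the actual linker code, and the 4-byte entry size is the rv32 case described above):

#include <stdbool.h>

#define ZCMT_ENTRY_SIZE 4   /* bytes per jump table entry on rv32 */

/* Hypothetical helper, not the actual linker code: emit the table only
   if routing jumps/calls through it saves strictly more bytes than the
   table itself occupies.  */
static bool
should_emit_jump_table (unsigned n_entries, unsigned total_bytes_saved)
{
  return total_bytes_saved > n_entries * ZCMT_ENTRY_SIZE;
}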

The reason you are seeing this with newlib's printf is that, with all the other functions printf pulls in, this threshold was presumably met, whereas the smaller newlib-nano presumably doesn't meet it.

In response to the original question: no, I don't think there is a way to force the table to be generated if the linker determines it would not decrease size.

silabs-kjetil commented 1 year ago

I see, so there is a calculation somewhere in the linker that figures out the size saving and decides whether to use it or not. That makes sense; now I can see if I can find out where in the code this decision is made and override it for some experimental testing.

Looking at the specification on https://github.com/riscv/riscv-code-size-reduction/releases it looks like the table could be up to 256 elements, so a full table would require 1024 bytes on rv32. However I would think that it's legal to have a truncated table for small applications and still get the size benefits.

I made a build of the linker where I modified the ZCMT_PRINT_TABLE_JUMP_ENTRIES variable to get some debug information. This prints out a table of symbols and code size savings, which is useful when debugging the behavior. Looking at this table, it looks like 64 entries are assigned to cm.jt, while the specification only allocates 16 entries to cm.jt as far as I can see (index < 16 = cm.jt).

linsinan1995 commented 1 year ago

@silabs-kjetil

I see, so there is a calculation somewhere in the linker that figures out the size saving and decides whether to use it or not. That makes sense; now I can see if I can find out where in the code this decision is made and override it for some experimental testing.

That is a good idea.

Looking at the specification on https://github.com/riscv/riscv-code-size-reduction/releases it looks like the table could be up to 256 elements, so a full table would require 1024 bytes on rv32. However I would think that it's legal to have a truncated table for small applications and still get the size benefits.

Truncation of the entry table is supported in the GNU toolchain, but cm.jalt always starts from index 64 (so the table always spans at least 64 slots once cm.jalt entries are needed), which might be the reason why no table is generated.

I made a build of the linker where I modified the ZCMT_PRINT_TABLE_JUMP_ENTRIES variable to get some debug information. This prints out a table of symbols and code size savings, which is useful when debugging the behavior. Looking at this table, it looks like 64 entries are assigned to cm.jt, while the specification only allocates 16 entries to cm.jt as far as I can see (index < 16 = cm.jt).

It sounds like a bug. Can you provide a test case for it?

silabs-kjetil commented 1 year ago

The change in the number of entries allocated for cm.jt seems to be a quite recent addition to the Zc specification.

abukharmeh commented 1 year ago

@silabs-kjetil Out of curiosity, what are these 64 entries and how many times is each called? We are not completely sure that this is the split that would be frozen, but all our benchmarks benefit very little beyond the first 16 entries of c.jt. PS: I know that the implementation has an issue where it allocates entries that would make the code size worse, but I haven't got around to generating a minimal test case for that.

abukharmeh commented 1 year ago

Also, there was a discussion somewhere that the table should grow outward from the split point; not sure if that's what is done here or if that is feasible?

simonpcook commented 1 year ago

Reading the comments here, I realise I was misinterpreting what I see out of objdump when describing the table, which is also why I was misjudging the size.

If I look at the linker map file for an example I have (calling sqrt a large number of times), I see that the table is 0x13c bytes in size, which would suggest it is being dynamically sized. Indeed, that is what I'm seeing in objdump:

000100e0 <__jvt_base$>:
   100e0:       00025a4c                index 0 # 25a4c <atexit>
   100e4:       0002589c                index 1 # 2589c <exit>
        ...
   101e0:       000257a0                index 64 # 257a0 <__floatsidf>
   101e4:       00023d80                index 65 # 23d80 <__adddf3>
   101e8:       00023ba4                index 66 # 23ba4 <sqrt>
   101ec:       0002581c                index 67 # 2581c <__hidden___udivsi3>
   101f0:       00025848                index 68 # 25848 <__umodsi3>
   101f4:       00024488                index 69 # 24488 <__divdf3>
   101f8:       0002590e                index 70 # 2590e <memset>
   101fc:       000258b4                index 71 # 258b4 <__libc_init_array>
   10200:       00025894                index 72 # 25894 <__errno>
   10204:       00024a08                index 73 # 24a08 <__ledf2>
   10208:       00025738                index 74 # 25738 <__fixdfsi>
   1020c:       00025a4c                index 75 # 25a4c <atexit>
   10210:       00024fd8                index 76 # 24fd8 <__subdf3>
   10214:       000256f8                index 77 # 256f8 <__unorddf2>
   10218:       00024ac8                index 78 # 24ac8 <__muldf3>

with the next function starting at 0x1021c.
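
That size also checks out arithmetically: 0x13c is 316 bytes, which at 4 bytes per entry is 79 entries (indices 0 through 78), and 0x100e0 + 0x13c = 0x1021c, matching where the next function starts.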

I think the reason I thought the full table was 64 entries in size was that the test file only called one function, so I was misreading index 64 (i.e. the first entry for a call) as meaning the table was 64 entries in size. Apologies for the confusion there.

silabs-kjetil commented 1 year ago

@silabs-kjetil Out of curiosity, what are these 64 entries and how many times is each called? We are not completely sure that this is the split that would be frozen, but all our benchmarks benefit very little beyond the first 16 entries of c.jt. PS: I know that the implementation has an issue where it allocates entries that would make the code size worse, but I haven't got around to generating a minimal test case for that.

My guess is that the ideal split would be application dependent. Here is an example of the entries in a use case where the table is dropped due to the assumption that it would not provide space savings. It looks like I'm hitting the issue you are describing where entries are allocated that would make the code size worse.

cm.jt:
        index=0, sym name=.L1, address=0x000118a4, savings=14
        index=1, sym name=.L12, address=0x00011914, savings=12
        index=2, sym name=.L23, address=0x00011614, savings=10
        index=3, sym name=.L26, address=0x00011bd4, savings=10
        index=4, sym name=.L15, address=0x000115bc, savings=8
        index=5, sym name=_fwalk_sglue, address=0x00010a88, savings=8
        index=6, sym name=__sflush_r, address=0x00012110, savings=6
        index=7, sym name=.L19, address=0x0001221c, savings=4
        index=8, sym name=_fclose_r, address=0x00011fd8, savings=4
        index=9, sym name=.L56, address=0x00012244, savings=4
        index=10, sym name=.L21, address=0x000119c8, savings=4
        index=11, sym name=.L57, address=0x00011b34, savings=4
        index=12, sym name=.L62, address=0x00011e30, savings=4
        index=13, sym name=register_tm_clones, address=0x00010548, savings=4
        index=14, sym name=.L13, address=0x00011320, savings=4
        index=15, sym name=.L20, address=0x00010914, savings=4
        index=16, sym name=__malloc_unlock, address=0x00011fd4, savings=4
        index=17, sym name=.Lwordified, address=0x00010fe0, savings=2
        index=18, sym name=.L16, address=0x00011210, savings=2
        index=19, sym name=.L4, address=0x00011400, savings=2
        index=20, sym name=.L59, address=0x00011de4, savings=2
        index=21, sym name=.L6, address=0x0001260c, savings=2
        index=22, sym name=.L86, address=0x00011a1c, savings=2
        index=23, sym name=.L4, address=0x00012004, savings=2
        index=24, sym name=.L1, address=0x000125d4, savings=2
        index=25, sym name=.L4, address=0x00012240, savings=2
        index=26, sym name=.L104, address=0x00011eb0, savings=2
        index=27, sym name=.L7, address=0x000110c8, savings=2
        index=28, sym name=_close_r, address=0x00010cbc, savings=2
        index=29, sym name=.L47, address=0x00011a08, savings=2
        index=30, sym name=.L58, address=0x0001237c, savings=2
        index=31, sym name=.L5, address=0x00011294, savings=2
        index=32, sym name=.L8, address=0x00010bd8, savings=2
        index=33, sym name=.L90, address=0x0001187c, savings=2
        index=34, sym name=.L7, address=0x0001219c, savings=2
        index=35, sym name=.L1, address=0x00012698, savings=2
        index=36, sym name=.L43, address=0x00011d8c, savings=2
        index=37, sym name=.L31, address=0x0001167c, savings=2
        index=38, sym name=.L14, address=0x000109a8, savings=2
        index=39, sym name=.L45, address=0x00011788, savings=2
        index=40, sym name=.L1, address=0x00012764, savings=2
        index=41, sym name=.L1, address=0x00012650, savings=2
        index=42, sym name=.L2, address=0x00012528, savings=2
        index=43, sym name=.L19, address=0x00011578, savings=2
        index=44, sym name=.L12, address=0x00010988, savings=2
        index=45, sym name=atexit, address=0x0001136c, savings=2
        index=46, sym name=.L8, address=0x00012064, savings=2
        index=47, sym name=_write_r, address=0x00010ecc, savings=2
        index=48, sym name=.L64, address=0x00011d40, savings=2
        index=49, sym name=.L19, address=0x00011940, savings=2
        index=50, sym name=global_stdio_init.part.0, address=0x00010764, savings=2
        index=51, sym name=.L8, address=0x0001186c, savings=2
        index=52, sym name=.L53, address=0x00011d2c, savings=2
        index=53, sym name=__register_exitproc, address=0x00012518, savings=2
        index=54, sym name=.L28, address=0x00011640, savings=2
        index=55, sym name=exit, address=0x00010680, savings=2
        index=56, sym name=.L31, address=0x00011c00, savings=2
        index=57, sym name=.L4, address=0x00011854, savings=2
        index=58, sym name=.L10, address=0x00011304, savings=2
        index=59, sym name=.L3, address=0x00012630, savings=2
        index=60, sym name=.L13, address=0x00010908, savings=2
        index=61, sym name=.L51, address=0x00011ae0, savings=2
        index=62, sym name=.Laligned, address=0x00010fdc, savings=2
cm.jalt:
        index=64, sym name=__malloc_unlock, address=0x00011fd4, savings=18
        index=65, sym name=_free_r, address=0x000114dc, savings=18
        index=66, sym name=_sbrk_r, address=0x00012460, savings=12
        index=67, sym name=memset, address=0x00010fc8, savings=12
        index=68, sym name=__errno, address=0x00012790, savings=12
        index=69, sym name=__malloc_lock, address=0x00011fd0, savings=8
        index=70, sym name=__sinit, address=0x00010a1c, savings=6
        index=71, sym name=_lseek_r, address=0x00010dfc, savings=4
        index=72, sym name=_fclose_r, address=0x00011fd8, savings=4
        index=73, sym name=global_stdio_init.part.0, address=0x00010764, savings=2
        index=74, sym name=__sflush_r, address=0x00012110, savings=2
        index=75, sym name=_close, address=0x000125b8, savings=2
        index=76, sym name=main, address=0x00010658, savings=2
        index=77, sym name=_write, address=0x00012748, savings=2
        index=78, sym name=__call_exitprocs, address=0x00011248, savings=2
        index=79, sym name=_sbrk, address=0x000126c4, savings=2
        index=80, sym name=_malloc_r, address=0x000117f8, savings=2
        index=81, sym name=_read_r, address=0x00010e64, savings=2
        index=82, sym name=_lseek, address=0x00012634, savings=2
        index=83, sym name=__libc_init_array, address=0x00010f34, savings=2
        index=84, sym name=_malloc_trim_r, address=0x00011384, savings=2
        index=85, sym name=_exit, address=0x00012600, savings=2
        index=86, sym name=_read, address=0x0001267c, savings=2
        index=87, sym name=__sfp_lock_release, address=0x00010a4c, savings=2
        index=88, sym name=__sfp_lock_acquire, address=0x00010a48, savings=2
        index=89, sym name=atexit, address=0x0001136c, savings=2
        index=90, sym name=deregister_tm_clones, address=0x0001051c, savings=2
        index=91, sym name=memcpy, address=0x000110a4, savings=2
silabs-kjetil commented 1 year ago

It looks like it should be possible to fix this issue by removing all entries where savings<=4 from the table to get good results.
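
A minimal sketch of that kind of trimming, assuming 4-byte entries on rv32 and a hypothetical jvt_entry record (the function and field names are illustrative, not the actual linker data structures):

#include <stddef.h>

#define ZCMT_ENTRY_SIZE 4   /* bytes per jump table entry on rv32 */

/* Hypothetical record mirroring the fields shown in the debug dump
   above; not the actual linker data structure.  */
struct jvt_entry
{
  const char *sym_name;
  unsigned long address;
  unsigned savings;   /* bytes saved by routing jumps/calls through this entry */
};

/* Keep only entries whose saving exceeds the cost of their own 4-byte
   slot; returns the new entry count.  */
static size_t
trim_jump_table (struct jvt_entry *entries, size_t n_entries)
{
  size_t kept = 0;
  for (size_t i = 0; i < n_entries; i++)
    if (entries[i].savings > ZCMT_ENTRY_SIZE)
      entries[kept++] = entries[i];
  return kept;
}

This simple per-entry filter ignores the index layout constraint discussed below (cm.jalt starting at index 64), so in practice the trimming would also have to respect the split point.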

abukharmeh commented 1 year ago

@linsinan1995 It looks like the table for c.jt grows downward; I imagine it would be better to grow from the split point, as that would ensure we reserve exactly what we use?

How does the linker calculate the savings? Is this performed after the normal jump instruction relaxation pass? I mean, does it take into account how jumps can be turned into either a 16/32-bit instruction or a 64-bit sequence, or would that have terrible runtime complexity?

The savings values printed in the debug log: do they represent bytes saved, and do they take into account each entry's size?

linsinan1995 commented 1 year ago

@abukharmeh

It looks like the table for c.jt grows downward; I imagine it would be better to grow from the split point, as that would ensure we reserve exactly what we use?

You're right, it would help. If there is no rule for the entry table layout in the spec, it would be a better choice.

How does the linker calculate the savings? Is this performed after the normal jump instruction relaxation pass? I mean, does it take into account how jumps can be turned into either a 16/32-bit instruction or a 64-bit sequence, or would that have terrible runtime complexity?

The calculation happens at the beginning of relaxation, and the different possible outcomes of relaxation are considered. These two slides might give you more information: the first is about the general idea and the savings calculation, and the second contains some updates such as trimming:

https://docs.google.com/presentation/d/1ilfCIc9Bv4Ryl2Uyl3uwYtANlh9GSIroQC8Z-VIfUPk/edit?usp=sharing https://docs.google.com/presentation/d/1dHNIZAUT_U3R1Gygmg4BNhGZacJA6C3j5hCup5bA950/edit?usp=sharing

The savings values printed in the debug log: do they represent bytes saved, and do they take into account each entry's size?

In the debug log, the size savings indicate the extra bytes saved compared to what relaxation alone would achieve. The entry size is taken into account when calculating the overall benefit:

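/* Emit the table only when the total bytes saved at least cover the
   space occupied by the table entries themselves.  */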
if (total_bytes_saved >= n_entries * entry_size)
  emit_table();

https://github.com/openhwgroup/corev-binutils-gdb/commit/1ac492b95bd468065e91e0df461c804fc72bf70f#diff-2ce943e176d34135968d6487a1b807347817fd0b06b387a2b55e113e3dc6b8faR5320-R5321

linsinan1995 commented 1 year ago

@silabs-kjetil

It looks like it should be possible to fix this issue by removing all entries where savings<=4 from the table to get good results.

The trimming of the entry table is very basic at the moment and there is still room for improvement, e.g. the TODO here.

One of the possible reasons for the worse size is the pass ordering of Zcmt. IMO, the best phase to generate table jump instructions is after all jump relaxation has happened, or at the post-link optimization stage (BOLT, for example), because the result calculated before relaxation might be inaccurate.

linsinan1995 commented 1 year ago

@abukharmeh

@linsinan1995 It looks like the table for c.jt grows downward; I imagine it would be better to grow from the split point, as that would ensure we reserve exactly what we use?

There is one problem with this change. Even if c.jt grows from the split point, the unused slots at the beginning of the entry table still cannot be removed, since c.jalt must strictly start from index 64.
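
For example, with only 16 c.jt entries growing back from the split point, slots 0 through 47 (48 entries, 192 bytes on rv32) would still have to be kept in the table so that c.jalt can begin exactly at index 64.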

This restriction could be relaxed if the remaining bits in the JVT CSR were used to indicate the number of c.jt entries, or if JVT pointed to the split point instead of the start of the entry table section.

jeremybennett commented 1 year ago

This is now resolved.