16-bit instruction to shift by a register value

jalobaba commented 2 years ago

I recently was reviewing the Zc spec and noticed it was missing 16-bit shift instructions by a register value. Zc only currently has 16-bit shift instructions that shift by an immediate value.

@tariqkurd told me these instructions haven't been thought of yet. Can we explore the code size benefit they provide and consider them for Zc?

Note that Arm Thumb-2 includes 16-bit shift by register instructions as do some other ISAs.

BTW, I'm new to this group. I've been watching for a while but plan to get more involved. I work for Qualcomm in San Diego and am a processor architect & designer (mostly embedded processors).

tariqkurd-repo commented 2 years ago

Hi, thanks @jalobaba, I appreciate your input. The first phase of Zc is adding 16-bit encodings (excluding those relating to EABI which isn't ratified). The second phase is likely to be adding 32-bit encodings, but TBD.

To encode 16-bit shift by register value, I think it would need to go into the following reserved spaces:

101 010 xxxx 10 101 10x xxxx 10

both of which overlap with C.FSDSP

The simple approach is to have encodings for CM.SLL, CM.SRL, C.SRA, which would take up both of those spaces, and each require two register operands. For example:

CM.SLL rsd', rsd', rs2'

rsd' = rsd' << rs2

out of range shifts return zero.

is this what you are thinking?

@abukharmeh how hard is this to add into the analysis script? If it's easy then I'm wondering if we should try and get them in now - given that it's much more efficient to add all the (non-EABI) 16-bit encodings at once. I think it's worth assessing them anyway to see if they help the benchmark suite. It would be more efficient if you ran it, but it would certainly help if we could get more people involved to assess new instructions.

Tariq

abukharmeh commented 2 years ago

It should be fairly easy if we know what patterns we are trying to optimise, is this trying to achieve the same thing as #141 ?

tariqkurd-repo commented 2 years ago

It's different, that one has an immediate shift distance and I think different source and destination registers. It's worth nothing that these wouldn't expand into an existing 32 bit encoding which might be a problem, bits it's certainly worth knowing how valuable they are

abukharmeh commented 2 years ago

It's worth nothing that these wouldn't expand into an existing 32 bit encoding Why not SRL AND SLL ?

This does not look promising unless I am missing something, I will let you know exact % savings later in the week, or perhaps Saturday.

jalobaba commented 2 years ago

The proposed 16-bit shift instructions will expand into existing 32-bit instructions.

jalobaba commented 2 years ago

Tariq, you asked if these were the instructions I was proposing: CM.SLL rsd', rsd', rs2'

Yes, exactly this with 2 registers and destructive shift.

As for out of range shifts (i.e. rs2' > 31), I'd match the behavior of the 32-bit RISC-V shift instructions. I just checked the spec and the 32-bit shift instructions just use the bottom 5-bit of rs2 so I would match this for the 16-bit versions and then there is no special behavior for out of range.

tariqkurd-repo commented 2 years ago

yes - you're right - CM.SLL would be the 16-bit encoding of SLL, and similarly for SRL/CM.SRL, SRA/CM.SRA

give that there doesn't seem to be much value in these in our existing benchmark suit @jalobaba do you have an open source application which sees a benefit from them?

jalobaba commented 2 years ago

No, I don't have an open-source application that benefits from 16-bit shift by register value.

We have several groups at Qualcomm with RISC-V code for their microcontroller subsystems. I'm starting to reach out to them and get an effort started to evaluate Zc on those code bases. Some use GCC and some use LLVM so we can use your preliminary versions of those to evaluate Zc. We can also try Ibrahim's script. Is that useful to you to help determine the benefits of 16-bit shift by register?

What is the best way to analyze our internal code to determine the benefit of 16-bit shift by register? I can always grep through the disassembly but is there a better way? How did you determine that the benchmarks you are using to evaluate Zc don't have much benefit for 16-bit shift by register? I'm hoping I could use the same approach.

tariqkurd-repo commented 2 years ago

The script doesn't support that currently, but simplistically you can grep for the 32-bit destructive forms and compare the saving from shrinking those by 2 bytes with the size of the entire text section. That should be simple enough.

On Wed, 15 Jun 2022, 17:31 jalobaba, @.***> wrote:

No, I don't have an open-source application that benefits from 16-bit shift by register value.

We have several groups at Qualcomm with RISC-V code for their microcontroller subsystems. I'm starting to reach out to them and get an effort started to evaluate Zc on those code bases. Some use GCC and some use LLVM so we can use your preliminary versions of those to evaluate Zc. We can also try Ibrahim's script. Is that useful to you to help determine the benefits of 16-bit shift by register?

What is the best way to analyze our internal code to determine the benefit of 16-bit shift by register? I can always grep through the disassembly but is there a better way? How did you determine that the benchmarks you are using to evaluate Zc don't have much benefit for 16-bit shift by register? I'm hoping I could use the same approach.

— Reply to this email directly, view it on GitHub https://github.com/riscv/riscv-code-size-reduction/issues/165#issuecomment-1156687322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOCTJAEYZD5I235CEHCPBI3VPIAPBANCNFSM5YLXEXPQ . You are receiving this because you commented.Message ID: @.***>

abukharmeh commented 2 years ago

I think it would be easier to add that to the script, you just need to add two lines similar to the following snippet, and then you can get the exact savings https://github.com/riscv/riscv-code-size-reduction/blob/44ac85786eed91ff3768de5cdaca27113f0209dc/benchmarks/HCA/hca#L121-L124

jalobaba commented 2 years ago

Thanks Ibrahim. I'll be working on this next week.

abukharmeh commented 2 years ago

I would be happy if you submit a PR for it :)

Even if it didn't up in the spec, it would help people try it on their applications etc.

So what you need to do is

Add a new action name, e.g elif action == 'cm_srl'
Use retrieve field function with required instructions list set as [srl], and additional condition as new lambda function that checks that the one of the source register and the destination register are equal, and all registers are within the compressed format (Can use utils.regwithin as that is just a function that checks if a register with an ABI name is from ones that fit the compressed encoding)
Use apply_optimizations to patch the instructions, the most important factor is to change WoE (Width of encoding to 16)

Then just use the script

hca -a cm_srl -f name_of_your_elf_fille

abukharmeh commented 2 years ago

Hi, as promised earlier, here are the complete results from our benchmark suite

I doubt we can justify the encoding space for these instructions with these results !

Are we happy to close this now @jalobaba @tariqkurd-repo

jnk0le commented 2 years ago

0.09% and 0.05% for rsd+rs2 instruction. Bang for the code point ratio seems to be in line with other instructions (c.mul is also 0.09%)

abukharmeh commented 2 years ago

Yes, you are right, they would be 2^(3+3)=64 points instructions as C.MUL which has similar savings !

tariqkurd-repo commented 2 years ago

Hi @abukharmeh are we also considering CM.SRA? Are there any results for that one? C.MUL has better resutls than these two added together, and of course each of these has the same number of code-points as C.MUL so the overall benefit is less than half. Note that C.MUL saves >1% on several benchmarks, so the overall average isn't very useful in this case.

I'm interested to see CM.SRA if possible.

At the moment I feel that we need to put more thought into the very small amount of remaining encoding space - we can't allocate all if it to Zc, and so that we should proceed without these instructions. that doesn't mean that we can't add them in the next revision of Zc if they can be justified against other options at that point.

...otherwise we also risk failing to ratify anything this year... and that would certainly be worse for RISC-V in general than proceeding without them.

what do people think?

abukharmeh commented 2 years ago

what do people think? I agree, I think its too late to add anything if we want ratification this year, and I dont see the savings from these instructions big enough to justify any addition to the current revision of extension !

jnk0le commented 2 years ago

c.mul peaks in pure dsp code, many of instances being multiply accumulate that can be handled by P ext as a single 32bit instruction. Otherwise it's average 0.2x% or lower bottomed down by debian packages.

In terms of code point efficiency c.lhu,c.lh,c.sh are even worser

jnk0le commented 2 years ago

...otherwise we also risk failing to ratify anything this year... and that would certainly be worse for RISC-V in general than proceeding without them

think its too late to add anything if we want ratification this year,!

I see one significant issue with "adding zce insrcutions later"; fragmentation. And those are likely to end up not used by A profile packages at all

eg. Zcev1 goes into profile with mandatory vector, such profile is very likely to become defacto lowest common denominator for targeting software (and optionally RV64GC as "legacy" one). If Zcev2 brings extra 0.5% code size reduction - nobody would like to break with LCD as well as provide yet another binary for just that 0.5%.

zcm* situation is a bit better, but not entirely.

Silabs-ArjanB commented 2 years ago

what do people think? I agree, I think its too late to add anything if we want ratification this year, and I dont see the savings from these instructions big enough to justify any addition to the current revision of extension

In our opinion it is definitely not worth putting ratification this year at risk by adding new instructions. Given that Zc* already consists of 7 groups and that likely/hopefully other ideas we will pop up, these instructions (and others) can be added later (if needed in a new sub-extension).

tariqkurd-repo commented 2 years ago

yes exactly - and delaying Zc for several months for minimal extra code-size improvement will only make the fragmentation worse.

tariqkurd-repo commented 2 years ago

agreed - so no change now - and we can consider these in the future.

jnk0le commented 2 years ago

consists of 7 groups

In current zce spec, i see only 4 (+1 for RVE eabi) "fragmentation" groups, 3 (+1) of them being embedded only.

yes exactly - and delaying Zc for several months for minimal extra code-size improvement will only make the fragmentation worse.

~given that the current compiler results match exactly the script ones.~ (they don't, table added on the right was misleading) Maybe the compiler proof step could be skipped for obviously simple things like sll/srl/sra in v0.80

tariqkurd-repo commented 2 years ago

there will only be a v0.80 if the architecture review committee request it

abukharmeh commented 2 years ago

@tariqkurd-repo Can we keep the issue open as its future idea ?

riscvarchive / riscv-code-size-reduction

16-bit instruction to shift by a register value #165