Open jalobaba opened 2 years ago
Hi, thanks @jalobaba, I appreciate your input. The first phase of Zc is adding 16-bit encodings (excluding those relating to EABI which isn't ratified). The second phase is likely to be adding 32-bit encodings, but TBD.
To encode 16-bit shift by register value, I think it would need to go into the following reserved spaces:
101 010 xxxx 10 101 10x xxxx 10
both of which overlap with C.FSDSP
The simple approach is to have encodings for CM.SLL, CM.SRL, C.SRA, which would take up both of those spaces, and each require two register operands. For example:
CM.SLL rsd', rsd', rs2'
rsd' = rsd' << rs2
out of range shifts return zero.
is this what you are thinking?
@abukharmeh how hard is this to add into the analysis script? If it's easy then I'm wondering if we should try and get them in now - given that it's much more efficient to add all the (non-EABI) 16-bit encodings at once. I think it's worth assessing them anyway to see if they help the benchmark suite. It would be more efficient if you ran it, but it would certainly help if we could get more people involved to assess new instructions.
Tariq
It should be fairly easy if we know what patterns we are trying to optimise, is this trying to achieve the same thing as #141 ?
It's different, that one has an immediate shift distance and I think different source and destination registers. It's worth nothing that these wouldn't expand into an existing 32 bit encoding which might be a problem, bits it's certainly worth knowing how valuable they are
It's worth nothing that these wouldn't expand into an existing 32 bit encoding
Why not SRL AND SLL ?
This does not look promising unless I am missing something, I will let you know exact % savings later in the week, or perhaps Saturday.
The proposed 16-bit shift instructions will expand into existing 32-bit instructions.
Tariq, you asked if these were the instructions I was proposing: CM.SLL rsd', rsd', rs2'
Yes, exactly this with 2 registers and destructive shift.
As for out of range shifts (i.e. rs2' > 31), I'd match the behavior of the 32-bit RISC-V shift instructions. I just checked the spec and the 32-bit shift instructions just use the bottom 5-bit of rs2 so I would match this for the 16-bit versions and then there is no special behavior for out of range.
yes - you're right - CM.SLL would be the 16-bit encoding of SLL, and similarly for SRL/CM.SRL, SRA/CM.SRA
give that there doesn't seem to be much value in these in our existing benchmark suit @jalobaba do you have an open source application which sees a benefit from them?
No, I don't have an open-source application that benefits from 16-bit shift by register value.
We have several groups at Qualcomm with RISC-V code for their microcontroller subsystems. I'm starting to reach out to them and get an effort started to evaluate Zc on those code bases. Some use GCC and some use LLVM so we can use your preliminary versions of those to evaluate Zc. We can also try Ibrahim's script. Is that useful to you to help determine the benefits of 16-bit shift by register?
What is the best way to analyze our internal code to determine the benefit of 16-bit shift by register? I can always grep through the disassembly but is there a better way? How did you determine that the benchmarks you are using to evaluate Zc don't have much benefit for 16-bit shift by register? I'm hoping I could use the same approach.
The script doesn't support that currently, but simplistically you can grep for the 32-bit destructive forms and compare the saving from shrinking those by 2 bytes with the size of the entire text section. That should be simple enough.
On Wed, 15 Jun 2022, 17:31 jalobaba, @.***> wrote:
No, I don't have an open-source application that benefits from 16-bit shift by register value.
We have several groups at Qualcomm with RISC-V code for their microcontroller subsystems. I'm starting to reach out to them and get an effort started to evaluate Zc on those code bases. Some use GCC and some use LLVM so we can use your preliminary versions of those to evaluate Zc. We can also try Ibrahim's script. Is that useful to you to help determine the benefits of 16-bit shift by register?
What is the best way to analyze our internal code to determine the benefit of 16-bit shift by register? I can always grep through the disassembly but is there a better way? How did you determine that the benchmarks you are using to evaluate Zc don't have much benefit for 16-bit shift by register? I'm hoping I could use the same approach.
— Reply to this email directly, view it on GitHub https://github.com/riscv/riscv-code-size-reduction/issues/165#issuecomment-1156687322, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOCTJAEYZD5I235CEHCPBI3VPIAPBANCNFSM5YLXEXPQ . You are receiving this because you commented.Message ID: @.***>
I think it would be easier to add that to the script, you just need to add two lines similar to the following snippet, and then you can get the exact savings https://github.com/riscv/riscv-code-size-reduction/blob/44ac85786eed91ff3768de5cdaca27113f0209dc/benchmarks/HCA/hca#L121-L124
Thanks Ibrahim. I'll be working on this next week.
I would be happy if you submit a PR for it :)
Even if it didn't up in the spec, it would help people try it on their applications etc.
So what you need to do is
elif action == 'cm_srl'
Then just use the script
hca -a cm_srl -f name_of_your_elf_fille
Hi, as promised earlier, here are the complete results from our benchmark suite
I doubt we can justify the encoding space for these instructions with these results !
Are we happy to close this now @jalobaba @tariqkurd-repo
0.09% and 0.05% for rsd
+rs2
instruction. Bang for the code point ratio seems to be in line with other instructions (c.mul
is also 0.09%)
Yes, you are right, they would be 2^(3+3)=64 points instructions as C.MUL which has similar savings !
Hi @abukharmeh are we also considering CM.SRA? Are there any results for that one? C.MUL has better resutls than these two added together, and of course each of these has the same number of code-points as C.MUL so the overall benefit is less than half. Note that C.MUL saves >1% on several benchmarks, so the overall average isn't very useful in this case.
I'm interested to see CM.SRA if possible.
At the moment I feel that we need to put more thought into the very small amount of remaining encoding space - we can't allocate all if it to Zc, and so that we should proceed without these instructions. that doesn't mean that we can't add them in the next revision of Zc if they can be justified against other options at that point.
...otherwise we also risk failing to ratify anything this year... and that would certainly be worse for RISC-V in general than proceeding without them.
what do people think?
what do people think?
I agree, I think its too late to add anything if we want ratification this year, and I dont see the savings from these instructions big enough to justify any addition to the current revision of extension !
c.mul peaks in pure dsp code, many of instances being multiply accumulate that can be handled by P ext as a single 32bit instruction. Otherwise it's average 0.2x% or lower bottomed down by debian packages.
In terms of code point efficiency c.lhu
,c.lh
,c.sh
are even worser
...otherwise we also risk failing to ratify anything this year... and that would certainly be worse for RISC-V in general than proceeding without them
think its too late to add anything if we want ratification this year,!
I see one significant issue with "adding zce insrcutions later"; fragmentation. And those are likely to end up not used by A profile packages at all
eg. Zcev1 goes into profile with mandatory vector, such profile is very likely to become defacto lowest common denominator for targeting software (and optionally RV64GC as "legacy" one). If Zcev2 brings extra 0.5% code size reduction - nobody would like to break with LCD as well as provide yet another binary for just that 0.5%.
zcm* situation is a bit better, but not entirely.
what do people think? I agree, I think its too late to add anything if we want ratification this year, and I dont see the savings from these instructions big enough to justify any addition to the current revision of extension
In our opinion it is definitely not worth putting ratification this year at risk by adding new instructions. Given that Zc* already consists of 7 groups and that likely/hopefully other ideas we will pop up, these instructions (and others) can be added later (if needed in a new sub-extension).
yes exactly - and delaying Zc for several months for minimal extra code-size improvement will only make the fragmentation worse.
agreed - so no change now - and we can consider these in the future.
consists of 7 groups
In current zce spec, i see only 4 (+1 for RVE eabi) "fragmentation" groups, 3 (+1) of them being embedded only.
yes exactly - and delaying Zc for several months for minimal extra code-size improvement will only make the fragmentation worse.
~given that the current compiler results match exactly the script ones.~ (they don't, table added on the right was misleading) Maybe the compiler proof step could be skipped for obviously simple things like sll/srl/sra in v0.80
there will only be a v0.80 if the architecture review committee request it
@tariqkurd-repo Can we keep the issue open as its future idea ?
I recently was reviewing the Zc spec and noticed it was missing 16-bit shift instructions by a register value. Zc only currently has 16-bit shift instructions that shift by an immediate value.
@tariqkurd told me these instructions haven't been thought of yet. Can we explore the code size benefit they provide and consider them for Zc?
Note that Arm Thumb-2 includes 16-bit shift by register instructions as do some other ISAs.
BTW, I'm new to this group. I've been watching for a while but plan to get more involved. I work for Qualcomm in San Diego and am a processor architect & designer (mostly embedded processors).