llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[RISCV] Sources with different EEW must use different registers #80099

Open topperc opened 5 months ago

topperc commented 5 months ago

At some point this text was added to the vector specification:

> A vector register cannot be used to provide source operands with more than one EEW for a single instruction. A mask register source is considered to have EEW=1 for this constraint. An encoding that would result in the same vector register being read with two or more different EEWs, including when the vector register appears at different positions within two or more vector register groups, is reserved.

The RISC-V backend does not emit any instructions for bitcasts that change element size. As a result, a single instruction can end up reading the same register with different EEWs. See an example here: https://godbolt.org/z/WTGKfxo3f

Filing this based on a message on the mailing list here https://lists.riscv.org/g/tech-vector-ext/message/919
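
The quoted constraint can be illustrated with a minimal sketch. This is not backend code: `SourceGroup`, `registers`, and `has_eew_conflict` are hypothetical names, and register groups are modelled simply as a base register number plus a register count.

```python
from itertools import combinations
from typing import NamedTuple


class SourceGroup(NamedTuple):
    """One vector source operand (hypothetical model, not an LLVM type)."""
    base: int   # first register of the group, e.g. 8 for v8
    eew: int    # effective element width in bits (1 for a mask source)
    nregs: int  # registers in the group, i.e. max(1, EMUL)


def registers(group):
    """Set of vector register numbers covered by the group."""
    return set(range(group.base, group.base + group.nregs))


def has_eew_conflict(sources):
    """True if any register is read with two different EEWs."""
    for a, b in combinations(sources, 2):
        if a.eew != b.eew and registers(a) & registers(b):
            return True
    return False
```

For example, a wide source occupying v8-v9 with EEW=16 and a narrow source in v9 with EEW=8 would be flagged, even though the two operands name different base registers.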

rofirrim commented 5 months ago

I've been thinking about this. If the issue is only with more than one source using a different EEW (as opposed to a mismatch between a source and a destination), I think the only impacted instructions are indexed stores (when e<eew> differs from <sew>), vrgatherei16.vv when sew != 16, and all the .wv instructions (widenings and narrowings, where with enough reinterprets and/or lmul-trunc/ext we could force the compiler to use the same register in the w and v source operands). I'm not considering how useful or realistic these use cases are, so we may not have to fix all of them.
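
As a rough sketch of that classification, the source EEWs of each instruction class can be tabulated from SEW. The `kind` strings and operand labels below are made up for illustration; only the EEW relationships are taken from the description above.

```python
def source_eews(kind, sew, index_eew=None):
    """Return {operand label: EEW} for the vector sources of one class."""
    if kind == "indexed_store":
        # vsuxei<eew>.v / vsoxei<eew>.v: data uses SEW, index uses <eew>
        return {"data": sew, "index": index_eew}
    if kind == "vrgatherei16":
        # vrgatherei16.vv: data uses SEW, index is always 16 bits
        return {"data": sew, "index": 16}
    if kind in ("widening_wv", "narrowing_wv"):
        # e.g. vwadd.wv, vnsrl.wv: the w source uses 2*SEW, the v source SEW
        return {"wide": 2 * sew, "narrow": sew}
    raise ValueError(f"unknown instruction class: {kind}")


def can_conflict(kind, sew, index_eew=None):
    """True if the sources of this class use more than one EEW."""
    return len(set(source_eews(kind, sew, index_eew).values())) > 1
```

Under this model the .wv forms can always conflict, while indexed stores and vrgatherei16.vv only can when the index width differs from SEW.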

I see two different approaches to attack this: before RA and after RA. In both approaches the goal is to use as a source another register containing the relevant data that will be used by the instruction (i.e. a copy) if we determine there is a conflict.

Before RA (and before insert-vsetvli), we can analyse the problematic MIs: check their vtype immediate (which tells us the sew) and the opcode (which tells us the eew/emul of the operands), and then check whether the vregs of the operands are the same. I've been doing some experiments, and in MachineIR we sometimes have copies between vregs (for convenience, I guess), which means that establishing equivalence is not trivial. A pro of this approach is that when we detect a conflict we can use vmv.v.v with the eew/emul of the operand we replace and the vl of the user instruction. One concern is that we're kind of guessing what RA will end up doing with the vreg uses of this instruction: if the conflict would not have materialised anyway, the extra copy is suboptimal.
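
To make the equivalence problem concrete, here is a toy sketch that treats two vregs as potentially sharing a register when they trace back to the same value through copies. The `copies` map (destination vreg to source vreg) and both function names are assumptions for illustration, not MachineIR APIs.

```python
def copy_root(vreg, copies):
    """Follow copy definitions back to the vreg that first held the value."""
    seen = set()
    while vreg in copies and vreg not in seen:
        seen.add(vreg)  # guard against malformed cyclic input
        vreg = copies[vreg]
    return vreg


def may_share_register(a, b, copies):
    """Conservative guess: operands fed by the same root may be coalesced."""
    return copy_root(a, copies) == copy_root(b, copies)
```

A matching root only says the allocator could assign both operands the same physical register, not that it will, which is the guessing concern raised above.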

Doing it after RA seems more precise, check the problematic MIs (I did a quick check and IIUC RISCVInsertVSETVLI pass is preserving the vtype immediate of the pseudos so this is still an option) and the opcode. If a conflict happens it is going to be obvious, as now we're in physreg land. Now the issue is how to do the copy: I assume we can use RegScavenger to come up with a free vector register and then use, in the worst case, a vmv<n>r.v instruction. Doing a vmv.v.v might be harder because changing vtype to the required eew/emul of the operand we're going to copy might be more complicated. I think that split-RA makes it easier to use vmv.v.v here, because we have not done insert-vsetvli yet and so the proper vtype is still part of the instruction immediate. If RegScavenger fails to find a free register, I understand we will have to allocate a spill slot, pick a suitable register and spill it, copy the value of the operand, update the MI, and reload the register we had to spill. (If the register pressure is this dire, it probably doesn't matter whether we do this pre-RA or post-RA, except that pre-RA means all the spill handling is already done by RA.)
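
The post-RA idea can be sketched as a toy model: look for a free, suitably aligned register group and emit a whole-register move, falling back to "needs a spill" when none exists. This stands in for RegScavenger and does not use its API; `find_free_group` and `plan_copy` are hypothetical, and the sketch assumes `live` is the set of vector register numbers occupied at the instruction.

```python
def find_free_group(live, nregs):
    """Return the base of a free nregs-aligned group in v0..v31, or None."""
    for base in range(0, 32, nregs):
        if not any(r in live for r in range(base, base + nregs)):
            return base
    return None


def plan_copy(src_base, nregs, live):
    """Return the whole-register move to insert, or None if a spill is needed."""
    base = find_free_group(live, nregs)
    if base is None:
        return None  # no free group: the caller would have to spill and reload
    return f"vmv{nregs}r.v v{base}, v{src_base}"
```

A real implementation would probably also avoid v0 when the instruction is masked; the sketch ignores that.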

Perhaps there is a simpler way to address this. Thoughts?

wangpc-pp commented 5 months ago

Can this issue be fixed by things like @earlyclobber or $rd = $dest constraints simply? We may be able to extend it to constrain two source registers, like $rs1 != $rs2, $rd != $rs1, etc.

topperc commented 5 months ago

> Can this issue be fixed by things like @earlyclobber or $rd = $dest constraints simply? We may be able to extend it to constrain two source registers, like $rs1 != $rs2, $rd != $rs1, etc.

I don't think we can simply extend earlyclobber. earlyclobber doesn't really check the registers. It makes the live interval of the destination start before the live intervals of the inputs end. This forces the register allocator to pick a different register, since the intervals interfere.
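
A simplified model of that mechanism, assuming half-open intervals and four slots per instruction with the early-clobber slot placed before the normal register slot. The constants and helper names are illustrative, not LLVM's actual slot index implementation.

```python
SLOTS_PER_INSTR = 4
EARLY_CLOBBER, REGISTER = 1, 2  # slot offsets within one instruction


def slot(instr, kind):
    """Position of a given slot of instruction number `instr`."""
    return instr * SLOTS_PER_INSTR + kind


def input_interval(def_instr, last_use_instr):
    """Half-open interval of a value defined normally and last read at a use."""
    return (slot(def_instr, REGISTER), slot(last_use_instr, REGISTER))


def output_start(instr, early_clobber=False):
    """Where the destination's interval begins at the defining instruction."""
    return slot(instr, EARLY_CLOBBER if early_clobber else REGISTER)


def interferes(a, b):
    """Two half-open intervals interfere if they overlap."""
    return a[0] < b[1] and b[0] < a[1]
```

With an input last used at instruction 3, a normal destination defined at instruction 3 starts exactly where the input ends, so the two may share a register. An earlyclobber destination starts one slot earlier and overlaps the input. In this model, two source operands that name the same virtual register share one interval, so there is nothing to make interfere.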

For the issue here, we're only talking about sources so I don't think there's any manipulation of the live intervals we can do.

lukel97 commented 5 months ago

> Doing it after RA seems more precise, check the problematic MIs (I did a quick check and IIUC RISCVInsertVSETVLI pass is preserving the vtype immediate of the pseudos so this is still an option) and the opcode.

The original vtype immediate is kept in the pseudos, but it may be out-of-date and not the same as what the pseudo actually executes with. There are some optimisations where we tweak the vtype to remove a vsetvli provided the instruction doesn't demand it.

e.g. loads and stores don't use the SEW, only the SEW/LMUL ratio from vtype. So if the previous vsetvli had a different SEW but the ratio is the same, we will just use that instead. So SEW in the vtype immediate will be mismatched.
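
A small worked example of the ratio argument, using exact fractions so fractional LMUL is handled. The helper names are made up, and VLMAX is computed as VLEN * LMUL / SEW.

```python
from fractions import Fraction


def sew_lmul_ratio(sew, lmul):
    """SEW/LMUL as an exact fraction (lmul may be fractional, e.g. 1/2)."""
    return Fraction(sew) / Fraction(lmul)


def vlmax(vlen, sew, lmul):
    """VLMAX = VLEN * LMUL / SEW."""
    return int(Fraction(vlen) * Fraction(lmul) / sew)


def same_ratio(a, b):
    """True if two (sew, lmul) pairs have the same SEW/LMUL ratio."""
    return sew_lmul_ratio(*a) == sew_lmul_ratio(*b)
```

For instance, e32/m1 and e16/mf2 both have a ratio of 32 and the same VLMAX for a given VLEN, which is why an instruction that only depends on the ratio could run under either vtype while its recorded SEW no longer matches.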

With that said though, I don't think this applies to any of the instructions we're concerned about here. An indexed load/store should demand and use the exact SEW, so its vtype immediate should be accurate. And we don't do any vtype tricks with .wv arithmetic pseudos.