riscv-non-isa / rvv-intrinsic-doc

https://jira.riscv.org/browse/RVG-153
BSD 3-Clause "New" or "Revised" License

Passing vl as arguments of intrinsics #8

Closed Hsiangkai closed 4 years ago

Hsiangkai commented 4 years ago

We provide intrinsics both with and without a vl argument at the same time:

vop_vv_type(a, b)
vop_vv_type_vl(a, b, gvl)
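
As a concrete instance, with one element type spelled out (the signatures below use the provisional names from this thread, not a finalized API):

vint32m1_t vadd_vv_i32m1(vint32m1_t a, vint32m1_t b);                  // uses the current VL implicitly
vint32m1_t vadd_vv_i32m1_vl(vint32m1_t a, vint32m1_t b, size_t gvl);   // takes the AVL explicitly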

rdolbeau commented 4 years ago

I think this is related to my comment in #6 (https://github.com/sifive/rvv-intrinsic-doc/issues/6#issuecomment-613368611) on LMUL/SEW/VL being parameters.

The ability to (temporarily) change the VL is useful for some mixed-type usage (i.e. the same data are handled as different types with different element widths). Whether it's legitimate under the specification to 'play' with VL, and how much an implementation can restrict such a use case, is unclear to me.

nick-knight commented 4 years ago

To avoid muddying the waters, let's stick to the terminology of the V-extension spec: "AVL" (application vector length) is what you request, "VL" is what you're given. If you request AVL too large, then you'll get a strictly smaller VL (the typical case, in my experience).

So what we're talking about here is whether or not our intrinsics should have "AVL" as an input argument. Obviously the "SETVL" intrinsic must input AVL, so the rest of this comment concerns the non-SETVL intrinsics.

Personally, I have not found a use-case where I wanted/needed to be able to input AVL to an intrinsic. (@rdolbeau: I'd love to learn more about the application you have in mind.) In my opinion, passing in AVL adds unnecessary boilerplate and makes writing intrinsics code more error-prone, ultimately increasing the cognitive load on the programmer. I only plan to use the non-AVL-argument versions.

I would support removing the AVL-argument versions, but I insist that the non-AVL-argument versions remain.

rdolbeau commented 4 years ago

Many integer codes, usually crypto, starting with Chacha20 (example available here: http://www.dolbeau.name/dolbeau/crypto/, targeted at the EPI compiler).

jim-wilson commented 4 years ago

I personally don't care whether intrinsics have an AVL or VL argument or not. However, I think we should discourage people from trying to modify a VL that is obtained from a vsetvl intrinsic. I saw a mention of making them rvalues which looks like a good idea to me. Though I'm not sure if that can be easily implemented.

rdolbeau commented 4 years ago

The user doesn't need to modify VL, but it sure makes their life a whole lot easier. In fact, the ability to change VL is what drew me to V in the first place :-)

Take a trivial example like DAXPY for illustration:

for (i = 0 ; i < N ; i++)
  Y[i] = A*X[i] + Y[i];

For most instruction sets with a fixed width VW (assumed to be a power of 2 here), you end up needing an explicit tail loop:

for (i = 0 ; i < (N & ~(VW-1)) ; i+= VW) {loop L1}
  vector(Y[i..i+VW-1]) = vector(A) * vector(X[i..i+VW-1]) + vector(Y[i..i+VW-1]);
for (/* i */ ; i < N ; i++) {loop L2}
  Y[i] = A*X[i] + Y[i];

Of course you might want a vectorized tail for wide vectors, so you replace L2 with a single vector statement when masking is available:

mask=generate_mask(N & ~(VW-1));
vector(mask, Y[i..i+VW-1]) = vector(mask, A) * vector(mask, X[i..i+VW-1]) + vector(mask, Y[i..i+VW-1]);

Now in V you don't know the width at compile time, so you use the result of vsetvl() for VW (SVE behaves similarly):

VW=vsetvl(BIGNUMBER);
ROUNDEDN=roundNtoMultipleOfVW(N, VW); // round N down to the largest integer multiple of VW that is less than or equal to N; VW may not be a power of two...
for (i = 0 ; i < ROUNDEDN ; i+= VW) {loop L1}
  vector(Y[i..i+VW-1]) = vector(A) * vector(X[i..i+VW-1]) + vector(Y[i..i+VW-1]);
for (/* i */ ; i < N ; i++) {loop L2}
  Y[i] = A*X[i] + Y[i];
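
For concreteness, one way to write that helper (the name is just this example's placeholder): since VW may not be a power of two, plain integer division does it:

size_t roundNtoMultipleOfVW(size_t N, size_t VW) {
  return (N / VW) * VW; // largest multiple of VW that is <= N
}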

So far, so good. But rather than thinking about masking to replace L2 the way it has to be done for other SIMD ISAs (in V, probably a vector index, a scalar broadcast & a comparison to generate the mask itself?), the user is going to wonder why this trivial copy/paste is not legit:

MAXVW=vsetvl(BIGNUMBER);
ROUNDEDN=roundNtoMultipleOfVW(N, MAXVW);
for (i = 0 ; i < ROUNDEDN ; i+= MAXVW)
  vector(Y[i..i+MAXVW-1]) = vector(A) * vector(X[i..i+MAXVW-1]) + vector(Y[i..i+MAXVW-1]);
LEFTOVER=vsetvl(N-ROUNDEDN);
if (LEFTOVER>0)
  vector(Y[i..i+LEFTOVER-1]) = vector(A) * vector(X[i..i+LEFTOVER-1]) + vector(Y[i..i+LEFTOVER-1]);

LEFTOVER is trivially smaller than MAXVW, so it will fit in a register. It's basically the same as masking, but with the added guarantee to the hardware that the mask is all-1 for the LEFTOVER least-significant places, then all-0. No reason for that not to work as well as masking, from the software developer's point of view...

In fact, depending on the dynamic cost of vsetvl() back to the old value (and the size of the loop: the bigger the loop, the less relatively costly the spurious vsetvl()), you might want to simplify the code considerably and go:

MAXVW=vsetvl(BIGNUMBER);
for (i = 0 ; i < N ; i+= MAXVW) {
  VW=vsetvl(minimum(N-i, MAXVW));
  vector(Y[i..i+VW-1]) = vector(A) * vector(X[i..i+VW-1]) + vector(Y[i..i+VW-1]);
}

But for this to work, the user needs the guarantee that a vsetvl() to a value smaller than the returned value is legit...

And if it isn't - why is setting the VL needed in the first place? If the only way to deterministically reduce the width is masking, then you only need a read-only function to get the VL, and we're basically in the same situation as other SIMD ISA (... assuming we can reliably reinterpret VLEN bits from one SEW to another and that MAXVL[SEW] == VLEN/SEW, because otherwise many algorithms just won't work - at least not efficiently).

Cordially,

issuehsu commented 4 years ago

For your DAXPY example, I would write the following code:

for (i = 0, n=vsetvl(N); i < N ; i+=n) {
  vector(Y[i..i+n-1]) = vector(A) * vector(X[i..i+n-1]) + vector(Y[i..i+n-1]);
}

So you don't need a tail loop.

rdolbeau commented 4 years ago

@issuehsu This code will not work, because you still need a tail loop. If N%vsetvl(N) is not 0, your code breaks: it will overshoot the end of the arrays.

Trivial case: N=3, n=vsetvl(N)=2. Your first iteration does 0..1, the second does 2..3. Oops, there is no X[3] or Y[3]...

issuehsu commented 4 years ago

Sorry, my fault. How about n=vsetvl(N-i)?

rdolbeau commented 4 years ago

That would work, and that's what I suggested as the 'simple' way in my last code block... but it requires doing it every iteration (or just the once as a vectorized tail), and it assumes that the result will be what you requested...

issuehsu commented 4 years ago

I am not sure whether the tail-loop implementation will improve performance or not. But the simple implementation will be more readable, and I remember that calling vsetvl in each loop iteration will not hurt performance (at least in SiFive's design).

nick-knight commented 4 years ago

The approach @issuehsu is suggesting is how the RVV architects intend loop strip-mining to be done in RVV. Here's the SAXPY example from the RVV spec: https://github.com/riscv/riscv-v-spec/blob/master/example/saxpy.s I see no correctness issues regarding an assumption that AVL == VL.

For more background reading, please see: https://www.sigarch.org/simd-instructions-considered-harmful/ But please note that the RVV code in this article is out-of-date.

The big picture here is that the V-extension architects really did intend for users to call vsetvli at every loop iteration. In my benchmarking experience on SiFive's vector cores, this vsetvli never causes any performance issues. (I'll admit that my FFT codes do avoid it, at least for powers-of-two problem sizes...)
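
In C, the strip-mining pattern of saxpy.s looks roughly like this. This is only a sketch: the intrinsic spellings are hypothetical, following this thread's naming, with VL implicit after each vsetvl:

void saxpy(size_t n, float a, const float *x, float *y) {
  while (n > 0) {
    size_t vl = vsetvl_f32m1(n);       // vsetvli: VL = min(n, VLMAX)
    vfloat32m1_t vx = vle_v_f32m1(x);  // loads/stores use the current VL
    vfloat32m1_t vy = vle_v_f32m1(y);
    vy = vfmacc_vf_f32m1(vy, a, vx);   // y += a * x, elementwise
    vse_v_f32m1(y, vy);
    n -= vl; x += vl; y += vl;         // the loop is its own tail
  }
}

On the last iteration VL simply shrinks to whatever is left, so no separate tail loop or mask is needed.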

rdolbeau commented 4 years ago

OK, I'm a bit rusty when it comes to pure assembly (that's why we're on the 'intrinsics' documentation), but as I read it: either the saxpy.s code is broken, or vsetvl is supposed to be very accommodating - which is my question: is it? What is guaranteed to be possible?

Because in that assembly, if vsetvl only ever returns even values in a4 (think of an implementation where pairs of FP32 are handled together in the FP64 pipeline), and n is odd (so a0 starts odd), then a0 will never reach 0 and the loop never ends... the only way to avoid that is the guarantee that vsetvl will eventually return 1 (and never returns something bigger than its input) so the loop can be its own tail...

nick-knight commented 4 years ago

@rdolbeau: I'm sorry but I'm still not grasping your concern. I know that that SAXPY code works, so it must be that vsetvli is "very accommodating".

For example, if you invoke this code with n == 3 (== a0) on a machine with VLEN == 128 (i.e., 4 floats), then vl (== a4) will be set to 3, and the loop will only run once: the sub in line 21 will set a0 to 0 and the bnez in line 28 will fall through.

One observation --- concerning performance, not correctness --- is that on the last iteration of this loop, five instructions (lines 21, 22, 23, 27, 28) could be avoided. Having a special tail-case would avoid this, at the cost of an increase in code-size. In an application domain where most of the SAXPY calls have "small" AVL like in my example, it could be worth optimizing for this case. When software-pipelining a loop, this optimization is easily implemented in the epilogue. Our hope is that intrinsics programmers don't have to worry about this: they just write normal for-loops and let the compiler's loop-nest optimization framework do the right thing.

In case you're curious, the rules for how vsetvl{i} converts AVL to VL are documented here: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#62-constraints-on-setting-vl The particular example I sketched is an example of "constraint 1".

jim-wilson commented 4 years ago

The saxpy code Nick gave also works if N is zero. VL gets set to zero in that case. The vector instructions do nothing, and the non-vector instructions are harmless. The branch at the end fails and the function returns without doing any work. In that case, you wasted maybe 10 cycles executing 10 instructions that did nothing. But if you add a test for zero and early exit, you are adding about 3 cycles to every saxpy call. Suppose N is only 0 once every one million calls. Then the extra 3 cycles for the early exit means you are wasting 2999997 cycles every million calls. If you don't add the test and early exit, then you are only wasting 10 cycles every million calls. Adding the test and early exit also increases code size. So avoiding the special case both reduces code size and improves performance, for the normal case where N will almost never be 0.

The hardware design for SIMD machines is different from RVV machines. You seem to be making the assumption that if you have four-element-wide vector registers, then you must have four ALUs. That is how SIMD machines are designed. But it isn't necessary for RVV machines. You can have a four-wide vector register and a single ALU; you just feed the elements into the ALU one at a time. And if VL is less than 4, then you stop when you have sent VL elements through the ALU. If you take a trap before you finish, then you restart at vstart and end at VL. This allows for smaller, cheaper implementations for embedded parts with the V extension. If you look through the rvv instruction list, you will see that there aren't any instructions that operate across elements in such a way that you would need more than one ALU to implement them. Of course, if it is performance you care about, then you can certainly have four ALUs and do four operations at a time across your four-wide vector register, suppressing (or zeroing?) writes for inactive elements. It just isn't a requirement to implement it that way.

There is an article that David Patterson wrote that explains how rvv works and compares it to SIMD architectures. It mentions that the RISC-V vector extension was designed so that no loop bookkeeping code is required. Maybe this helps? https://www.sigarch.org/simd-instructions-considered-harmful/ This is for an older version of the rvv spec, but the basic instructions are still the same.

rdolbeau commented 4 years ago

@knightsifive I had no concern, until in his first comment @jim-wilson said "I think we should discourage people from trying to modify a VL that is obtained from a vsetvl intrinsic". That statement, to me, implies that there is some problem in deriving alternate values of VL and feeding them back into vsetvl, otherwise why the restriction?

As far as I can tell, the issue is a communication problem. We're in an "intrinsics" context, in a high-level language. It seems to me @jim-wilson is confused and thinks we're talking about modifying the hardware register VL (if not, then I'm lost as to why he made that statement...). We're not; the high-level interface doesn't let us access it, and rightfully so. We only access C variables and compiler-interpreted pseudo-functions (a.k.a. 'builtins' or 'intrinsics').

To me this:

vop_vv_type_vl(a, b, gvl)

is just a shortcut (I've used the expression 'syntactic sugar' before...) for this:

oldvl=readcurrentvlforcurrentlmulandsew(); // I assume something like that is possible, though in practice the user probably knows what was used before
tempvl=vsetvl(gvl, currentsew, currentlmul);
vop_vv_type(a, b);
oldvl=vsetvl(oldvl);

(again - in both of these, 'a', 'b', 'oldvl', 'gvl', 'tempvl' are C variables [a & b are vector-typed, of course], not hardware registers).

Obviously, for this to work as expected by the user, the requested gvl has to be 'legit' - that is, tempvl == gvl. As long as gvl <= oldvl, from the 6.2 you quoted, this code will work exactly as the user wants. In fact, as long as gvl <= VLMAX, this will work just fine.

My example code in Chacha20 is slightly different. It manipulates not only the VL but also the SEW, because it's data reinterpretation. The idea is something like:

vint32m1_t a, a2, b, c;
vint64m1_t inc;
int gvl = vsetvl(FIXEDSIZEBLOCK, SEW=32, LMUL=1);
(...)
a =  vsrl_vv_i32m1(b, c, 13);
a2 =  vsll_vv_i32m1(b, c, 19);
a = vor_vv_i32m1(a, a2); // emulate 32-bits rotation
a = vreinterpret_vi64m1_i32m1(vadd_vv_i64m1_vl(vreinterpret_vi32m1_i64m1(a),inc,gvl/2));
(...)

Most operations are 32-bit, but at some point some data are interpreted as 64-bit. So we need a way to execute a small number of instructions with a different SEW/VL combination (hope I'm getting the terminology right...), but one that is highly likely to be legit, as it's using the exact same amount of data (bits) in the registers - the product SEW*VL is constant. The only problem is if the 'gvl' above is odd, which is easy to test for. And it won't be, as long as FIXEDSIZEBLOCK is a) even and b) <= VLMAX, as in this case we're guaranteed that gvl == FIXEDSIZEBLOCK (again, 6.2 constraint 1).

@jim-wilson I'm sorry, but I've no idea why you bring up execution width; our discussion is about the high-level [well, not that high ;-)] language semantics. Hardware implementation details are only relevant if they induce semantic restrictions on the high-level language. We only care whether something we write is legal and will produce the expected result at this point - the actual performance is obviously going to depend on the exact behavior of the micro-architecture. This may lead to some software design choices being more or less efficient, but that will become relevant when hardware (or at least cycle-accurate simulation) is here.

rofirrim commented 4 years ago

Hi Nick,

Personally, I have not found a use-case where I wanted/needed to be able to input AVL to an intrinsic. (@rdolbeau: I'd love to learn more about the application you have in mind.) In my opinion, passing in AVL adds unnecessary boilerplate and makes writing intrinsics code more error-prone, ultimately increasing the cognitive load on the programmer. I only plan to use the non-AVL-argument versions.

Under this approach, I guess you view VL as a C global variable that is used as an implicit operand of almost all the vector intrinsics.

Have you considered what would be the ideal behaviour when calling functions? A straightforward mapping of VL as a C global variable seems to suggest that returning from a function call would require (for correctness at least) another call to the vsetvl intrinsic (just in case the callee or an indirect one has changed VL because it is implemented using vector operations).

Calling vector functions in vector code may seem an exotic thing but it is something that OpenMP supports[1]. I know that here we're talking about intrinsics, but a programmer might want to achieve similar functionality using intrinsics.

Explicit vector length, surprisingly, makes this very clear. After a function call (and if the compiler can't prove the VL hasn't changed) a new vsetvli will be emitted. Under the model of VL as a global variable, the user must do this. It may not be a big deal for an expert programmer, but it looks to me like one less pitfall (at the expense of more typing, that is).

[1] Section 5.1 of OpenMP Examples at https://www.openmp.org/wp-content/uploads/openmp-examples-5.0.0.pdf

rdolbeau commented 4 years ago

@rofirrim Agreed. Not only that, but we also have 'omp declare variant' to allow auto-vectorized loops to call a hand-written vector variant of a scalar function (either because the compiler fails to auto-vectorize said function, or because it's part of a tuned library, ...). Such a hand-written vector variant is likely to be written using intrinsics. The toolchain & the vector ABI will need to support that as well...

kito-cheng commented 4 years ago

I think vl and vtype, just like other registers, need the ABI to specify whether they are callee-saved or caller-saved. If we decide on caller-saved, then the compiler should save/restore them around calls; if callee-saved, then save and restore them in the prologue/epilogue.

jim-wilson commented 4 years ago

On Thu, Apr 16, 2020 at 11:40 PM Romain Dolbeau notifications@github.com wrote:

Most operations are 32-bit, but at some point some data are interpreted as 64-bit. So we need a way to execute a small number of instructions with a different SEW/VL combination (hope I'm getting the terminology right...), but one that is highly likely to be legit, as it's using the exact same amount of data (bits) in the registers - the product SEW*VL is constant. The only problem is if the 'gvl' above is odd, which is easy to test for. And it won't be, as long as FIXEDSIZEBLOCK is a) even and b) <= VLMAX, as in this case we're guaranteed that gvl == FIXEDSIZEBLOCK (again, 6.2 constraint 1).

The problem here is that you have designed intrinsics that can only work if the user checks a number of preconditions first. The user has to know VLMAX and has to check AVL against VLMAX before calling vsetvl intrinsics, etc. This code isn't portable and isn't safe. And this is not how the architecture was designed to be used. If someone forgets to check a precondition, or didn't know that such tests were necessary, then their code fails in mysterious ways. I want intrinsics that always work, without preconditions, that encourage people to program the machine the way it was designed to be programmed. We can get that if we say that the vl output of the vsetvl intrinsic can't be modified, can't be passed back into a vsetvl intrinsic, and that code can't make any assumptions about the value.

That leaves open the question of how to get exactly what you want. I haven't thought enough about your problem to solve that. But maybe we can have a default safe mode where vl can't be modified and an optional unsafe mode where vl can be modified but programmers must manually add checks to verify that their code can work.

Jim

rdolbeau commented 4 years ago

@jim-wilson I think we have a completely different view of intrinsics. Of course they're not portable: they're tied to one extension of one architecture! And they're not as easy to use and safe as pure high-level code, because they're designed to give [even] lower-level access, so that you can use features the high-level code wasn't designed to handle. Sometimes you can shoot yourself in the foot, yes. But then, so can you with C pointers or a truckload of other features. I have seen many super-savvy customers not coming anywhere close to intrinsics, because it's too time-consuming and difficult to get right and fast. They want tools, libraries, and failing that, they used to call me ;-) ;-) ;-)

Intrinsics are basically a way to reduce the complexity of assembly programming. The compiler handles the calling convention. It handles the register allocation and the spilling, etc. Ideally for me, it insulates you from some of the irregularity of the architecture by 'orthogonalizing' it. The only question is how much insulation you need - at some point, the insulation is too thick and may cause performance concerns, because it hides the truth... To go back to the egregious SSE example from https://github.com/sifive/rvv-intrinsic-doc/issues/9#issuecomment-615183129, Intel doesn't offer 'insulation' from that problem - there is no 'semantic' intrinsic for 8-bit shifts. It's fairly easy to emulate, so Intel lets you do it, perhaps so that you know it's not as fast as a hardware operation. On the other hand, they let you call the vectorized cosine as if it were an intrinsic (and happily document the SVML in the intrinsics guide https://software.intel.com/sites/landingpage/IntrinsicsGuide), because that's not something the average developer can do (fast, accurate range reduction is hard).

And intrinsics are not used to do it "the way it was designed to be programmed". They're used to push the architecture to its limits, either performance or semantics, so that it does things it wasn't designed to do in the first place.

What you describe has a lot of value for developers, but it's not called 'intrinsics'. It's called 'OpenMP SIMD' (and even with that, you can easily mess code up if it wasn't parallel-safe to begin with).

Edit: To clarify that last point, I'm thinking about the various clauses (uniform, linear, etc.) to 'omp declare simd', which allow a reasonably safe way to direct vectorization of functions and loops so that they are more efficient without writing them by hand (though sometimes ultimately the compiler is not 'good enough', and you fall back on a hand-made implementation with 'declare variant' and use intrinsics even with OpenMP ;-) )

nick-knight commented 4 years ago

Hi @rdolbeau , thanks for the cool code example!

vint64m1_t inc;
int gvl = vsetvl(FIXEDSIZEBLOCK, SEW=32, LMUL=1);
(...)
a =  vsrl_vv_i32m1(b, c, 13);
a2 =  vsll_vv_i32m1(b, c, 19);
a = vor_vv_i32m1(a, a2); // emulate 32-bits rotation
a = vreinterpret_vi64m1_i32m1(vadd_vv_i64m1_vl(vreinterpret_vi32m1_i64m1(a),inc,gvl/2));
(...)

First, regarding inputting AVL (here, gvl/2) as an argument: as you've clearly explained, this is just syntactic sugar, and you could easily implement this with an extra invocation (or two) of a "SETVL" intrinsic. (This is what the generated assembly will look like, anyway.) I also agree that it's possible to add "safety checks" to make this work.

Where I disagree is that this type of syntactic sugar belongs in "version 1" of our intrinsics library. I would prefer to start with a lower-level approach, roughly mapping 1-to-1 with assembly instructions, and then build on top of that in future iterations. I'm perfectly OK with the idea of adding higher-level features: however, since the V-extension is already rather complicated, I'd like to keep the starting point as simple as possible.

Second, regarding your specific code example, I would actually consider a different approach, implementing the 64-bit addition using the add-with-carry instructions. (There are a few different ways of going about this.) A practical advantage of this approach is that it would also work on platforms with ELEN == 32, whereas your approach requires ELEN >= 64. Additionally, a (possibly huge) performance advantage is that this approach generalizes to LMUL > 1, whereas your approach (and this kind of type-punning more generally) exposes implementation-dependent behavior. In particular, your code will get the wrong answer if you tried to increase LMUL past 1 on a machine with SLEN == 32. On the other hand, if you are tuning your library for a particular machine with SLEN and ELEN both > 32, then I suspect (the LMUL = 8 generalization of) your program would ~be competitive with~ outperform any of the approaches I have in mind.

Edit: With the addition of "fractional" LMUL in RVV 0.9, the vector register file layout for SLEN < VLEN is expected to change substantially, and my earlier comment regarding @rdolbeau's Chacha type-punning needs to be revised:

Additionally, ~a (possibly huge) performance advantage is that this approach generalizes to LMUL > 1, whereas~ your approach (and this kind of type-punning more generally) exposes implementation-dependent behavior. In particular, your code will get the wrong answer if ~you tried to increase LMUL past 1 on a machine with SLEN == 32~ SLEN < VLEN. On the other hand, if you are tuning your library for a particular machine with ~SLEN and ELEN both > 32~ SLEN = VLEN, then I suspect (the LMUL = 8 generalization of) your program would outperform any of the approaches I have in mind.

This change is contentious and may be revised further in committee.

nick-knight commented 4 years ago

@rofirrim @rdolbeau : Good questions regarding calling conventions! Everyone certainly wants this to work well with multithreading, including OpenMP, so we'll have to address this.

There's been quite a bit of discussion about RVV calling conventions already: I'll have to defer to the experts, including @kito-cheng and @ebahapo (@kito-cheng's earlier post mentioned some of the salient details). Is there a public document regarding RVV calling conventions? Perhaps this conversation belongs over at https://github.com/riscv/riscv-elf-psabi-doc.

rdolbeau commented 4 years ago

@knightsifive I completely agree that keeping things simple is better, and that the more complex aspects (semantic, perhaps even VL-taking intrinsics) don't belong in version 1 of the /implementation/. However, I think they very much should be part of even the first version of the /specifications/, to make sure the naming scheme and behavior won't cause issues later...

For the OpenMP aspect, it's not so much the multithreading part of OpenMP, but the SIMD part, i.e. when you have:

foo() {
for (i = 0; i<n;i++) bar();
}

with bar() in a different file from foo(), OpenMP SIMD will let you do:

foo() {
#pragma omp simd
for (i = 0; i<n;i++) bar(a, b, i);
}

and

#pragma omp declare simd
bar(x, y, n)

Which will generate a vectorized version of bar() that can be called efficiently from the vectorized version of the loop in foo(). This requires a vector ABI to efficiently pass parameters, including with the various vectorization-tuning clauses (uniform, linear, ...).

rofirrim commented 4 years ago

Hi @kito-cheng,

I think vl and vtype, just like other registers, need the ABI to specify whether they are callee-saved or caller-saved. If we decide on caller-saved, then the compiler should save/restore them around calls; if callee-saved, then save and restore them in the prologue/epilogue.

So it looks like, from a C programmer's point of view (that is, oblivious of ABIs or registers), the value of VL is preserved across function calls. So in practice, for that same C programmer, it behaves like an implicit parameter/local variable of the current function, doesn't it? Perhaps I got you wrong here.

Are we sure we want to expose this detail to the programmer? Does VL have to be preserved and then restored across function calls when using intrinsics? If we determine the ABI has to change (e.g. VL is not preserved anymore), wouldn't that turn previously valid programs (which didn't have to restore VL) into invalid ones? A change in the other direction would not run into correctness issues, but now the code may be restoring VL values that were already preserved (i.e. is inefficient).

Explicit vector length can get away with these issues because we don't have to choose between making VL either a global variable (not preserved upon returning from a function call, like fcsr) or an implicit function parameter/local variable (preserved upon returning from a function call, say like call-preserved GPRs). It is almost like another operand of the vector operation.

kito-cheng commented 4 years ago

Hi @rofirrim

Making VL and VTYPE callee-save or caller-save is not exposing the detail to the programmer; it just specifies the behavior/expectation of VL and VTYPE after a function call. If we don't write down any word in the ABI, what behavior should the programmer expect for VL after a function call? Can VL be used directly after a function call, or does it need to be saved/restored around the call?

Treating VL like a global variable means VL might be changed/clobbered after a function call, which is caller-save, and it won't break existing assembly implementations.

But, of course, it will break existing programs if we define VL and VTYPE as callee-save, so I think defining them as caller-save is OK and compatible with existing vector programs.

Explicit vector length can get away with these issues because we don't have to choose between making VL either a global variable (not preserved upon returning from a function call, like fcsr) or an implicit function parameter/local variable (preserved upon returning from a function call, say like call-preserved GPRs). It is almost like another operand of the vector operation.

I think we still need to specify that for code gen. Since we want to eliminate redundant vsetvli, we must specify the behavior of vl and vtype after a function call. Here is an example to briefly show why the two things are related:

Consider we are trying to eliminate redundant vsetvli:

avl = vsetvli_32m1 (n);
va = vadd_i32m1_vl (vb, vc, avl);
vf = vadd_i32m1_vl (vd, ve, avl);
vg = func_call (va, vf);
vi = vadd_i32m1_vl (vh, vg, avl);

Code gen, using symbolic operand instead of real register name here:

vsetvli avl, n, e32, m1
vsetvli avl, avl, e32, m1
vadd va, vb, vc
vsetvli avl, avl, e32, m1
vadd vf, vd, ve
call func_call
vsetvli avl, avl, e32, m1
vadd vi, vh, vg

The 2nd and 3rd vsetvli can obviously be eliminated, but the last one can't if we define vl and vtype as caller-save/a global variable.

vsetvli avl, n, e32, m1
vadd va, vb, vc
vadd vf, vd, ve
call func_call
vsetvli avl, avl, e32, m1
vadd vi, vh, vg

But it's possible to remove it if we define vl and vtype as callee-save; the price is that we must save/restore them in the prologue/epilogue.

vsetvli avl, n, e32, m1
vadd va, vb, vc
vadd vf, vd, ve
call func_call
vadd vi, vh, vg

So I think the local-variable/global-variable view is a higher-abstraction-level view, which is different from the ABI-level view; the two are not conflicting concepts.

FCSR is a good example since the situation is similar to vl and vtype, but also a bad example because the LLVM and GCC implementations didn't model it precisely for the RISC-V back-end :P

Hsiangkai commented 4 years ago

Hi @rdolbeau , thanks for the cool code example!

vint64m1_t inc;
int gvl = vsetvl(FIXEDSIZEBLOCK, SEW=32, LMUL=1);
(...)
a =  vsrl_vv_i32m1(b, c, 13);
a2 =  vsll_vv_i32m1(b, c, 19);
a = vor_vv_i32m1(a, a2); // emulate 32-bits rotation
a = vreinterpret_vi64m1_i32m1(vadd_vv_i64m1_vl(vreinterpret_vi32m1_i64m1(a),inc,gvl/2));
(...)

First, regarding inputting AVL (here, gvl/2) as an argument: as you've clearly explained, this is just syntactic sugar, and you could easily implement this with an extra invocation (or two) of a "SETVL" intrinsic. (This is what the generated assembly will look like, anyway.) I also agree that it's possible to add "safety checks" to make this work.

Where I disagree is that this type of syntactic sugar belongs in "version 1" of our intrinsics library. I would prefer to start with a lower-level approach, roughly mapping 1-to-1 with assembly instructions, and then build on top of that in future iterations. I'm perfectly OK with the idea of adding higher-level features: however, since the V-extension is already rather complicated, I'd like to keep the starting point as simple as possible.

Second, regarding your specific code example, I would actually consider a different approach, implementing the 64-bit addition using the add-with-carry instructions. (There are a few different ways of going about this.) A practical advantage of this approach is that it would also work on platforms with ELEN == 32, whereas your approach requires ELEN >= 64. Additionally, a (possibly huge) performance advantage is that this approach generalizes to LMUL > 1, whereas your approach (and this kind of type-punning more generally) exposes implementation-dependent behavior. In particular, your code will get the wrong answer if you tried to increase LMUL past 1 on a machine with SLEN == 32. On the other hand, if you are tuning your library for a particular machine with SLEN and ELEN both > 32, then I suspect (the LMUL = 8 generalization of) your program would ~be competitive with~ outperform any of the approaches I have in mind.

Agree with @knightsifive. If you want to change SEW and VL for your operations, you should invoke the vsetvl intrinsics yourself. vsetvl is the only intrinsic that gives the value for "the number of elements to be updated by a vector instruction." I prefer not to provide an option to let users modify vl by arbitrary arithmetic.

rofirrim commented 4 years ago

Hi @kito-cheng,

Treating VL like a global variable means VL might be changed/clobbered after a function call, which is caller-save, and it won't break existing assembly implementations.

That is the part that confuses me. Defining it either caller-save or callee-save would preserve it across function calls. I am assuming "caller-save" here means "the caller must preserve it". Perhaps you mean something like "the caller must preserve it if it wants to continue using the old value", which is what a compiler would do. It is less clear to me what a user should do.

This is where I believe the model of "global variable" becomes problematic: a programmer would always have to assume VL has been clobbered after a function call (after all, VL is a global variable, isn't it?). If this is not the case, then VL is not a true "global variable", as it seems to behave like something else.

My concern is that this ABI detail could leak onto the user of the C intrinsics. I can't help but wonder if the culprit is looking at VL as an implicit entity at this level rather than just another input operand of the vector operation (which, due to constraints of the architecture, needs to go in a register, but this is fine!).

rofirrim commented 4 years ago

Agree with @knightsifive. If you want to change SEW and VL for your operations, you should invoke vsetvl intrinsics by yourself. vsetvl is the only intrinsic to give the value for "the number of elements to be updated by a vector instruction." I prefer not to provide an option to let users modify vl by arbitrary arithmetic.

I agree that vsetvl will be the main way to change vl. But note that "Fault-Only-First Loads" also change vl.

A fact that, unfortunately, won't be obvious at all from an intrinsic API with an implicit VL :confused:
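
For instance (the intrinsic spelling here is invented just to illustrate):

size_t vl = vsetvl_i32m1(n);      // VL = min(n, VLMAX)
vint32m1_t v = vleff_v_i32m1(p);  // fault-only-first load: a fault past
                                  // element 0 silently reduces VL
// every later implicit-VL intrinsic now operates on the reduced VL

With an explicit VL argument, the shrunken length has to be handed back to the programmer, so the effect is at least visible.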

rofirrim commented 4 years ago

I prefer not to provide an option to let users modify vl by arbitrary arithmetic.

I'd also be conservative here.

But note that explicitly computing a smaller VL2 (based on a previous VL1 obtained by vsetvl with SEW1, LMUL1 that defines VLMAX1) and using it in an operation with a SEW2, LMUL2 configuration that doesn't reduce the VLMAX1 should be fine, if I read the spec correctly.

Something like this seems correct to me.

VL1 = vsetvli(..., SEW1, LMUL1); // (SEW1, LMUL1) define VLMAX1 
VL2 = F(VL1); // F is such that 0 <= VL2 <= VL1
vop_SEW2_LMUL2(..., VL2); // this is OK if (SEW2, LMUL2) define VLMAX2 >= VLMAX1

I think this is a reasonable assumption. What do you think?

Assuming my interpretation of the spec is correct, then I agree not all general arithmetic is correct, but a subset of it might be valid. However, I don't expect the compiler to devote any time to checking that the arithmetic is reasonable (in other words, "no diagnostic required").

rdolbeau commented 4 years ago

@rofirrim I agree with you about the compiler not needing to check; it's the user's job. And I don't really get the whole "no arithmetic on VL" idea.

When you write this:

request = estimate_vector_needed(parameter);
vl = vsetvl(request);

What difference does it make whether 'parameter' was derived from an older vl or not? You still call vsetvl() and get a usable result...

The only difference when you pass the VL as a parameter of the intrinsic is that you don't have an explicit result; the VL save/restore (I insist on the restore!) is implicit. When passing X as a VL to an intrinsic, it's the user's responsibility to either:

a) make sure vsetvl(X) == X, so the code does what it looks like it does; or
b) make sure the code works nonetheless; or
c) not use the code in circumstances where it would break.

Also known as, "business as usual" :-)

@kito-cheng For the ABI, an important question is how to make constructs like this work:

/* file 1 */
foo() {
#pragma omp simd simdlen(XYZ)
for (i = 0; i<n;i++) bar(a, b, i);
}
/* file 2 */
#pragma omp declare simd
bar(x, y, n)

In this case, the settings for bar() must be inherited from the settings in the loop in foo(), otherwise things will break... the ABI/calling convention has to make sure this works. Same goes for masking or explicitly vectorized functions. As an example, the current state of things for SVE is documented here: https://developer.arm.com/docs/101458/2000/vector-math-routines/interface-user-vector-functions-with-serial-code.

Edit: An example for 'declare simd' / 'declare variant' in https://github.com/HydroBench/Hydro/blob/openmp-declare/HydroC/HydroC99_2DMpi/Src/riemann.c

kito-cheng commented 4 years ago

Hi @rofirrim : Oh, do you mean a global variable is more like nobody-saves, instead of callee-save or caller-save? If so, I think it would be a burden to the implicit-VL model, and a function call would be an optimization barrier for all vector operations.

For the implicit-VL model, nobody-saves would mean the user must write vsetvli again after each function call; that's exposing more detail to the programmer, to me:

avl = vsetvli_32m1 (n);
va = vadd_i32m1_vl (vb, vc);
vf = vadd_i32m1_vl (vd, ve);
vg = func_call (va, vf, avl);
avl = vsetvli_32m1 (avl);  // User must reset VL here, because VL might changed after call and compiler won't save.
vi = vadd_i32m1_vl (vh, vg);

kito-cheng commented 4 years ago

Hi @rdolbeau :

I think it could be resolved by adding an extra AVL argument, and an explicit vsetvli call at the beginning of the function. That vsetvli is hard to remove, since we never know what value is in vtype if there are mixed-width arguments [see example 1 below]; it is like SVE using an extra argument to pass the mask in the example.

Maybe there is some point I didn't consider? Do you mind sharing your thoughts on this?

example 1:

vint64m2_t foo(vint64m2_t, vint32m1_t, size_t avl)
{
  // vtype = SEW=64/LMUL=2 or SEW=32/LMUL 1 ??
}

Another problem is how to pass the AVL value: directly in the VL register, or in a GPR like other integer arguments. But that's a little out of scope for this issue (I mean the original issue, how to pass vl as an argument of an intrinsic), so I think there should be a new issue for the ABI discussion.

Hsiangkai commented 4 years ago

I prefer not to provide an option to let users modify vl by arbitrary arithmetic.

I'd also be conservative here.

But note that explicitly computing a smaller VL2 (based on a previous VL1 obtained by vsetvl with SEW1, LMUL1 that defines VLMAX1) and using it in an operation with a SEW2, LMUL2 configuration that doesn't reduce the VLMAX1 should be fine, if I read the spec correctly.

Something like this seems correct to me.

VL1 = vsetvli(..., SEW1, LMUL1); // (SEW1, LMUL1) define VLMAX1 
VL2 = F(VL1); // F is such that 0 <= VL2 <= VL1
vop_SEW2_LMUL2(..., VL2); // this is OK if (SEW2, LMUL2) define VLMAX2 >= VLMAX1

I think this is a reasonable assumption. What do you think?

Assuming my interpretation of the spec is correct, then I agree not all general arithmetic is correct, but a subset of it might be valid. However, I don't expect the compiler to devote any time to checking that the arithmetic is reasonable (in other words, "no diagnostic required").

I don't think there is such a restriction on AVL. My concern is that some vl values are set by vsetvl written by users, and some are set by vsetvl generated by the compiler. I am not sure whether that is good practice or not. I think it is simpler to restrict setting vl to the vsetvl intrinsics. At least, it is consistent with the V specification.

Another point I want to make is that reinterpreting data in vector registers is dangerous. As I mentioned in the mail thread, SLEN will change the layout of data. In addition, the V specification permits SEW > SLEN.

The example is

VLEN=128b, SLEN=32b, SEW=32b, LMUL=4

Byte          F E D C B A 9 8 7 6 5 4 3 2 1 0
v4*n                C       8       4       0   32b elements
v4*n+1              D       9       5       1
v4*n+2              E       A       6       2
v4*n+3              F       B       7       3

Suppose you want to operate on the data in the same registers with a different SEW - in this case, four V registers. We set SEW to 64 and use vl/2 as the new AVL. This is similar to Romain's intention. The V operations will operate on (4, 0), (5, 1), (6, 2), ... instead of (1, 0), (3, 2), (5, 4), ....

There is no such problem for LMUL = 1. To reinterpret data, you will be restricted to programming under LMUL = 1. You can refer to https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#42-mapping-with-lmul--1.

rdolbeau commented 4 years ago

@Hsiangkai Reinterpreting data is not 'dangerous', it's 'difficult' :-) As long as the behaviour is deterministic (and it is), it's a matter of the developer not messing things up. Which is their problem, not yours/ours. A low-level interface like intrinsics is there to give access to the hardware capabilities; it's not meant to be a safe environment. Higher-level abstractions will offer the safe environment.

And I disagree that reinterpretation should be limited to LMUL==1; in fact, the documentation you pointed me at gave me an idea. If you look at the chacha20 example that I pointed at, there's a final stage that, in the current implementation (it can be done with strided loads/stores instead, with a different performance profile), starts by transposing 4x4 blocks of 32-bit data in 4 registers, from

v1: ABCD ....
v2: EFGH ...
v3: IJKL ...
v4: MNOP ...
v5: QRST ...
v6: UVWX ...
...

to

v1: AEIM ...
v2: BFJN ...
v3: CGKO ...
v4: DHLP ...
v5: QU... ...
v6: RV... ...
...

In AVX-512, where the length of the vector is known and the instructions are reasonably fast, the four 4x4 128-bit blocks are then 'transposed' so the first register holds AEIMQU... (what was originally the first 'column' of 32-bit data), which you can store densely in memory. In SVE or V, this 'full transpose' is skipped (the register may be too small, and the instructions to manipulate 128-bit blocks don't exist), and the 128-bit blocks are directly 'scatter'ed to memory. However, from the V specifications, it should be possible to partially 'transpose' the data by 'coalescing' 4 registers (VL=X, SEW=32, LMUL=1) into a single register group of (VL=4*X, SEW=32, LMUL=4). Then, if SLEN=128 bits as in the spec's example (unfortunately, again, that's an implementation detail :-( ), interleaving means that a single store would do some of the de-interleaving and storing of the data, perhaps easier and faster than either current solution (I would need LMUL=16 to actually do everything at once with a single dense store, I think, if you needed a use case for LMUL=16 & SLEN=128 :-) ).

Whatever the hardware is capable of doing, ultimately someone will find a use for it that the designers didn't think about. The tools shouldn't prevent them from doing it...

nick-knight commented 4 years ago

@rdolbeau: As you've described, certain SLEN-dependent permutations are achievable by register-group fusion/fission. I think we agree that these techniques are not "portable", and also that this isn't a convincing argument that intrinsics programmers should be barred from using them.

The main challenge I see is how to expose these techniques in a way that's humane to our users. In particular, to fuse smaller register groups into a larger group, the input registers must be appropriately numbered: e.g., we can fuse v10 and v11 into [v10, v11], but v9 and v10 cannot be similarly fused. The register allocator needs to be made aware of this, else vector-vector copies will be required, making these techniques much less desirable. I anticipate that supporting these techniques will involve adding fusion/fission intrinsics and nontrivial extensions to the register allocator and (perhaps) the type system. However, this is pretty far outside my research area.

ebahapo commented 4 years ago

vl should be callee-saved. It is not practical for every non-leaf function to save and restore it just in case. Rather, a function that modifies vl should save the previous value at entry and restore it at exit.

It is not like the fcsr is typically used, as its scope is global.

rofirrim commented 4 years ago

vl should be callee-saved. It is not practical for every non-leaf function to save and restore it just in case. Rather, a function that modifies vl should save the previous value at entry and restore it at exit.

So, from a C programmer's view (again, oblivious of ABIs and registers), vl behaves like an implicit parameter (by value) of the current function, which would be implicitly passed to all the v-ext intrinsics?

Something like this

void bar (/* size_t vl_bar */) {
  // Intentionally setting vl
  vl_value = vsetvl(avl, sew, lmul);
  // vsetvli can only change the local vl_bar
  vadd(v1, v2 /*, vl_bar */ );
}

void foo(/* size_t vl_foo */) {
 // intentionally _not_ setting vl
 v3 = vadd(v1, v2 /*, vl_foo */);
 bar(/* vl_foo */);
  // vl_foo is the same as before the call to bar
 v5 = vadd(v3, v4 /*, vl_foo */); 
}

I know this might be at odds with "vl is a register", but by preserving it across function calls, I understand this is the behaviour we're actually giving it (for a C programmer that uses the implicit intrinsics).

rdolbeau commented 4 years ago

@knightsifive The design with the *LEN/LMUL/... is not simple, so exploiting it from the higher-level language is not going to be easy. And we're discussing documentation/specification, so as long as it's "implementable", ease of implementation is not our concern ;-)

A bit more constructively:

a) it is sufficiently complicated that an interface that isn't homogeneous in behaviour across *LEN/LMUL/... might be acceptable if homogeneous is not achievable. Then I would suggest that every intrinsic converting between types spew out a warning (and perhaps the associated -fno-warn-unsafe-v-builtins for the most daring users)

b) when it comes to register coalescing rules, it's a performance problem, not a semantic problem. _mm512_4fmadd_ps has the same issue: the inputs must be consecutive registers. The high-level intrinsic hides that; it's the compiler's problem to either properly allocate the registers (yeah!) or to add the necessary copies (darn...). The compiler doesn't have to be perfect straight away; this will be filed as an enhancement issue.

When it comes to the fusion/fission intrinsics (which I will boldly call 'the nuclear option' :-) ), it seems to me the first question is: how much specificity should be baked in? In https://github.com/sifive/rvv-intrinsic-doc/issues/9#issuecomment-615100875 and https://github.com/sifive/rvv-intrinsic-doc/issues/9#issuecomment-615171839, I suggested 'hiding' the XLEN from the users by having '1-1' intrinsics (where semantic == XLEN) and 'semantic' intrinsics (to support semantic != XLEN). I'm not sure that's doable here, because unlike XLEN, we obviously don't know *LEN at compile time, at least not by default (some compilers will eventually add options to 'suggest' the ELEN and SLEN to compile for, and throw away compatibility... but we're not there yet).

If we follow the current naming rules proposal, then the two things that would already be baked into the names are the LMUL and the SEW, which are user-selected. To fully determine the data organisation in registers, we also need to know the SLEN and ELEN, which are implementation-dependent. There are four options here for adjusting the naming to clarify the semantics:

a) add neither
b) add SLEN
c) add ELEN
d) add both

I think b) and c) don't make much sense, as half-specifying things doesn't help. So what's left is:

a) vint64m4_t vgroup_v_i64m4(vint64m1_t, vint64m1_t, vint64m1_t, vint64m1_t); and vint64m1_v_ungroup1_i64m1(vint64m4_t, int index);

d) vint64m4_t vgroup_E64S128_v_i64m4(vint64m1_t, vint64m1_t, vint64m1_t, vint64m1_t); and vint64m1_v_ungroup1_E64S128_i64m1(vint64m4_t, int index);

The second should be semantically fully specified (pairs of 64-bit integer values are interlaced, because we know ELEN=64 and SLEN=128). However, the question remains: what really happens? Does the interleaving happen, or is it just reinterpreting?

i.e. assuming VLEN=256, case d) above:

int64_t x[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
vint64m1_t a = LD(&x[0]), b = LD(&x[4]), c = LD(&x[8]), d = LD(&x[12]);
vint64m4_t e = LD(x);
vint64m4_t f = vgroup_E64S128_v_i64m4(a, b, c, d);

Are e and f identical (vgroup does the interleaving), or are they different (vgroup is just a reinterpret function)? The highly qualified name makes it feel like there's some reorganization going on, when the implementation (and use case!) probably wants to be simpler. And the issue remains: what happens on hardware where SLEN != 128? I'm not a fan.

The first case is a lot simpler to deal with, but will behave differently on different hardware. Again assuming VLEN=256 for the sake of simplicity:

int64_t x[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
vint64m1_t a = LD(&x[0]), b = LD(&x[4]), c = LD(&x[8]), d = LD(&x[12]);
vint64m4_t e = LD(x);
vint64m4_t f = group_v_i64m4(a, b, c, d);

There's nothing implied, from my point of view. You had 4 vint64m1_t, now you have a single vint64m4_t; as mentioned above, it might imply some extra copies if the register allocator isn't up to snuff with the fusion. The relationship between e and f is still a bit murky, but anyone who has read the documentation will know that LD() can change behaviour depending on ELEN/SLEN, while to me group_v_i64m4 doesn't - it's just renaming, and there's no implicit interleaving here - what you see is what you get: a group of 4 registers, LMUL=4, and the data in the registers are physically unchanged from what they were before. So doing ST(x,e) will not change the content of x on any hardware, while doing ST(x,f) will change x to {0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15} if SLEN=128, and to {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15} if SLEN=64 (... unless I messed up the interleaving semantics between LD and ST, I said it was complicated ;-) ). ST(x,f) doesn't change x if SLEN >= 256.

And the reinterpret between vint64m1x4_t and vint64m4_t is trivial, it's just two different names for the same physical thing (4 registers), so:

vint64m4_t a = LD(x);
vint64m1x4_t b = vldseg4e(x); // from Zvlsseg
vint64m4_t c = vreinterpret_v64m1x4_i64m4(b); /* just rename, physically c == b always */

means that if SLEN=64 then a == c; otherwise a != c, because the Zvlsseg interleaving is always per-element, not per-SLEN.

Personally, my choice would be case a) and the behaviour I just described, with every single vgroup* and vungroup* (and vreinterpret_mAxB_*mC where B!=1 / A!=C) spewing a warning unless explicitly disabled at compile time, because they will be /dangerous/. Their behaviour is not implementation-dependent (they are just renames, after all), but the data they allow reinterpretation between is organized in an implementation-dependent way, so they allow writing code with implementation-dependent behaviour. Which, again, is dangerous and complicated, but, again, should not be prohibited.

rdolbeau commented 4 years ago

On the original subject, what is this:

avl = vsetvli_i64m2 (n);
(...)
va = vadd_i32m1_vl (vb, vc);

going to execute? The compute intrinsic re-specifies two out of three implicit parameters (SEW, LMUL), but leaves the third (VL) out... It doesn't make much sense to me not to specify the third as well. At least if the vadd() had a VL in it, the intent of the user would be clear...

rdolbeau commented 4 years ago

The more I think about it, the more I think having LMUL/SEW in the intrinsics but not VL isn't coherent, and in fact downright bad. But first, let's have a history lesson (sorry... old timer here ;-) ) to give context to what I mean.

The original designs for intrinsics were for SSE & AltiVec; later came AVX, NEON, etc. All those instruction sets have something in common: they specify the data type in the opcode. So to generate the 'proper' opcode, you have two options:

a) you specify the data type in the name of the intrinsic;
b) you infer the data type from the parameters.

AltiVec chose b) - vec_add() will emit different opcodes depending on parameters.

SSE chose a) - _mm_add_epi16() works on 16-bit values; the parameter isn't typed beyond 'integer' (__m128i). AVX* follows.

NEON chose both - vadd_u16() explicitly works on 16-bit values, and requires vectors of 4 unsigned 16-bit values as parameters (uint16x4_t). SVE follows.

V is different - it doesn't specify the data type in the opcode, beyond 'integer'/'float'. vadd.vv works for all integer data types, and relies on vsew to figure out what to do.

If the intrinsics followed suit, they would look like this:

v_integer_vector a, b, c;
v_float_vector x, y, z;
int avl = vsetvl_i32m1(X);
a = vadd_vv(b, c); // work on a vector of int32_t
x = vfadd_vv(y, z); // might work on a vector of float (we set VL for integer, not float)
int avl2 = vsetvl_i64m1(X);
a = vadd_vv(b, c); // work on a vector of int64_t
x = vfadd_vv(y, z); // might work on a vector of double (we set VL for integer, not float)

The computational intrinsics don't concern themselves with SEW, LMUL or VL, because those are all implicit and set in the vsetvl. A 1-1 mapping becomes very natural, and there's no possibility of ambiguity. It's also a very weak typing system (even weaker than Intel's, which distinguishes between FP32 and FP64). It's basically inline assembly with no added (syntactic) sugar. I'm not a fan - but it's in my opinion the closest to the hardware designers' intent. It would also be quite difficult to use, I believe.

More problematic - what is the size of v_integer_vector or v_float_vector? Without some SEW/LMUL specification, they can represent anything from a single byte (SEW=8, LMUL=1, VL=1) to 8 full vectors (SEW=maximum supported, LMUL=8, VL=maximum supported). I'm not sure how implementable that is - someone might write:

v_integer_vector a, b, c;
int avl = vsetvl_i32m1(X);
a = vadd_vv(b, c); // work on a vector of int32_t
int avl2 = vsetvl_i32m2(X);
a = vadd_vv(b, c); // work on *two* vectors of int32_t, interleaved by SLEN bits...

I'd guess we probably can't abstract an assembly register with a C type at all, because the C type might require multiple assembly registers to be implemented in some cases...

So we may need the LMUL to be visible in the data type, to work around this issue (as both BSC & SiFive did). That pushes the LMUL into the intrinsics (for SiFive; BSC infers it from the data type), so they can work on the data type without overloading. Unfortunately:

v_integer_vector_m1 a, b, c;
int avl = vsetvl_i32m1(X);
a = vadd_vv_m1(b, c); // work on a vector of int32_t
int avl2 = vsetvl_i32m2(X);
a = vadd_vv_m1(b, c); // what happens here?

And we already have the issue of what's going on if there's a discrepancy in LMUL between the vsetvl() and the vadd()...

There's probably some good argument why the SEW has to appear in the intrinsics and data types as well, as both SiFive and BSC put it there :-)

Ultimately, there's one solution in my mind that isn't ambiguous (and is implementable without too much headache since, except for scalability, everything is known about the data type; that's what BSC did so far):

vint32m1_t a, b, c;
a = vadd_vv_i32m1(b, c, requested_vl);

Sure, the requested_vl might not be granted. So what? malloc() can fail, too. Part of the specification is that the user should use vsetvl_i32m1() first to figure out a safe value for requested_vl, and then use it appropriately everywhere. So s/he can do:

int vl_for_i32 = vsetvl_i32m1(MAX_INT);
int vl_for_i64 = vsetvl_i64m1(MAX_INT);
int vl_for_f32 = vsetvl_f32m1(MAX_INT);
int vl_for_f64 = vsetvl_f64m1(MAX_INT);
int vl_for_f64_small_loops_unrolled_4 = vsetvl_f64m4(MAX_INT);

(...)

vint32m1_t a, b, c;
a = vadd_vv_i32m1(b, c, vl_for_i32);
vfloat64m1_t x, y, z;
x = vfadd_vv_f64m1(y, z, vl_for_f64);

Then it's the compiler's job to insert the appropriate vsetvl() instructions in the flow (and to remove the redundant ones) to ensure everything works as intended.
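For instance - a sketch, reusing the hypothetical names above - consecutive operations with the same type and requested vl should let the compiler keep a single vsetvl, while a change of SEW/LMUL or vl forces a new one:

vint32m1_t a, b, c, d;
vfloat64m1_t x, y, z;
a = vadd_vv_i32m1(b, c, vl_for_i32);  // compiler emits a vsetvl for SEW=32, LMUL=1
d = vadd_vv_i32m1(a, c, vl_for_i32);  // same SEW/LMUL/vl: the second vsetvl is redundant and can be removed
x = vfadd_vv_f64m1(y, z, vl_for_f64); // SEW changed: the compiler must insert a new vsetvl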

Cordially,

rdolbeau commented 4 years ago

@knightsifive For the record, following your comment about add-with-carry (https://github.com/sifive/rvv-intrinsic-doc/issues/8#issuecomment-615358742), BSC kindly added support in their toolchain & it indeed seems to work fine for generating the Chacha20 counter (the full version is still running). In fact, it's a lot nicer than any other implementation :-) (the counter needs to be split in two halves afterwards anyway, so the 2x32-bit + carry propagation is more streamlined).

Edit: to make life easier, codes are available in https://github.com/rdolbeau/EPI-test-codes-vector/

rofirrim commented 4 years ago

vl should be callee-saved. It is not practical for every non-leaf function to save and restore it just in case. Rather, a function that modifies vl should save the previous value at entry and restore it at exit.

It is not like fcsr, which is typically used with global scope.

Also, I forgot to mention this, which states that everything (except vxrm and vxsat) is caller-saved:

https://riscv.github.io/documents/riscv-v-spec/#_calling_convention
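To illustrate the hazard with a caller-saved vl (a sketch; the intrinsic names follow this thread's conventions, and helper() stands for an arbitrary external function):

vint32m1_t va, vb, vc;
int vl = vsetvl_i32m1(n);
va = vadd_vv_i32m1(vb, vc); // uses the implicit vl
helper();                   // caller-saved: the call may clobber vl
vsetvl_i32m1(n);            // so vl must be re-established after every call, just in case
vc = vadd_vv_i32m1(va, vb);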

ebahapo commented 4 years ago

> So from a C programmer's view (again, oblivious of ABIs and registers), vl behaves like an implicit parameter (by value) of the current function which would be implicitly passed to all the V-ext intrinsics?
>
> ...
>
> I know this might be at odds with "vl is a register", but by preserving it across function calls, I understand this is the behaviour we're actually giving it (for a C programmer who uses the implicit intrinsics).

Yes, methinks that this is preferable from a performance perspective. Then, a function that uses the V intrinsics but fails to set vl should raise at least a warning, possibly an error.

> vl should be callee-saved. It is not practical for every non-leaf function to save and restore it just in case. Rather, a function that modifies vl should save the previous value at entry and restore it at exit. It is not like fcsr, which is typically used with global scope.
>
> Also, I forgot to mention this, which states that everything (except vxrm and vxsat) is caller-saved:
>
> https://riscv.github.io/documents/riscv-v-spec/#_calling_convention

Ugh! This is overkill, and I feel that it should be changed.

ebahapo commented 4 years ago

Please, bear with me as I think out loud...

We have this example:

vint32m1_t a, a2, b;
vint64m1_t inc;
int gvl = vsetvl(FIXEDSIZEBLOCK, SEW=32, LMUL=1);
...
a = vsrl_vx_i32m1(b, 13);
a2 = vsll_vx_i32m1(b, 19);
a = vor_vv_i32m1(a, a2); // emulate a 32-bit rotation
a = vreinterpret_vi64m1_i32m1(vadd_vv_i64m1_vl(vreinterpret_vi32m1_i64m1(a), inc, gvl/2));

Were it rewritten thus:

vint32m1_t a, a2, b;
vint64m1_t inc, a3;
int gvl = vsetvl(FIXEDSIZEBLOCK, SEW=32, LMUL=1);
...
a = vsrl_vx_i32m1(b, 13);
a2 = vsll_vx_i32m1(b, 19);
a = vor_vv_i32m1(a, a2);
vsetvl(gvl, SEW=64, LMUL=1); // new vl should be half of gvl
a3 = (vint64m1_t) a; // bit-by-bit copy
a3 = vadd_vv_i64m1(a3, inc);
vsetvl(gvl, SEW=32, LMUL=1);
a = (vint32m1_t) a3; // bit-by-bit copy

Assuming that the casts are trivial, does this make sense?

rdolbeau commented 4 years ago

@ebahapo If FIXEDSIZEBLOCK is smaller than the max vector length for SEW=32, LMUL=1, then the SEW=64 vsetvl will give you a vector length greater than (gvl/2), so the 64-bit add will operate on too many elements - you don't want gvl here, you want gvl/2, so why not ask for it?

Though again - you do re-specify SEW and LMUL in every intrinsic. Why not the requested VL as well? (... https://github.com/sifive/rvv-intrinsic-doc/issues/8#issuecomment-618891937).

Hsiangkai commented 4 years ago

> On the original subject, what is this:
>
> avl = vsetvli_i64m2(n);
> (...)
> va = vadd_i32m1_vl(vb, vc);
>
> going to execute? The compute intrinsics re-specify two out of three implicit parameters (SEW, LMUL), but leave the third (VL) out... It doesn't make much sense to me not to specify the third as well. At least if the vadd() had a VL in it, the user's intent would be clear...

As long as the SEW/LMUL ratio is the same, the number of elements to process is the same. Changing vtype under the same VL is normal, especially when users want to do mixed-width operations under the same VL.

There is a form of vsetvl designed for this purpose. You could refer to https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vsetvlivsetvl-instructions. When rd = 0 and rs1 = 0, vsetvl is used to "change vtype keeping existing vl".
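A sketch of the mixed-width pattern this enables (intrinsic names in the style of this thread; the vint64m2_t type and the i64m2 intrinsics are assumed here by analogy):

vint32m1_t va32, vb32, vc32;
vint64m2_t va64, vb64, vc64;
int vl = vsetvl_i32m1(n);         // SEW/LMUL ratio = 32/1
va32 = vadd_vv_i32m1(vb32, vc32); // operates on vl 32-bit elements
vsetvl_i64m2(vl);                 // SEW=64, LMUL=2 keeps the ratio, so VLMAX - and thus vl - is unchanged
va64 = vadd_vv_i64m2(vb64, vc64); // still vl elements, now 64-bit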

rdolbeau commented 4 years ago

So if I understand correctly, this:

avl = vsetvli_i64m2(n);
(...)
va = vadd_i32m1(vb, vc);

should work as hoped for, but this:

avl = vsetvli_i32m2(n);
(...)
va = vadd_i32m1(vb, vc);

won't, because the previously set vl may be too big for the new SEW/LMUL - it was filling two registers, and now we only have one...

Edit: remove spurious _vl in examples

rdolbeau commented 4 years ago

@Hsiangkai From the specification you pointed out, does this mean that the non-VL-specifying form could be trivially emulated with the VL-specifying form by simply setting VL=0? I.e., vadd_i32m1(vb, vc) could instead be written vadd_i32m1_vl(vb, vc, 0). It would explicitly mean 'keep whatever was there before', and be easily implementable and removable by the compiler. The 'implicit VL' semantic would be retained, but it would be, in a way, 'explicitly implicit' versus the 'explicitly explicit' of using a non-zero value.

Hsiangkai commented 4 years ago

> So if I understand correctly, this:
>
> avl = vsetvli_i64m2(n);
> (...)
> va = vadd_i32m1(vb, vc);
>
> should work as hoped for, but this:
>
> avl = vsetvli_i32m2(n);
> (...)
> va = vadd_i32m1(vb, vc);
>
> won't, because the previously set vl may be too big for the new SEW/LMUL - it was filling two registers, and now we only have one...
>
> Edit: removed spurious _vl in examples

If the SEW/LMUL ratio is changed, you need to consider VLMAX. In this case, VLMAX is reduced from 2 * VLEN / 32 to 1 * VLEN / 32 (e.g., with VLEN = 128, from 8 elements down to 4). The vl will be changed implicitly in vadd_i32m1(vb, vc), so it will operate on fewer elements than expected.

You could refer to https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#constraints-on-setting-vl: "The vsetvl{i} instructions first set VLMAX according to the vtype argument, then set vl obeying the following constraints", where VLMAX is the upper bound on vl.

Hsiangkai commented 4 years ago

> @Hsiangkai From the specification you pointed out, does this mean that the non-VL-specifying form could be trivially emulated with the VL-specifying form by simply setting VL=0? I.e., vadd_i32m1(vb, vc) could instead be written vadd_i32m1_vl(vb, vc, 0). It would explicitly mean 'keep whatever was there before', and be easily implementable and removable by the compiler. The 'implicit VL' semantic would be retained, but it would be, in a way, 'explicitly implicit' versus the 'explicitly explicit' of using a non-zero value.

We could use 0 as a special value for that purpose. However, we have no such special value in the current proposal. We could simulate the explicit-vl intrinsics as:

vint8m1_t vadd_vv_i8m1_vl(vint8m1_t vs2, vint8m1_t vs1, _VL_T vl) {
  vsetvl_i8m1(vl_extract(vl));   // first set vl to the requested value
  return vadd_vv_i8m1(vs2, vs1); // then run the implicit-vl intrinsic
}

If users pass the parameter vl = 0, we will set vl to exactly 0.
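Usage of the simulated intrinsic might then look like this (vl_make() as a constructor for _VL_T is assumed purely for illustration; it is not part of the proposal):

vint8m1_t va, vb, vc;
_VL_T vl = vl_make(vsetvl_i8m1(n)); // request n elements, wrap the granted vl
vc = vadd_vv_i8m1_vl(va, vb, vl);   // expands into a vsetvl followed by the vadd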