Encode ELEN to sub-extension, Zve64, Zve128, Zve256...

kito-cheng commented 4 years ago

Compiling or writing vector code should know the ELEN, but currently we didn't encoding the info anywhere.

Of cause we can add a new field on RISC-V ELF attribute, but I think better way is encoding that in the arch string, so that we can using -march instead of adding new option to assembler/compiler to indicate the ELEN.

Proposal for the sub-extensions: Zve32 is not needed since ELEN should large or equal to 32. Zve64 implied if rv64 or D-extension is present. Zve128 implied if rv128 or Q-extension is present. ...

hanna-kruppe commented 4 years ago

More precisely, the static value that must be known to compilers (and is useful for assemblers, linkers, and loaders for diagnostic likely errors) is the minimum value of ELEN required by the code, as binaries should (generally) be portable to machines that support larger element widths, too.

This value MinELEN is part of the ABI. That is, two object files compiled with different MinELEN values are generally ABI-incompatible, at least once vector function calls enter the picture. Extensions seem like the wrong tool to model this, in the same way that D subsumes F but -mabi=lp64d is still incompatible with -mabi=lp64f.

kito-cheng commented 4 years ago

For floating point, it's OK to combine two objects for different minimum FLEN requirement but same ABI, the result is the minimum FLEN will increase into largest value amount the objects.

e.g. rv32ifd / ilp32f link with rv32if / ilp32f is OK, but the minimum FLEN requirement of the final object would become 64 (or rv32ifd)

So I think it's same situation for ELEN, Zve64 link with Zve128 is fine, but the you will got an object require Zve128 instead of Zve128, which can't run on the machine only support ELEN=64.

But I guess your concern is come from the argument and return value for vector value, it's a little different situation from the floating point, we didn't define ABI for different ELEN yet, and I think we don't need different ABI for ELEN, the point is you can't using vector type with SEW 64 but have no Zve64, so there is no way to break the ABI even link objects with different ELEN, just increase the minimum ELEN requirement is fine.

hanna-kruppe commented 4 years ago

It's true that there's no vectorcall ABI definitions yet, but I expect that whatever vectorcall ABI we end up with will need to depend on MinELEN. That is because:

It must be possible to mix vectors with different element sizes but equal number of elements, e.g. one vector of 32 bit integers and another vector of 8 bit integers both having VLEN/32 elements.
That means we need supports for vectors where num_elems * elem_width is less than VLEN, i.e., vector types that do not completely fill even one vector register. By far the most sensible way to handle that is to sign- or zero-extend the elements to match the largest element width, as that's the only option that makes the vector elements line up correctly and is supported by the ISA.
How such a smaller vector is laid out in a register can't sensibly depend on "context" such as what other vector element types occur in the function (that may not even be visible in the signature), so the best choice we have is always extending the elements to MinELEN. And that means a caller working with MinELEN = 32 can't pass such a vector type to a callee working with MinELEN > 32 or vice versa.

kito-cheng commented 4 years ago

It must be possible to mix vectors with different element sizes but equal number of elements, e.g. one vector of 32 bit integers and another vector of 8 bit integers both having VLEN/32 elements.

I think the solution is using different LMUL/type to deal with this instead of promote the type, because promote type also mean you need to increase the LMUL.

Let's using type system in EPI's proposal for further discussion :)

If i understand correctly, there is would be some type can't passing in argument if we promote vector type, e.g. we can't passing __epi_16xi8 under ELEN=64, since it require __epi_16xi64, but there is no LMUL 16.

And promotion also cause some performance penalty , you need to promote it before function call, and then narrowing the value type before using the value.

So I think promotion is not necessary on vector type, and then there is no ABI issue between different ELEN.

I really want to avoid ABI combination for different ELEN, one reason is e_flags is very limited resource, we need take 3 bits if we decide encode ELEN on ABI.

hanna-kruppe commented 4 years ago

The EPI proposal actually gives a much more clear-cut example of the ABI depending on MinELEN: Every single type that exists under both MinELEN=32 and MinELEN=64 (not all do) has an ABI that depends on the exact value of MinELEN. For example, __epi_4xi16 exists under both, but under MinELEN=32 it's an LMUL=2 register group while under MinELEN=64, it's one (LMUL=1) register. Not only do they use registers differently, they also have different sizes, so not even passing in memory will help.

I really want to avoid ABI combination for different ELEN, one reason is e_flags is very limited resource, we need take 3 bits if we decide encode ELEN on ABI.

I agree that having multiple vector ABIs is very undesirable (not just for e_flags encoding space, but also for binary compatibility) but I don't really see a way around it.

Finally, to clarify my original point:

If i understand correctly, there is would be some type can't passing in argument if we promote vector type, e.g. we can't passing __epi_16xi8 under ELEN=64, since it require __epi_16xi64, but there is no LMUL 16.

__epi_16xi8 is a perfectly sensible vector type to have on its own, as it corresponds to filling one vector register with as many 8 bit integers as you can. You can't write (or auto-vectorize) a program using __epi_16xi8 if you also want equal-length vectors with 64 bit elements, as no such type exists (I suppose one could emulate it but nobody's seriously proposing that). But outside of that situation, __epi_16xi8 can and will be used frequently and so it definitely should be supported by any vectorcall ABI.

The types I was talking about are not actually included in the EPI proposal. Extrapolating their naming scheme, an example of such a type would be __epi_2xi8. I propose that such a type should be implemented by sign- or zero-extending to __epi_2xi32 (I misspoke earlier, there's no reason to extend to MinELEN, just to a large enough width that the whole vector has VLEN bits). This would not just happen at ABI boundaries, it's also a perfectly fine representation choice for such vectors within functions (the same way integers < XLEN are handled in scalar code).

This representation requires extra extensions/truncations where they are needed for small integer types in the scalar ISA and on loads and stores. But if these types are needed, then the only other alternative is keeping the narrow elements packed in the lowest-numbered lanes instead (effectively only ever using the lowest half/quarter/... of the register). As discussed in the EPI intrinsics document, this has a different cost, namely that it requires shuffles for any widening or narrowing operation. I don't want to write a whole study here comparing these approaches, but briefly I expect that code that has a need for these types is very likely to do widening and narrowing operations on them, and the cost of shuffles for those seems worse than the cost of sign-/zero-extension for other operations.

Either option can also be performed manually, removing the need for compilers and ABIs to worry about it, but there are some nice optimizations that can be done if they are built-in (e.g. removing redundant re-extension of arguments and return values). It would make sense to not support these types if that would allow us to make the ABI independent of MinELEN, but as pointed out above there are other (more fatal) obstacles to that.

kito-cheng commented 4 years ago

The EPI proposal actually gives a much more clear-cut example of the ABI depending on MinELEN: Every single type that exists under both MinELEN=32 and MinELEN=64 (not all do) has an ABI that depends on the exact value of MinELEN. For example, __epi_4xi16 exists under both, but under MinELEN=32 it's an LMUL=2 register group while under MinELEN=64, it's one (LMUL=1) register. Not only do they use registers differently, they also have different sizes, so not even passing in memory will help.

It sounds like it because the type system design in EPI proposal? what if we have consistent type system between different ELEN? like vintm_t scheme, vint16m1_t to present SEW = 16 and LMUL = 1, so that we can decoupling the type system from the ELEN?

The types I was talking about are not actually included in the EPI proposal. Extrapolating their naming scheme, an example of such a type would be __epi_2xi8. I propose that such a type should be implemented by sign- or zero-extending to __epi_2xi32

I thought VL and mask is the solution for deal with element less than MAXVL?

This would not just happen at ABI boundaries, it's also a perfectly fine representation choice for such vectors within functions (the same way integers < XLEN are handled in scalar code).

I think it's fine to have different presentation in function between different functions, like rv32ifd/ilp32f and rv32if/ilp32f, rv32ifd using FPR to compute double value and rv32if using GPR and libcall to compute the double value, but they all following same rule to passing argument / return value, so it's fine to work/link together.

One reason scalar code need to extensions/truncations is because RISC-V don't provide native operation for those types like 8/16 bits integer, so we need to extension data to XLEN and then truncation back to make sure the value is not out of range, just notice, here is talking about code gen level not language standard level.

But we have native 8/16 bits operation on vector extension, so I think extension/truncation is not needed for compiler code gen unless user explicit require that, e.g. programmer use lb in SEW 32, it result data must extend to 32 bits from 8 bits data.

Thanks :)

hanna-kruppe commented 4 years ago

(Apologies for the very late reply.)

It sounds like it because the type system design in EPI proposal? what if we have consistent type system between different ELEN? like vintm_t scheme, vint16m1_t to present SEW = 16 and LMUL = 1, so that we can decoupling the type system from the ELEN?

The EPI proposal is designed that way for good reasons, and the same problem exists even if we name the user-facing types differently.

In both LLVM and GCC, the number of elements or bytes in a variable-sized vector is represented as compile-time constant multiple of an integer factor that's not known until runtime, but assumed to be fixed across all all parts of the program and throughout any given execution of the program. In LLVM, this factor is called vscale. GCC has a more general support for quantities that are a linear combination of multiple unknowns (poly_int), but since different unknowns are not comparable, that generality is of no use for dealing with VLEN alone and I'll ignore it from here on.

In these frameworks, the only way to distinguish different vector types (aside from their element types) is by the constant multiples of the single unknown factor. For example (focusing on element counts, byte sizes are similar), vint16m1_t has twice as many elements as vint32m1_t, so the former has c1 * x elements and the latter has c2 * x elements where x is the unknown factor and c1 = c2 * 2.

We need to pick one unique meaning for x that makes these relationships work for all types we care about, and we don't have much freedom: we cannot have fractional multiples of x (e.g. there's no 0.5 * x), so have to make x something small enough that the smallest quantities we need to represent amount to 1 * x. For SVE, this leads to defining x as the number of 128-bit chunks in an SVE register. For RISC-V, it depends on MinELEN:

If we only use element widths up to 32, the smallest quantity is the number of elements in vint32m1_t (= VLEN/32), so that's our x and every vector type VT has c * x elements for c = (elements in VT) / (elements in vint32m1_t).
But if SEW=64 is an option too, x must be the number of elements in vint64m1_t and every vector type with SEW <= 32 has its constant c scaled up by a factor of 2 compared to the previous scenario.

Thus, as in the EPI type system, the interpretation of the same source-level vector type changes depending on MinELEN. Note that we cannot erase these distinctions by defining x "as if" SEW takes on the largest possible value we anticipate (e.g., pick the second option above even when compiling for MinELEN = 32). Since x must itself be an integer, this would imply that vectors of smaller elements have more elements than architectually guaranteed. For example, if we define x = VLEN/64 then vint32m1_t has 2 * x elements so it has at least 2 elements, but that's not true on an implementation with VLEN = ELEN = 32.

So the MinELEN-dependent differences the EPI intrinsics type system surfaces are inherent to the way GCC and LLVM reason about variable-sized vectors internally. While it's debatable whether that should be exposed in the user-facing type names, no matter what names you expose to the user, they will map to different (incompatible) types within the compiler depending on MinELEN.

At first glance these differences exist only in the compiler's internal data structures and should not influence the ability to link translation units together, but LTO operates on IR. So if we have one translation module compiled for MinELEN = 32 and another compiled for MinELEN = 64, doing LTO between them would mix two different interpretations of x: what was supposed to be the same vint32m1_t vector type in both translation units are now different types at the IR level.

It is tempting to try to to fix this up during LTO by rewriting the vector types in the modules compiled for a smaller MinELEN, but the constant factors occurring in these vector types can and will "leak out" and deeply influence the surrounding code in ways that are impossible to reverse-engineer and adjust after the fact. At least in the LLVM community, many people were very concerned about this and made the addition of scalable vector types dependent on side-stepping these issues by requiring x (or in LLVM terms, vscale) to be the same value across all parts of the program (even across translation units). I don't know if the GCC community has had the same discussions but in any case, their approach is similar enough that all the same problems can occur there.

(I focus on LTO here because I assume nobody wants to lose the ability to do LTO, but do note that similar problems could still occur even without LTO. They're just more convoluted so I will refrain from making this comment even longer by describing them.)

So in conclusion, I do not see how (current, mainstream) compilers can support linking between translation units with different MinELEN settings. Well, we could define a platform profile that requires that VLEN is e.g. a multiple of 128 and on such platforms link MinELEN={32,64,128} modules together (because we could then define x = VLEN/128), but that would not be compatible with implementations with smaller VLEN, where we'd still have the problem.

kito-cheng commented 4 years ago

@hanna-kruppe thanks you share the compiler implementation detail especially for the LLVM part, I am working on GCC, so I am not family with the LLVM internal so much.

We are using and implementing type system like vint<ELEN>m<LMUL>_t, so from the language extension view, MinELEN isn't matter for such type system.

But I understand your concern is about the compiler implementation, so I think it would be better to make sure I understand correctly before further discussion, I'll try to describe the problem in my word again, let me know if something I misunderstand :)

Assume we are using ELEN=64 type system:

|  Type Name |     # of element |
|            |                  |
|  vint8m1_t | vscale x 8 x  i8 |
| vint16m1_t | vscale x 4 x i16 |
| vint32m1_t | vscale x 2 x i32 |
| vint64m1_t | vscale x 1 x i64 |
| vint64m2_t | vscale x 2 x i64 |
| vint64m4_t | vscale x 4 x i64 |
| vint64m8_t | vscale x 8 x i64 |

So point is: vint32m1_t = <vscale x 2 x i32> are weird or kind of definition violation for LLVM language under MinELEN = 32.

because accroding the definition in LLVM Language Reference: <vscale x 2 x i32> means vector with a multiple of 2 32-bit integer values.

But it could be just one element under VLEN=ELEN=32 configuration, so in this case we should substitute vscale to 0.5 for modeling that right, but the value would be unreasonable for <vscale x 1 x i64>, it would be evaluate to a non-integer value.

I assume vscale never evaluate during compilation time, so I am not sure why it would be a problem? do you mind share your experience and thought?

In our current GCC implementation experience, the size or # of element more like just a placeholder to identify the type and represent the corresponding relation between different vector type, since the polynomial won't evaluate during the compilation time.

And I also interested your thought about fractional LMUL, I guess it would be a big problem on current EPI's type system?

https://github.com/riscv/riscv-v-spec/issues/376

Last, a little off topic, the term of MinELEN is little confusing me :P because it seems more like MaxELEN in the compiler implementation view, we use that to canonical the # of unit/element, so in fact it decide by the largest element type in the vector type system, not smallest type, but I think it not affect this conversation/discussion.

Thanks :)

kito-cheng commented 4 years ago

@Hsiangkai Do you mind update the status about LLVM, I believe current LLVM implementation is not a problem, especially for LTO issue.

kasanovic commented 3 years ago

Has this issue been resolved, and do the new extension names in commit https://github.com/riscv/riscv-v-spec/commit/808a6f83b72d92757ef4c93fcdf076ed99bbecae help?

kito-cheng commented 3 years ago

@kasanovic Thanks, I saw spec require VLEN >= 128 and EEW must support 64 for V now, I think it good enough for vector intrinsic implementation :)

riscvarchive / riscv-v-spec

Encode ELEN to sub-extension, Zve64, Zve128, Zve256... #360