Open kasanovic opened 4 years ago
For opaque save/restore code (i.e., code that only moves data and does not try to interpret it), it does not matter which EEW is used, provided the same EEW is used for both the save and the restore; the restore cannot know what type the register represents.
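To make the round-trip property concrete, here is a minimal sketch in Python. It assumes a hypothetical big-endian implementation in which a load or store with a given EEW reverses the bytes inside each element on the way between memory and the register; the names `swap_elements`, `spill`, and `fill` are illustrative only and do not come from the spec.

```python
def swap_elements(data: bytes, eew_bytes: int) -> bytes:
    """Reverse the bytes inside each eew_bytes-wide element.

    Assumption: this models the memory<->register byte permutation of a
    hypothetical big-endian machine; it is not taken from the spec.
    """
    if len(data) % eew_bytes:
        raise ValueError("length must be a multiple of the element width")
    return b"".join(
        data[i:i + eew_bytes][::-1] for i in range(0, len(data), eew_bytes)
    )


def spill(reg: bytes, eew_bytes: int) -> bytes:
    """Store a whole register to memory using the given EEW."""
    return swap_elements(reg, eew_bytes)


def fill(mem: bytes, eew_bytes: int) -> bytes:
    """Load a whole register from memory using the given EEW."""
    return swap_elements(mem, eew_bytes)


reg = bytes(range(16))  # a 128-bit register with distinct byte values

# Same EEW on both sides: the register contents survive the round trip.
print(fill(spill(reg, 4), 4) == reg)  # True

# Spill with EEW=32 but fill with EEW=8: the contents come back permuted.
print(fill(spill(reg, 4), 1) == reg)  # False
```

In this model any EEW works for opaque code as long as the spill and the fill agree, which is the point made above; a mismatch only becomes visible when the two sides differ.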
For the case where a vector register spill and restore are in the same scope with a known EEW, having only a single store with SEW=8 penalizes big-endian machines. For big-endian, we would prefer that stores also encode EEW, so that a whole-register spill/fill sequence could use the correct EEW and avoid an internal data rearrangement on use of the refilled register.
To better support big-endian systems with wide vector data paths and internal data rearrangement, we might need to define whole-register stores with an encoded EEW as well, and also specify that EEW has an architectural effect on loads.
Vector library code cannot in general be ported unchanged between big-endian and little-endian systems, so this proposal would not create a new problem.
TL;DR; I see this as a v1.1+ issue. Why even support big-endian vector?
Why support big-endian vector? Big-endian machines could still run a little-endian V implementation.
As mentioned, at the byte level the endianness converges. That means O/S use of byte operations for clear/set/copy is transparent. This is likely the OS default (with class level enhancements).
However, even at higher granularity, these operations are transparent, as noted: when SEW does not change.
SEW change as it affects code byte order is a relatively small (but important) segment of the code base. A typical case is transparent porting from another architecture. If the other architecture is little-endian, such porting is a non-starter. The basic assumption of memory to register order is violated.
Perhaps a big-endian V variant can be devised, but I see it as a v1.1+ issue.
Perhaps, "Categorize issues" should be first on the agenda?
On further reflection, I think that big-endian machines should just hold values in memory byte order in vector registers, and that ALU operations should operate on operands using big-endian byte order, i.e., a big-endian 16b add would treat byte 0 of the vector register as the most significant byte of element 0. This has no cost for single-endian machines and means that big-endian machines can do the same tricks with type-casting of vector register values as little-endian machines. Bi-endian machines would have to permute ALU operand bytes depending on the selected endianness, but vector load/store pathways would remain identical between endiannesses (I think).
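As an illustration of this idea, the sketch below models a vector register as a plain byte string held in memory order and lets only the ALU decide how to interpret each element. The helper names `elements` and `vadd` are hypothetical, and the model ignores everything about a real implementation except byte order.

```python
def elements(reg: bytes, sew_bytes: int, byteorder: str) -> list:
    """Interpret the register bytes as integers of width sew_bytes."""
    return [
        int.from_bytes(reg[i:i + sew_bytes], byteorder)
        for i in range(0, len(reg), sew_bytes)
    ]


def vadd(a: bytes, b: bytes, sew_bytes: int, byteorder: str) -> bytes:
    """Element-wise wrapping add; byteorder is 'big' or 'little'.

    Assumption: the register always holds bytes in memory order, and only
    the ALU's interpretation of each element depends on endianness.
    """
    mask = (1 << (8 * sew_bytes)) - 1
    out = bytearray()
    for x, y in zip(elements(a, sew_bytes, byteorder),
                    elements(b, sew_bytes, byteorder)):
        out += ((x + y) & mask).to_bytes(sew_bytes, byteorder)
    return bytes(out)


a = bytes([0x00, 0xFF, 0x12, 0x34])
b = bytes([0x00, 0x01, 0x00, 0x01])

# Big-endian 16b add: byte 0 is the most significant byte of element 0,
# so 0x00FF + 0x0001 carries into byte 0.
print(vadd(a, b, 2, "big").hex())     # 01001235

# Little-endian 16b add on the same register bytes: 0xFF00 + 0x0100 wraps.
print(vadd(a, b, 2, "little").hex())  # 00001235

# Type-casting the same register to one 32b element needs no data movement.
print(hex(elements(a, 4, "big")[0]))  # 0xff1234
```

Because the register bytes never move in this model, reinterpreting a register at a different SEW is just a different call to `elements`, which mirrors the type-casting trick available on little-endian machines.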
Hi Krste,
I don't understand the implications of what you are saying here. Is the effect of it that the entire vector register bank is either big or little endian? Is your comment just about whole-register operations (the title of this issue) or all loads and stores?
Imagine a scenario where there is a buffer of 4-byte big-endian values that I want to convert to little endian. Using what I previously presumed to be the case (endian-swap on load/store), I could envisage doing this transformation on a bi-endian machine by:
Is this sequence still valid in your modified implementation? If not, what is the best way to do it?
Also, what is the effect of dynamically changing endianness on vector registers that contain index values for indexed loads and stores? Does switching endianness effectively corrupt these indices?
Thanks.
Tagging for post v1.0.
Big-endian implementations do not have the property that bytes in a vector register are held in the same order as in memory, and changing SEW will expose big-endianness. The whole-register move instructions have an EEW encoding to help machines with internal data rearrangement. On little-endian machines this can be treated mostly as a hint or ignored, and the instructions can be implemented using the existing unit-stride load implementation with this EEW. However, on a big-endian machine, the whole-register load/store cannot reuse regular load behavior and treat the EEW as a hint, because the register might have been stored with a different EEW.
To be consistent, the instructions should be defined to always move bytes, regardless of EEW. This still allows a little-endian implementation to reuse load behavior and rearrange data based on the load EEW hint, but means big-endian machines will almost always pay the penalty of dynamic data rearrangement, given that data must be reloaded as SEW=8.
For little-endian machines, this is a non-issue (#470).