riscvarchive / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

ordering scalar load and vector load to same address #551

Closed kasanovic closed 3 years ago

kasanovic commented 4 years ago

The PoR requires that these be ordered by program order.

This might want to be loosened to simplify implementations.

David-Horner commented 4 years ago

Date: 2020/9/25
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~14
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed

551 Memory consistency model for scalar loads and vector loads

In the current PoR, the RVWMO memory model requires that scalar loads and vector loads from the same hart to the same address are ordered following program order. The proposal is to weaken this requirement so that scalar loads and vector loads to the same address can be reordered, simplifying implementations, except for ordered gathers. In particular, the requirement that a younger scalar load not occur before an older vector gather to the same address means the scalar load must wait (or speculate) until the vector gather addresses are determined.
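To make the outcome at stake concrete, here is a minimal sketch of the litmus test implied above. It is a toy model that assumes a single global memory order and one writer hart; it is not the formal RVWMO axioms, and the function and event names are hypothetical, chosen only for illustration. One hart stores x=1 while another performs an older load and then a younger load of x.

```python
from itertools import permutations

def outcomes(allow_same_address_reorder):
    """Enumerate (older_load_value, younger_load_value) pairs.

    Toy model (an assumption, not the RVWMO formalism): every memory
    operation takes effect at one point in a single global order.
    """
    events = ["store", "older_load", "younger_load"]
    results = set()
    for order in permutations(events):
        loads_in_program_order = (
            order.index("older_load") < order.index("younger_load")
        )
        # Current rule: same-address loads must follow program order.
        if not allow_same_address_reorder and not loads_in_program_order:
            continue
        x = 0          # initial value of the shared location
        regs = {}
        for ev in order:
            if ev == "store":
                x = 1  # the other hart's store becomes visible
            else:
                regs[ev] = x
        results.add((regs["older_load"], regs["younger_load"]))
    return results

if __name__ == "__main__":
    print("ordered:  ", sorted(outcomes(False)))
    print("reordered:", sorted(outcomes(True)))
```

In this model the only outcome that the weakening adds is (1, 0): the older load observes the new value while the younger load still observes the old one.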

Discussion centered on how much of an impact this would have on software, and on constructing a case where the change would affect software. In almost all cases where the scalar access is used to read a signaling value from another hart, a FENCE would be required anyway for correct operation, as the synchronization would be associated with the communication of more than one atomic word of memory. Only in the case where the signal is part of an atomically written word of memory (8 bytes max in the current spec), and where the vector read is used to read the same word (perhaps as a vector of bytes), might this cause an issue. This was felt to be relatively rare.
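The rare case described above can be sketched in the same toy style. This is an illustrative model only, assuming a single global memory order; the names `reader_views`, `stale_after_signal`, `OLD_WORD` and `NEW_WORD` are hypothetical. A writer hart atomically stores one word holding a flag and a payload. The reader checks the flag with a scalar load and then reads the whole word with a vector load. The `fence_between` parameter models a read fence between the two loads as a plain ordering constraint.

```python
from itertools import permutations

OLD_WORD = (0, "old")  # (flag, payload) before the writer's store
NEW_WORD = (1, "new")  # written by a single atomic store of the word

def reader_views(loads_may_reorder, fence_between=False):
    """Return the set of (flag_seen_by_scalar, word_seen_by_vector)."""
    events = ["store_word", "scalar_load_flag", "vector_load_word"]
    views = set()
    for order in permutations(events):
        scalar_first = (
            order.index("scalar_load_flag") < order.index("vector_load_word")
        )
        # Program order is enforced by the current rule or by a fence.
        must_stay_ordered = (not loads_may_reorder) or fence_between
        if must_stay_ordered and not scalar_first:
            continue
        word = OLD_WORD
        seen = {}
        for ev in order:
            if ev == "store_word":
                word = NEW_WORD
            elif ev == "scalar_load_flag":
                seen["flag"] = word[0]
            else:
                seen["vector"] = word
        views.add((seen["flag"], seen["vector"]))
    return views

def stale_after_signal(views):
    """True if the flag was seen set but the vector read saw the old word."""
    return any(flag == 1 and vec == OLD_WORD for flag, vec in views)

if __name__ == "__main__":
    print(stale_after_signal(reader_views(loads_may_reorder=False)))
    print(stale_after_signal(reader_views(loads_may_reorder=True)))
    print(stale_after_signal(reader_views(True, fence_between=True)))
```

Under these assumptions the stale view appears only when the two loads may reorder and no fence separates them.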

Another worry arises when a routine performs a sync operation based on a scalar read of a signaling variable and then calls a separately compiled subroutine that reads the data, including the signaling variable, using vectors. In that case the vector read may return inconsistent data. In general the caller is unaware of whether the subroutine uses scalar or vector reads, and the subroutine is unaware that the variable was used to communicate between threads.

While modern programming languages require that accesses to variables used to communicate between harts be annotated to ensure correct compilation, in practice legacy code and incorrect code might fail to include the correct annotations and have a latent bug.

It was noted there are two directions for the ordering.

sl -> vl: older scalar load before newer vector load
vl -> sl: older vector load before newer scalar load

The sl->vl direction represents the signaling-value-check before vector computation case and is easiest to implement in hardware as vector instructions typically access memory later in the pipeline than scalar instructions.

The vl->sl case is the difficult one to implement at high performance, but it is also easier for software to work around with some form of read fence (either a FENCE, an ordered vector access, or just a scalar read of the affected address).

The sentiment was in favor of weakening the memory ordering constraint but more discussion was needed. Potentially only the vl->sl constraint could be weakened.

I am in favour of effectively weakening the scalar/vector vector/scalar load/load order requirement.

However, this cannot be performed in isolation without regard to the rest of the RVWMO dependency requirements.

RVI has section 14.3, Source and Destination Register Listings: 5 pages detailing, identifying and categorizing dependencies between implicitly and explicitly opcode-identified persistent stores, including CSRs.

These dependencies form a critical component of the RVWMO specification.

They constrain global memory order for memory data entering and exiting a potentially lengthy sequence of non-memory accessing instructions.

They are also based on an intuitive engine: the hypothetical device that executes instructions in program order, the "hart".

The rules and constraints are crafted to accomplish results that are "strong enough to support programming language memory models".

For the Vector extension, we have not yet stipulated what the Vector-specific RVVWMO requirements are. This is a necessary step; it will be instrumental in shaping or tempering the explicit WMO constraints.

To me the pivotal question is what execution model the Vector Engine follows. Does it need to be constrained to support legacy programming language memory models? Should it rather be envisioned as a novel model freed from past bondage, or, if not to that extreme, freed from some of those constraints?

A [simple/comprehensive] specific conceptual vector model may eliminate a swath of RVWMO rules. Specifically, idealizing the vector processor as distinct from the hosting "hart", as an autonomous co-processor as far as Memory order is concerned. This functions conceptually as a set of independent "hardware threads" coordinating among themselves, and also collectively to the host hart to cause the required vector behaviour.

I believe "register" dependency must still be considered, at the element level and not solely named registers.

We should not profess to be "RVWMO, except vl -> sl, and except sl -> vl (except for ordered indexed reads), and in-order precise execution trapping except ..., and ..."

Rather we must define a model that intuitively allows all the optimizations we believe are necessary for a first class Vector design.

kasanovic commented 3 years ago

Group decided to stay with active RV MCM (RVWMO, RVTSO) for inter-instruction ordering. Any relaxations would be in additional extensions.