Clarify memory access behavior

christian-herber-nxp commented 4 days ago

from @ved-rivos:

This non-normative text "Therefore, implementations are not required to ensure atomicity when storing to memory. However, writing to both registers written by a 64-bit load must happen atomically to ensure fault handling is possible." implies that the 64-bit load needs to be atomic. This could be reworded to "For the purpose of raising misaligned-address or access exceptions, the address is considered to be naturally aligned if it is a multiple of 8. Even if naturally aligned, the memory access might not be performed atomically. If the address is a multiple of 4, then each 4-byte access is performed atomically. The LD instruction must however write the loaded data to the pair of destination registers atomically to ensure fault handling is possible."

christian-herber-nxp commented 4 days ago

@ved-rivos Thanks for the suggestion.

I have two comments on the proposed text:

It could be perceived that it is mandatory to perform naturally aligned loads as two 32b loads. We do not want to imply this, as there will be many implementation with a 64b data bus that can do this atomically.
FLD/FSD do not make such specific statements. They give the guarantee to execute atomically for XLEN>=64. Otherwise, it might not be atomic. It seems to me that breaking the access into bytes would be a legal implementation of D.
If we want to mandate that if cracked, 4B aligned instructions must be broken into 2x 32b accesses, then this should not be a note but normative text.

ved-rivos commented 4 days ago

When the address is 4 byte aligned we would want the implementation to guarantee that if the implementation cracks the instruction into two 32 bit accesses. Making this normative enables software to be able to rely on this. This would allow for an optimizing compiler using this instruction to load two adjacent 32-bit integer variables into the pair destination. We should not preclude this usage regardless of whether this usage will ever become common. FLD/FSD instructions load or store a single double precision floating point number. When XLEN < 64, the load or store of the single double precision floating point number is not atomic even if it was done using two atomic 32-bit loads. In this case requiring the each of the 32-bit accesses to be atomic would provides no additional benefit.

This does not require that an implementation not do a 64-bit load if it has a 64-bit data bus. Such an implementation would already satisfy the stated atomicity requirement and is not precluded.

tovine commented 4 days ago

Copying my comment from the other issue here. Note that I also missed this part yesterday, which did colour my impression:

For the purpose of raising misaligned-address or access exceptions, the address is considered to be naturally aligned if it is a multiple of 8.

I see this covers the question of trapping misaligned accesses, so my main concern is gone.

Original comment:

Section 2.1:

This non-normative text "Therefore, implementations are not required to ensure atomicity when storing to memory. However, writing to both registers written by a 64-bit load must happen atomically to ensure fault handling is possible." implies that the 64-bit load needs to be atomic. This could be reworded to "For the purpose of raising misaligned-address or access exceptions, the address is considered to be naturally aligned if it is a multiple of 8. Even if naturally aligned, the memory access might not be performed atomically. If the address is a multiple of 4, then each 4-byte access is performed atomically. The LD instruction must however write the loaded data to the pair of destination registers atomically to ensure fault handling is possible."

I think this sentence could be worded a bit differently:

If the address is a multiple of 4, then each 4-byte access is performed atomically.

To me this reads as if trapping a 4-byte aligned access as misaligned is not allowed, which is not the intention; forcing simple implementations with 64 bit data bus interface to crack these instructions into multiple micro ops (including buffering to do the atomic register update) would add quite a bit of extra complexity and make it less attractive to implement in those cases.

Maybe something like this would be better?

If the address is a multiple of 4, then each 4-byte part of the access is performed atomically.

christian-herber-nxp commented 4 days ago

When the address is 4 byte aligned we would want the implementation to guarantee that if the implementation cracks the instruction into two 32 bit accesses. Making this normative enables software to be able to rely on this.

I think here we need to drill deeper. Even if mandating it is broken into 2x 32b accesses, that does not imply anything about the latency of this instruction. It could really be slower than 2x LW. Without additional information, the compiler can only rely on the fact the 1x LD has smaller code size than 2x LH. This is the case independent on how it is decomposed.

This would allow for an optimizing compiler using this instruction to load two adjacent 32-bit integer variables into the pair destination. We should not preclude this usage regardless of whether this usage will ever become common.

This is totally realistic (although probably not too common). Again, this is possible no matter how the access is broken down. I also do not really see what difference that makes over a 64b datum which is 4B but not 8B aligned.

FLD/FSD instructions load or store a single double precision floating point number. When XLEN < 64, the load or store of the single double precision floating point number is not atomic even if it was done using two atomic 32-bit loads. In this case requiring the each of the 32-bit accesses to be atomic would provides no additional benefit.

This does not require that an implementation not do a 64-bit load if it has a 64-bit data bus. Such an implementation would already satisfy the stated atomicity requirement and is not precluded.

We are somehow not yet on the same page. Let me try to get closer:

Do you want to mandate that a 4B aligned LD access never traps? (related to @tovine in https://github.com/riscv/riscv-zilsd/issues/29#issuecomment-2205874436_)
- If yes: Shouldn't we then change the natural alignment to 32b?

I feel what is wanted is to state that if an implementation performs a 4B load that is 4B aligned on a part of the 64b data, it may not be broken down any further. It would still be allowed to trap and it would still be possible to load 64b atomically.

I recall that the ARC posted some ideas on how to standardize such behavior with some new extensions a few weeks ago. This was the original message: https://lists.riscv.org/g/tech-privileged/message/1990

If I understand you and the proposal correctly, what you would like is to enforce Zilsm4b for Zilsd?

ved-rivos commented 3 days ago

I also do not really see what difference that makes over a 64b datum which is 4B but not 8B aligned.

In this use case its not a 64b datum. Its two 32b (e.g., struct{ uint32_t a; uint32_t b;}) datum that are being loaded using this instruction into a pair of registers. The compiler can do that if it can rely on each 32b datum being loaded or stored atomically.

Do you want to mandate that a 4B aligned LD access never traps?

For purposes of alignment checks the access is an 8B access. Implementation is not required to support non 8B aligned accesses using these instructions. If 8B alignment is not required and if the address is a multiple of 4B then the access must be performed as either 1x8B atomic access or 2x4B atomic accesses.

christian-herber-nxp commented 3 days ago

In this use case its not a 64b datum. Its two 32b (e.g., struct{ uint32_t a; uint32_t b;}) datum that are being loaded using this instruction into a pair of registers. The compiler can do that if it can rely on each 32b datum being loaded or stored atomically.

Unless that struct is declared volatile, the compiler should be able to this also if the hardware is breaking it into byte accesses, or what would be blocking it from doing so? What about RV64? LD is not required to be broken into two 32b accesses if aligned to 4B but not 8B?

I understand that it makes sense Zilsd is implemented this way, and I think this is anyway the likely behavior. I am just questioning whether we should mandate that (or leave it to an additional extension like Zilsm4b to specify the behavior).

ved-rivos commented 3 days ago

If the accesses can be broken into byte access then they are not atomic for multiprocessor consistency. In RV64, the LD loads a single 64-bit data value and it is not required to be atomic if it is not 64-bit aligned and it does not load into a pair of registers. The use case here is of using the LD instruction to load a pair of 32-bit data values into a pair of registers using a single instruction but preserving the property that would be achieved if they were otherwise loaded using two 32-bit loads. We would like this property to be provided by these instructions to not preclude the use case of a compiler using these instructions to load or store 2x32-bit data values i.e., rely on the load or store of each of these 32-bit data values to execute atomically.

christian-herber-nxp commented 3 days ago

I cannot say I agree but I can make the change if it is the whish of the ARC. I feel we are complicating things by putting too specific requirements that do not have significant impact

I added a draft pull request which I believe would satisfy your requirement. I do think this is now in conflict with the section that describes the software view of the instruction, because this states accesses are bytes or any groups of bytes.

feel free to also continue the discussion on the PR.

ved-rivos commented 3 days ago

The section 2.5 also needs to be updated to be in sync with the PR. The PR looks good to me but I have left a comment.

christian-herber-nxp commented 2 days ago

When updating the PR, I had to think: are we not overcomplicating things? I do not disagree that it is good to split 4B aligned accesses into 2x 32b accesses, simply for performance reasons. Honestly, it is hard to believe an implementation will choose to do something different. I do not quite share the concerns related to atomicity. When this is truly of concern, you will likely have to make use of the A extension.

Currently, it was easy to describe the software view. There are almost too many scenarios to efficiently elaborate (depending on data bus width and which kinds of accesses trap.

Finally, I don't think compilers would really gain much, as they would not know if the misaligned instructions will trap and what will happen in that trap (likely it will be byte accesses).

ved-rivos commented 2 days ago

The section 2.5 does not state more than what this PR and previous sections of the specification already states. To avoid repeating the specification, my suggestion is to drop the section 2.5. About atomicity - the compiler should be able to rely on load or store of two aligned 32-bit variables in same way it can rely on them being atomic if two separate LW were used. The responsibility then rests on the compiler to aligned its data structures.

christian-herber-nxp commented 2 days ago

the compiler should be able to rely on load or store of two aligned 32-bit variables in same way it can rely on them being atomic if two separate LW were used

not quite. because it is legal for the hart to raise a misaligned exception on 4B but not 8B aligned accesses, can the compiler really rely on what will happen?

ved-rivos commented 2 days ago

This specification update guarantees: "If the address is a multiple of 4, then each 4-byte access is performed atomically." The compiler cannot rely on the 8B access being atomic - even if naturally aligned. However it can rely on:

If address is 8 byte aligned then the two 4 byte accesses will be atomic
If address is 4 byte aligned then the two 4 byte accesses will be atomic
Otherwise the instruction provides no atomicity guarantees.

christian-herber-nxp commented 2 days ago

Ok. In my mind, the hardware is allowed to raise an address misaligned exception if an access does not follow the natural alignment. If we want to overrule this, we should be very explicit.

Isn't it the cleaner solution to just state that the natural alignment of these instructions is 4B?

ved-rivos commented 2 days ago

The hardware may raise an address misaligned exception if address is not 8 byte aligned: "For the purpose of raising misaligned-address or access exceptions, the address is considered to be naturally aligned if it is a multiple of 8". The resultant trap, if an exception was raised, may either lead to an emulation of the instruction by the trap handler or termination of execution. When it leads to an emulation, the emulator is required to follow the specification i.e. if the address is 4 byte aligned then it must emulate using two LW instructions and such emulation is transparent to software. Stating that a doubleword memory operation is aligned if address is word aligned is not very clean.

christian-herber-nxp commented 2 days ago

got it. i updated the PR, and kept the SW view section, as otherwise i think it is too easy to understand something else.

ved-rivos commented 2 days ago

Looks good to me.

riscv / riscv-zilsd

Clarify memory access behavior #34