Don't assume cache line sizes are fixed during program execution

... not even from one instruction to the next one.

Any scheme where software reads the cache line size from a CSR or queries to OS for it and then remembers it is prone to failure as soon as you have multi-core heterogeneous systems with different cache block sizes and process migration between them.

It doesn't make any difference whether the cache line size is queried once at the start of the program, before every loop, or even immediately before/after the CMO instruction. The process can and eventually will get migrated at exactly the wrong time.

This issue caused actual extremely hard to reproduce and debug system crashes on one model of mobile phone in a previous job. The phone contained standard ARM "LITTLE" cores and company designed "big" cores. The cores had different cache line sizes. When the problem was diagnosed ARM was asked how they dealt with SoCs with cores with different line sizes. Their answer "We don't do that!"

I think it's an entirely reasonable thing to do and should be allowed for in the design of CMOs intended to be used in cores from many organisations over a long period of time.

My suggestion is that the actual CMO instruction should return the number of bytes it operated on for that particular execution -- and hence the amount the pointer should be advanced by.

If the address given for the CMO is in the middle of a cache line then the return value should be the number of bytes in the rest of the cache line, to allow software to align the pointer to the cache line start for the next iteration.

In the case of a destructive operation such as DCBZ the hardware could choose whether to zero the partial block and report that as normal, or the return value could could somehow indicate that nothing was done. Software could then either ignore it (if it doesn't really care whether the contents are zero or not and the line is either already in the cache or else will be fetched as usual when next referenced) or else manually zero those bytes. The most natural way to indicate this might be to return the negation of the number of bytes that should have been operated on but weren't. Or perhaps set the high bit, which would allow an unconditional & 0x7FF or similar for software that doesn't care (would fail if cache lines can ever be 2K or more).

NB this can be needed on any iteration, not only the first, if the process is migrated from a core with a small cache line size to a core with a large cache line size.

This is a good point. Because of migration, you do want an atomic way of determining how much data was operated on. Reading a CSR and executing a CMO is not atomic, even if you read the CSR before, after, or both.

I would generally advocate for returning the number of bytes operated on, independent of the start address, however. This allows a simple implementation to return a constant rather than put an incrementor/adder in the result path. For most designs (?), the result would represent an aligned-block of memory that contains the address (but I suppose some SW might want a different interpretation?). SW would be responsible fixing up the next address (most likely just adding the return value).

This works for certain use cases, but wouldn't work if you wanted to operate on a precise number of bytes (e.g. memzero).

I have a lengthy proposal both on instruction shape and on a way to deal with time-variant cache line sizes that will be getting dropped in the group a bit later this week (a quick pass is being done now to remove spelling/grammar mistakes). The proposal provides for time-variance where it's safe and well defined to do so.

Look for this proposal sometime this week.

Additionally, the cache line size / number of bytes affected may also differ at different levels of the cache hierarchy, or different caches at the same level (e.g. Instruction versus Data caches).

I wonder whether a CMO returning the number of bytes affected is too CISC-y.

Could software instead do the Right Thing (TM), i.e. functionally correct but simpler and slower, if the cache line size discovery mechanism returned a mask of all the cache line sizes in use in the coherence domain?

It seems like "differing cache lines sizes in a system" overstates the issue. All the caching agents within a coherence domain need to understand one common coherence granule that coherence protocol actions are in terms of (at least for virtually all commercial coherence protocols). Within that domain there may be caches with larger or smaller line sizes. For caches with smaller lines sizes, they still need to perform coherence protocol actions (request, snoops, etc.) in terms of the coherence granule size (e.g. give up two cache lines when a request or snoop for ownership is received). For caches with larger line sizes, they either have sectors equal in size to the coherence granule or they again must privately deal with the mismatch between their local cache line size and the size of coherence protocol actions. Put differently, a hart and its cache can locally perform CMO's of that cache's line size, but all that has to be locally and privately reconciled with all resulting global coherence protocol actions being in terms of the coherence granule size.

Where the problem can creep in is when code loops through a series of CMOs with an initial cache line size stride length, and then that code migrates to a hart with a smaller cache line size. But if CMOs are instead defined in terms of the coherence domain-wide coherence granule size, and software uses a stride length equal to that coherence granule size, then everything can work out alright. In particular, the hardware of any hart with a larger or smaller cache size must already understand at a hardware level the coherence granule size for the coherence domain it is participating in, and should perform CMOs effectively to that coherence granule size (or larger).

I expect the counter-argument to all this is that people want to have non-coherent hart caches that depend on software to manage coherency. Such as arises with ARM big.Little systems that have non-coherent instruction caches and potentially differing cache line sizes. But is that what this group is trying to cater to (especially since RISC-V starts off with a bias or expectation that hart instruction caches are hardware coherent)? Versus providing CMO's to handle data sharing between coherently-caching harts and other non-caching agents (e.g. DMA masters wanting to do non-coherent I/O to/from memory).

If the answer is the former, then one solution (albeit sub-optimal) could be for all software to assume the smallest cache line size in the system. (Or Derek's coming proposal probably has a better solution.) But is this type of system design, and with differing non-coherent hart cache line sizes, the tail that's wagging the dog?

OK, I'll stop there - having stirred the pot enough.

Greg

Thanks for the thorough reply Greg! Yours are always worth the read.

the hardware of any hart with a larger or smaller cache size must already understand at a hardware level the coherence granule size for the coherence domain it is participating in, and should perform CMOs effectively to that coherence granule size

I worry that this places an additional complexity cost or configuration restriction on composing hardware blocks into a system, and that this is avoidable by exposing the range of cache lines / coherence granules in a system to software.

Sure, if the same IP block hard macro can be instantiated with same-or-different next-level cache line size, then add inputs to it.
Sure, if a chip can be hot-plugged to another chip with same-or-different cross-chip cache line size, then either assume it's always there, or discover and program it on the fly.

So yes, it is solvable in hardware, but it is hardware's (and hardware verification's) burden, and I'm brainstorming ways to shift that burden to software (RISC vs CISC again).

I'll loosely use ARMv8 terms for a contrived example, with the instruction DCZVA writing to a NAPOT memory block of size reported by DCZID. Greg, you suggest:

In a system with DCZID reporting 64-byte memory block size but running on hardware with underlying 32-byte cache line size, hardware executing a DCZVA instruction would need to zero two cache lines. Do-able, but making a single instruction span multiple cache lines isn't free.
In a system with DCZID reporting 32-byte memory block size but running on hardware with underlying 64-byte cache line size, hardware executing a DCZVA instruction would need to zero a sub-cache-line sector. Do-able, but tracking cache sectors isn't free.

I suggest that we augment DCZID to report both 32- and 64-byte memory block and cache line sizes in the system, and the DCZVA instruction is defined to operate NAPOT on at least 32 bytes and at most 64 bytes. All of these instruction sequences should be agnostic of whether they are operating on either the 32-byte or 64-byte cache lines.

Software wishing to zero 32 bytes would see that range as smaller than 64 bytes (the maximum cache line size), and execute store instructions:
```
STR XZR, [#0x00]
STR XZR, [#0x08]
STR XZR, [#0x10]
STR XZR, [#0x18]
```
Software wishing to zero 64 bytes would see that range as evenly divisible by 64 bytes (the maximum cache line size), and execute DCZVA instructions at 32-byte granularity (the smallest cache line size). This is relying on hardware with underlying 64-byte cache line size gathering the back-to-back DCZVA instructions, which isn't free, but is easier than spanning cache lines or tracking sectors.
```
DCZVA [#0x00]
DCZVA [#0x20]  ; this is redundant on 64-byte cache lines and should gather combine with the previous instruction
```
Software wishing to zero 96 bytes would execute DCZVA instructions at 32-byte granularity (the smallest cache line size) for the first 64 bytes (the maximum cache line size), then execute stores for the remainder.
```
DCZVA [#0x00]
DCZVA [#0x20]
STR XZR, [#0x40]
STR XZR, [#0x48]
STR XZR, [#0x50]
STR XZR, [#0x58]
```

I wonder if this portable software (tolerant of multiple cache line sizes) would have worked as a software work-around for the specific case that @brucehoult encountered!

Aside: you mentioned I-Cache software coherency, but I think Derek and the J-Extension are leading that, so this riscv-CMOs group is focusing on CMOs to handle data sharing, per https://github.com/riscv/riscv-CMOs/wiki/CMOs-WG-Draft-Proposed-Charter.

riscv / riscv-CMOs

Don't assume cache line sizes are fixed during program execution #8