@kaashoek Hi, I think the bug reported by #5 is still a problem. Let me explain it in detail.
This bug is related to instruction cache (ICache) in hardware.
Process A loads its code into a new physical page P0 in xv6-riscv/kernel/exec.c. Assume P0 is not in the ICache before the code is loaded. Note that from the hardware's perspective, loading code is just like a call to memcpy(): it involves only ordinary load and store instructions, so it does not bring the code into the ICache. On a processor with a data cache (DCache), the newly written code may even sit only in the DCache.
A executes its code. Since P0 is not in the ICache, this causes an ICache miss, and the ICache fetches the correct code from the DCache or from memory.
A exits, and P0 is reclaimed.
Now a new process B is loaded, and B gets exactly the same physical page P0 for its code. Remember that, as before, the loaded code is not in the ICache after loading.
B executes its code. Now disaster strikes: when B accesses the ICache, it may get a hit, because A's code is still in the ICache! This causes B to execute the wrong code!
The key to avoiding this disaster is to update the ICache every time new code is loaded. The update can be performed either in hardware or in software.
In the hardware method, the hardware examines the address of every store instruction to see whether the ICache holds the same cache block; if so, the ICache invalidates that block. This guarantees the ICache will not hit on stale code the next time it accesses that address. x86 works this way.
In the software method, we execute special instructions to explicitly update the state of the ICache. The RISC-V manual calls this operation synchronization. RISC-V has three instructions of this kind.
SFENCE.VMA. This instruction updates the state of hardware components related to virtual memory, such as the TLB. Note that SFENCE.VMA is NOT guaranteed to update the ICache; that depends on the hardware implementation.
If the ICache is implemented as PIPT (physically indexed, physically tagged), it has nothing to do with virtual memory and need not be updated by SFENCE.VMA.
Even if the ICache is implemented as VIPT (virtually indexed, physically tagged), it still need not be updated by SFENCE.VMA, because the index field of an address stays the same across address translation.
But if the ICache is implemented as VIVT (virtually indexed, virtually tagged), it must be updated by SFENCE.VMA, because the tag comes from the virtual address, and the virtual-to-physical mapping may change after SFENCE.VMA.
FENCE. This instruction orders the visibility of store instructions that precede it: every load instruction on another CPU issued after the FENCE must see the results of stores before the FENCE. Note that FENCE is defined to guarantee visibility only to load instructions, NOT to instruction fetches. Therefore executing FENCE is not guaranteed to update the ICache.
FENCE.I (different from FENCE). According to the RISC-V manual,

> Currently, this instruction is the only standard mechanism to ensure that stores visible to a hart will also be visible to its instruction fetches.
Therefore, the fix for the bug above is FENCE.I. The discussion of memory barriers at the end of section 9.3 of the textbook is about FENCE, so it does not address this bug: memory barriers concern ordinary load instructions, not instruction fetches.
We ran into this bug while trying to run xv6 on a simple in-order RISC-V processor designed by undergraduates. The processor contains several buffers and caches. The table below shows how the three instructions discussed above affect them (F = flush, K = keep).
|            | ICache | DCache | TLB | BTB |
|------------|--------|--------|-----|-----|
| SFENCE.VMA | K      | K      | F   | F   |
| FENCE      | K      | K      | K   | K   |
| FENCE.I    | F      | K      | K   | F   |
Since the TLB and BTB are indexed by virtual address, they must be flushed by SFENCE.VMA. The ICache and DCache are simply implemented as PIPT, so SFENCE.VMA does not flush them. FENCE.I flushes the ICache and BTB, since both relate to instructions. Because the processor is in-order, the behavior required by FENCE is naturally satisfied, so FENCE can be implemented as a nop that flushes nothing. This processor successfully boots Linux and Debian, but fails to run xv6 unless patch #5 is applied to fix this bug.
The bug is not exposed in QEMU, because in QEMU all buffers and caches related to instructions are virtually indexed and are flushed by SFENCE.VMA. xv6 always executes SFENCE.VMA on a context switch, and at that point QEMU's code cache (a key component of its JIT) is also flushed. Compared to our simple RISC-V processor, the main difference is that the processor has a PIPT ICache, which is affected by neither SFENCE.VMA nor FENCE.

Further discussion is welcome. :)
I'm trying to run xv6 on a T-Head C906, which has a VIPT ICache and DCache. It seems the changes introduced in #5 are not enough to boot xv6: the UART outputs init: star and then just hangs.