Caching for MMU translations

Here is the proposed MMU caching scheme, as also discussed in https://github.com/sysprog21/semu/issues/26

This is likely not final yet and needs to be discussed. As mentioned, the savings at -O3 are not great; YMMV. If you add artificial test delays to RAM accesses, you should be able to see that the cache does what it should.

Given that the MMU logic is one of the most involved parts of the code, if not the most involved, I still have a couple of questions and doubts about my own code:

- There is a separate patch that adds a cycle-count limit to the emulator for performance testing.
- I have written an ad hoc dump of the emulator RAM at exit (not in this PR) and did a bit-for-bit comparison. It looks like the cache does not change anything in terms of how the current Linux image boots (first 300 Mcycles of the current kernel plus a basic buildroot up to the login prompt, on a semu with 16 MiB of RAM).
- I have noticed that the RISC-V MMU spec allows for an even simpler MMU implementation in the single-core case, one that does not update any access bits upon lookup. More specifically, I have noticed that if I comment out this part of riscv.c:
  ```
  if (new_pte != pte)
      *pte_ref = new_pte;
  ```
  then nothing changes in the runtime behaviour of the emulator (again, I did a bit-for-bit check of the output RAM at the end of ~300 Mcycles). What gives? Is that to be expected? Is that simply what Linux assumes when running on a single RV32 core? If so, should this write-back be conditionally #ifdef-ed out to optimize for single-core use of the emulator? I must admit that I do not yet have a good overview of how it all fits together with the exact Linux MMU accesses; I am really a bit out of my depth here at the moment :D
- Likewise, I added a couple of MMU cache invalidations wherever I saw a potential need for them. I assumed that they are necessary whenever the execution mode changes between user and supervisor mode (so as not to expose kernel translations to user space, for example). Is my set of invalidations sufficient? Is it too much? Is it correct at all?

(As an aside: having the emulator exported to Python or another scripting language would make all the high-level pieces, such as command-line parsing and test jigs (e.g. a regression test doing bit-for-bit RAM checks against a 'golden RAM dump'), so much simpler ...)
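
To make the invalidation question concrete, here is a minimal sketch in C of a one-entry-per-access-type translation cache together with the places where it would be flushed. This is not the code from this PR: `vm_sketch_t`, `mmu_cache_t`, `translate`, `slow_walk`, `set_mode`, `write_satp` and `sfence_vma` are all hypothetical names, `slow_walk` is a stand-in for the real page-table walk, and the sketch assumes physical addresses fit in 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK ((1u << PAGE_SHIFT) - 1)

/* Hypothetical cache entry: one cached virtual-to-physical page mapping. */
typedef struct {
    bool valid;
    uint32_t vpn; /* virtual page number (vaddr >> 12) */
    uint32_t ppn; /* physical page number (paddr >> 12) */
} mmu_cache_t;

/* Hypothetical stand-in for the emulator state; field names are assumptions. */
typedef struct {
    bool s_mode; /* true = supervisor, false = user */
    uint32_t satp;
    mmu_cache_t cache_fetch, cache_load, cache_store;
    unsigned walks; /* number of slow-path walks, for demonstration only */
} vm_sketch_t;

/* Drop every cached translation. */
static void mmu_cache_invalidate(vm_sketch_t *vm)
{
    vm->cache_fetch.valid = false;
    vm->cache_load.valid = false;
    vm->cache_store.valid = false;
}

/* Placeholder for the real page-table walk: maps vaddr to vaddr + 2 GiB. */
static uint32_t slow_walk(vm_sketch_t *vm, uint32_t vaddr)
{
    vm->walks++;
    return vaddr + 0x80000000u;
}

/* Translate through the given cache, falling back to the walk on a miss. */
static uint32_t translate(vm_sketch_t *vm, mmu_cache_t *c, uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (c->valid && c->vpn == vpn)
        return (c->ppn << PAGE_SHIFT) | (vaddr & PAGE_MASK);

    uint32_t paddr = slow_walk(vm, vaddr);
    c->valid = true;
    c->vpn = vpn;
    c->ppn = paddr >> PAGE_SHIFT;
    return paddr;
}

/* The entries carry no permission bits, so flush on a privilege change. */
static void set_mode(vm_sketch_t *vm, bool s_mode)
{
    if (vm->s_mode != s_mode) {
        vm->s_mode = s_mode;
        mmu_cache_invalidate(vm);
    }
}

/* A new page-table root makes every cached entry stale. */
static void write_satp(vm_sketch_t *vm, uint32_t value)
{
    vm->satp = value;
    mmu_cache_invalidate(vm);
}

/* SFENCE.VMA is the guest's explicit request to drop cached translations. */
static void sfence_vma(vm_sketch_t *vm)
{
    mmu_cache_invalidate(vm);
}
```

Flushing on SFENCE.VMA is required in any case, since that is how the guest announces page-table changes. Flushing on every satp write and every mode switch is the conservative choice; an alternative would be to tag each entry with the privilege mode it was filled in. Note that the SUM and MXR bits in sstatus also influence permission checks, so a cache that skips those checks on a hit may need to be flushed when they change as well.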

> I have noticed that the RISC-V MMU spec allows for an even simpler MMU implementation in the single-core case, one that does not update any access bits upon lookup. More specifically, I have noticed that if I comment out this part of riscv.c:
> ```
> if (new_pte != pte)
>     *pte_ref = new_pte;
> ```
> then nothing changes in the runtime behaviour of the emulator (again, I did a bit-for-bit check of the output RAM at the end of ~300 Mcycles). What gives? Is that to be expected? Is that simply what Linux assumes when running on a single RV32 core? If so, should this write-back be conditionally #ifdef-ed out to optimize for single-core use of the emulator? I must admit that I do not yet have a good overview of how it all fits together with the exact Linux MMU accesses; I am really a bit out of my depth here at the moment.

Given the current, heavily simplified MMU implementation, the page table walker might not even reach such cases. I could use tools like stress-ng to generate memory-stressing workloads and find out whether the write-back is necessary. For now, we should consider leaving some FIXME/TODO comments in the source files. Would you agree?
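
As background for such a FIXME/TODO: the RISC-V privileged spec permits two behaviours when an access hits a PTE whose A bit is clear (or whose D bit is clear on a store). The implementation may set the bit(s) in the PTE, or it may raise a page fault and leave the update to software. If the guest already sets A and D in the PTEs it installs, the write-back would never trigger, which is one possible explanation for the identical RAM dumps; this has not been verified here. The sketch below shows both options behind a build-time switch. `pte_touch` and `SKETCH_UPDATE_AD_BITS` are hypothetical names, and the code is not taken from riscv.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* PTE flag bits as defined by the RISC-V privileged spec. */
#define PTE_V (1u << 0) /* valid */
#define PTE_R (1u << 1) /* readable */
#define PTE_W (1u << 2) /* writable */
#define PTE_A (1u << 6) /* accessed */
#define PTE_D (1u << 7) /* dirty */

/* Hypothetical build-time switch: 1 = update A/D bits, 0 = fault instead. */
#ifndef SKETCH_UPDATE_AD_BITS
#define SKETCH_UPDATE_AD_BITS 1
#endif

/*
 * Called on a leaf PTE after the permission checks have passed.
 * Returns true if the access may proceed, false if the caller
 * should raise a page fault.
 */
static bool pte_touch(uint32_t *pte_ref, bool is_store)
{
    uint32_t pte = *pte_ref;
    uint32_t new_pte = pte | PTE_A | (is_store ? PTE_D : 0);

#if SKETCH_UPDATE_AD_BITS
    /* Option 1: write the A/D bits back, only if something changed. */
    if (new_pte != pte)
        *pte_ref = new_pte;
    return true;
#else
    /* Option 2: never write the PTE; fault if A (or D on a store) is clear.
     * FIXME: confirm with a memory-stressing workload that the guest
     * handles these faults by setting the bits itself. */
    return new_pte == pte;
#endif
}
```

Simply skipping the write-back without faulting, which is what commenting out the two lines does, is only safe if the guest never relies on the A/D bits being set for it. The faulting variant keeps the fast path free of PTE writes while staying within what the spec allows.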
