sysprog21 / semu

A minimalist RISC-V system emulator capable of running Linux kernel
MIT License

Route MMU through RAM access methods, caching? #26

Open onnokort opened 1 year ago

onnokort commented 1 year ago

Hi,

I just ported your great little emu to a very unlikely architecture, and whilst doing so I had to tweak the MMU code to go through the RAM access methods as well. I think the code would be a lot more portable if the MMU's RAM accesses went through the corresponding methods/functions too, so that RAM access can really be anything. The current way of having a pointer into emu RAM is a bit of a breach of abstraction IMO.
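
To illustrate the proposal, here is a minimal sketch of an Sv32 page walk that reads page-table entries only through a callback instead of a raw pointer into RAM. This is not semu's code: `bus_t`, `array_load32` and `sv32_translate` are hypothetical names, physical addresses are assumed to fit in 32 bits, and permission checks and A/D-bit handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bus interface: the MMU never sees a RAM pointer. */
typedef struct {
    void *ctx;
    uint32_t (*load32)(void *ctx, uint32_t paddr);
} bus_t;

#define PTE_V 0x1u
#define PTE_R 0x2u
#define PTE_W 0x4u
#define PTE_X 0x8u

/* One possible backend: a plain word array. Any other storage
 * (banked RAM, a file, a remote device) can plug in the same way. */
uint32_t array_load32(void *ctx, uint32_t paddr)
{
    return ((const uint32_t *) ctx)[paddr / 4];
}

/* Two-level Sv32 walk. root_ppn is satp[21:0].
 * Returns true and sets *paddr on success, false on a page fault.
 * Assumption: physical addresses fit in 32 bits; no permission checks. */
bool sv32_translate(const bus_t *bus, uint32_t root_ppn, uint32_t vaddr,
                    uint32_t *paddr)
{
    assert(bus && bus->load32 && paddr);
    uint32_t table = root_ppn << 12;
    for (int level = 1; level >= 0; level--) {
        uint32_t vpn = (vaddr >> (12 + 10 * level)) & 0x3ffu;
        uint32_t pte = bus->load32(bus->ctx, table + vpn * 4);
        /* Invalid entry, or the reserved W-without-R encoding. */
        if (!(pte & PTE_V) || (!(pte & PTE_R) && (pte & PTE_W)))
            return false;
        if (pte & (PTE_R | PTE_X)) { /* leaf entry */
            uint32_t ppn = pte >> 10;
            if (level == 1) { /* 4 MiB superpage */
                if (ppn & 0x3ffu)
                    return false; /* misaligned superpage */
                *paddr = (ppn << 12) | (vaddr & 0x3fffffu);
            } else {
                *paddr = (ppn << 12) | (vaddr & 0xfffu);
            }
            return true;
        }
        table = (pte >> 10) << 12; /* descend to the next level */
    }
    return false; /* non-leaf entry at the last level */
}
```

Swapping `array_load32` for another function is the only change a port with unusual memory would need on the read side; a matching store callback would be needed once A/D bits are updated by the walk.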

I further noticed that every CPU instruction fetch goes through the MMU page walk and a very simple "1-entry-1-page" cache (one for INSN fetch, load & store) speeds up the emulation considerably. For the standard PC build, I got from about 10s to boot the emulated Linux to about 7s just by caching the current page for mmu_fetch.
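
A minimal sketch of what such a "1-entry-1-page" cache could look like, assuming 4 KiB pages. It is not the code from the port: `page_cache_t`, `cached_translate` and `demo_walk` are hypothetical names, and `demo_walk` is only a stand-in for the real page walk that counts how often it runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signature assumed for the slow path (the full page walk). */
typedef bool (*translate_fn)(void *ctx, uint32_t vaddr, uint32_t *paddr);

/* One cached page; use one instance each for fetch, load and store. */
typedef struct {
    bool valid;
    uint32_t vpn;       /* vaddr >> 12 of the cached page */
    uint32_t page_base; /* physical base address of that page */
} page_cache_t;

bool cached_translate(page_cache_t *c, translate_fn slow, void *ctx,
                      uint32_t vaddr, uint32_t *paddr)
{
    assert(c && slow && paddr);
    uint32_t vpn = vaddr >> 12;
    if (c->valid && c->vpn == vpn) { /* hit: skip the page walk */
        *paddr = c->page_base | (vaddr & 0xfffu);
        return true;
    }
    uint32_t pa;
    if (!slow(ctx, vaddr, &pa)) /* miss: do the full walk */
        return false;           /* fault: leave the cache untouched */
    c->valid = true;
    c->vpn = vpn;
    c->page_base = pa & ~0xfffu;
    *paddr = pa;
    return true;
}

void page_cache_invalidate(page_cache_t *c)
{
    c->valid = false;
}

/* Stand-in for the real walk: offsets every address by 0x80000000
 * and counts its own invocations through ctx. */
bool demo_walk(void *ctx, uint32_t vaddr, uint32_t *paddr)
{
    ++*(unsigned *) ctx;
    *paddr = vaddr + 0x80000000u;
    return true;
}
```

Sequential instruction fetches stay within one page most of the time, which is why a single entry can already remove most page walks on the fetch path.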

As my code to do so is quite peculiar and adapted to the odd target architecture, I don't think it really fits well as an upstream patch. Still, I like to raise the issue of the MMU being direct-access here.

jserv commented 1 year ago

@RinHizakura, Could you provide a comment on the MMU issue mentioned above, drawing from your past experience in developing the riscv-emulator?

RinHizakura commented 1 year ago

As for the instruction cache, it can really help system emulators a lot. In my experience: I have icache.c in my RISC-V emulator, and it improves performance even though the implementation is super naive, as you can see. riscv-rust has a better page cache implementation if we want one for reference.

In conclusion, the instruction cache would be an important component for optimization, but it is also worth mentioning that when to invalidate the cache in a system emulator must be considered carefully for correctness.
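
As a rough sketch of the invalidation concern: cached translations have to be dropped at least when the guest executes SFENCE.VMA, and an emulator without ASID tagging can conservatively drop them on every satp write as well. The names below (`hart_t`, `mmu_cache_flush`, `csr_write_satp`, `op_sfence_vma`) are hypothetical and do not mirror any particular emulator. A cache that stores only the address mapping, without permissions, would also need flushing or tagging on privilege-mode changes, which is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool valid;
    uint32_t vpn, page_base;
} page_cache_t;

/* Hypothetical hart state with one cached page per access type. */
typedef struct {
    uint32_t satp;
    page_cache_t fetch, load, store;
} hart_t;

/* Drop every cached translation. */
void mmu_cache_flush(hart_t *h)
{
    assert(h);
    h->fetch.valid = false;
    h->load.valid = false;
    h->store.valid = false;
}

/* Conservative choice: a new satp may select a different page table,
 * so nothing cached under the old value is trusted. */
void csr_write_satp(hart_t *h, uint32_t value)
{
    h->satp = value;
    mmu_cache_flush(h);
}

/* SFENCE.VMA: the guest has edited page tables and asks for the
 * changes to become visible. Flushing everything ignores the optional
 * address/ASID operands, which is always safe but coarse. */
void op_sfence_vma(hart_t *h)
{
    mmu_cache_flush(h);
}
```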

jserv commented 1 year ago

> I further noticed that every CPU instruction fetch goes through the MMU page walk and a very simple "1-entry-1-page" cache (one for INSN fetch, load & store) speeds up the emulation considerably. For the standard PC build, I got from about 10s to boot the emulated Linux to about 7s just by caching the current page for mmu_fetch.
>
> As my code to do so is quite peculiar and adapted to the odd target architecture, I don't think it really fits well as an upstream patch. Still, I like to raise the issue of the MMU being direct-access here.

I appreciate the outstanding C64 porting work carried out by @onnokort! Introducing a cache for MMU manipulation is a logical step towards enhancing performance.

Furthermore, uc-rv32ima has reported the following:

> To improve the performance, a simple cache is implemented, it turns out we achieved 95.1% cache hit during linux booting.

The cache implementation in uc-rv32ima is both straightforward and effective. We may consider employing a similar technique in this project.

onnokort commented 1 year ago

I have already implemented a proposal for a very simple n-entry, ring-buffer-style LRU MMU cache. Let me just clean up the code and make a PR. With -O3 on an x86_64 host, though, the savings end up being just a couple of percent (at three levels deep, which seems to be the optimum for this kind of cache). I think the question of whether an MMU cache is a good idea is mainly one of where you want to go with this project: if you think MMU accesses can always be implemented as they are now, with direct RAM pointers, then I think the savings are not so great. (I haven't tested the new MMU LRU cache code on the 6502 port yet; I expect it to save a lot more there, though.)
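
For readers wondering what an n-entry ring-buffer cache might look like, here is a minimal sketch. It is not the code from the planned PR: all names are hypothetical, "three levels deep" is assumed to mean three entries, and replacement simply overwrites the oldest slot, which is FIFO order and only approximates LRU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MMU_CACHE_ENTRIES 3 /* assumption: the depth reported as optimal */

typedef struct {
    bool valid;
    uint32_t vpn; /* virtual page number, vaddr >> 12 */
    uint32_t ppn; /* physical page number, paddr >> 12 */
} mmu_entry_t;

typedef struct {
    mmu_entry_t e[MMU_CACHE_ENTRIES];
    unsigned next; /* slot that the next insert overwrites */
} mmu_ring_t;

/* Linear search; cheap for a handful of entries. */
bool mmu_ring_lookup(const mmu_ring_t *r, uint32_t vaddr, uint32_t *paddr)
{
    assert(r && paddr);
    uint32_t vpn = vaddr >> 12;
    for (unsigned i = 0; i < MMU_CACHE_ENTRIES; i++) {
        if (r->e[i].valid && r->e[i].vpn == vpn) {
            *paddr = (r->e[i].ppn << 12) | (vaddr & 0xfffu);
            return true;
        }
    }
    return false;
}

/* Record a translation after a successful page walk, replacing the
 * oldest entry (ring-buffer order). */
void mmu_ring_insert(mmu_ring_t *r, uint32_t vaddr, uint32_t paddr)
{
    assert(r);
    r->e[r->next] = (mmu_entry_t){ true, vaddr >> 12, paddr >> 12 };
    r->next = (r->next + 1) % MMU_CACHE_ENTRIES;
}
```

On a miss the caller would run the full page walk and then call `mmu_ring_insert`; clearing the `valid` flags is enough to invalidate the whole cache.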

But besides the fun C64 port, I actually had more serious (but very tentative) ideas along the lines of having it all exported to a Python library that would allow things like fuzzing network applications with a full (RISCV) Linux underneath, doing quick emulator state checkpoints, and having some kind of 'differential RAM' (I don't know whether a more specific term exists for what I have in mind) where the changes from a checkpointed state are expressed in some kind of persistent data structure. Another use would be making instruction and I/O traces to build test jigs that check for identical, deterministic code behavior, etc.

On the other hand, I am for example personally not so interested in multi-core scenarios but saw that you already prepared parts of the code for that.

But an integral part of any of the aforementioned ideas would be a very simple RISCV CPU object which on top has the typical "step()" or "run(..)" etc., and on the underside requires just a load(addr) and a store(addr, value) pair of methods to interface with the outside world, with all MMU accesses going transparently through these calls. I would also like that to be modular enough to be able to either configure an MMU or RAM cache (or not!) during runtime or at least with a couple #defines in code. Maybe modularizing the code so that all the MMU functions go into their own exchangeable module is already most of what would be needed. Does that make sense?
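
A toy sketch of the shape described above: a CPU object whose only link to the outside world is a load/store callback pair. Everything here is hypothetical (`mem_ops_t`, `cpu_t`, `cpu_step`, the array backend), there is no MMU, and only two RV32I instructions (ADDI and SW) are decoded, just enough to show a fetch and a store going through the callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The CPU's whole view of memory and devices. */
typedef struct {
    void *ctx;
    uint32_t (*load)(void *ctx, uint32_t addr);
    void (*store)(void *ctx, uint32_t addr, uint32_t value);
} mem_ops_t;

typedef struct {
    uint32_t pc;
    uint32_t x[32];
    mem_ops_t mem;
} cpu_t;

/* Sign-extend a 12-bit immediate using unsigned arithmetic only. */
static uint32_t sext12(uint32_t v)
{
    return (v ^ 0x800u) - 0x800u;
}

/* Execute one instruction. Returns false, leaving pc unchanged,
 * for anything other than ADDI or SW. */
bool cpu_step(cpu_t *cpu)
{
    assert(cpu && cpu->mem.load && cpu->mem.store);
    uint32_t insn = cpu->mem.load(cpu->mem.ctx, cpu->pc); /* fetch */
    uint32_t opcode = insn & 0x7fu, funct3 = (insn >> 12) & 7u;
    uint32_t rd = (insn >> 7) & 31u;
    uint32_t rs1 = (insn >> 15) & 31u, rs2 = (insn >> 20) & 31u;

    if (opcode == 0x13u && funct3 == 0u) {        /* ADDI */
        if (rd != 0) /* x0 stays hard-wired to zero */
            cpu->x[rd] = cpu->x[rs1] + sext12(insn >> 20);
    } else if (opcode == 0x23u && funct3 == 2u) { /* SW */
        uint32_t imm = sext12(((insn >> 25) << 5) | ((insn >> 7) & 31u));
        cpu->mem.store(cpu->mem.ctx, cpu->x[rs1] + imm, cpu->x[rs2]);
    } else {
        return false;
    }
    cpu->pc += 4;
    return true;
}

/* One possible backend: a flat word array. */
uint32_t ram_load(void *ctx, uint32_t addr)
{
    return ((uint32_t *) ctx)[addr / 4];
}

void ram_store(void *ctx, uint32_t addr, uint32_t value)
{
    ((uint32_t *) ctx)[addr / 4] = value;
}
```

With this split, a caching layer, a tracing layer or a checkpointing layer can be another `mem_ops_t` that wraps the real one, and it could be selected at runtime or behind a `#define` without touching `cpu_step`.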

What I really like about this project is the minimalism and simplicity, I like simple and being able to understand stuff but still being able to deal with applications of any complexity on the guest side. As far as I looked, this is the simplest emulator that can still run a full Linux with virtual memory.

So I guess it all really depends on what this should become.

jserv commented 1 year ago

> I would also like that to be modular enough to be able to either configure an MMU or RAM cache (or not!) during runtime or at least with a couple #defines in code. Maybe modularizing the code so that all the MMU functions go into their own exchangeable module is already most of what would be needed. Does that make sense?

Certainly, modularization and refactoring efforts are greatly appreciated. Feel free to submit pull requests for these changes.

> What I really like about this project is the minimalism and simplicity, I like simple and being able to understand stuff but still being able to deal with applications of any complexity on the guest side. As far as I looked, this is the simplest emulator that can still run a full Linux with virtual memory.

As a university faculty member teaching RISC-V and system programming, I was searching for pre-existing implementations that could facilitate the execution of (almost) unmodified Linux images with an active MMU. Unfortunately, I couldn't find an exact match for my requirements. As a result, my students and I have taken on this project. While it might not have the level of polish seen in professional solutions, it is reasonably functional and serves its purpose effectively.

Once again, contributions are always welcome. I believe that maintaining a minimalist approach, supported by effective code review, will be beneficial for adaptive development.

jserv commented 2 months ago

arv32-opt is a port of mini-rv32ima to the ATmega328P (the core of the Arduino UNO, an 8-bit AVR microcontroller). So basically, this code is for booting Linux on an Arduino UNO. It has three 512-byte caches (one icache and two interchangeable dcaches) and a lazy/delayed cache write system.