jart / blink

tiniest x86-64-linux emulator
ISC License

Lots of time spent in memory subsystem #153

Open jonashaag opened 1 year ago

jonashaag commented 1 year ago

Not sure if this is expected, and if anyone is interested in optimizing this.

I have a real world workload that spends a lot of time in the memory subsystem. macOS Instruments profile:

[Screenshot 2023-09-04 at 09.23.20: macOS Instruments profile]

Unfortunately I can't share the workload itself, but I can do more profiling or try patching some stuff. I've already figured out that caching some of the machine-related checks (`if (m->foobar ...)`) speeds things up by 10%.

jart commented 1 year ago

Are you using the linear memory optimization? It should be enabled on most platforms by default, unless you're disabling it by passing the '-m' flag. If I'm running on Linux x86-64 then I have a near certain chance of getting fast memory. Profiling will look like this:

[image: profile with the linear memory optimization enabled]

But if I pass the -m flag to blink to disable the linear memory optimization, then profiling will look the way yours does:

[image: profile with -m, i.e. the linear memory optimization disabled]

jonashaag commented 1 year ago

Sorry, I should have specified the exact command. Yes, I'm using -m because otherwise the program won't work.

jart commented 1 year ago

Have you read these sections of the readme?

The reason -m is costly is that it does full memory virtualization. It has to indirect memory accesses through a translation lookaside buffer (TLB) and a four-level radix trie. It's about as optimized as it can be.
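
To make that cost concrete, here is a minimal sketch of the general technique: a direct-mapped TLB in front of a four-level radix trie. It is not Blink's actual code. Every name in it (`struct Mmu`, `MapPage`, `Translate`, `TLB_SIZE`) is hypothetical, and the 9-bit levels with 4 KiB pages are an assumption borrowed from x86-64 paging.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS  12 /* 4 KiB pages (assumption) */
#define LEVEL_BITS 9  /* 512 slots per trie node (assumption) */
#define LEVELS     4  /* 4 * 9 + 12 = 48-bit guest addresses */
#define TLB_SIZE   16 /* hypothetical size for this sketch */

struct Node {
  void *slot[1 << LEVEL_BITS]; /* child nodes, or host pages at level 0 */
};

struct Tlb {
  uint64_t page; /* guest page number */
  uint8_t *host; /* cached host page, NULL if the entry is empty */
};

struct Mmu {
  struct Node *root;
  struct Tlb tlb[TLB_SIZE];
  unsigned long walks; /* counts slow-path trie walks */
};

/* Extracts the 9-bit trie index that `vaddr` uses at `level`. */
static unsigned Index(uint64_t vaddr, int level) {
  assert(level >= 0 && level < LEVELS);
  return (vaddr >> (PAGE_BITS + LEVEL_BITS * level)) & ((1u << LEVEL_BITS) - 1);
}

/* Allocates an empty MMU, or returns NULL if out of memory. */
struct Mmu *NewMmu(void) {
  struct Mmu *mmu = calloc(1, sizeof(*mmu));
  if (mmu && !(mmu->root = calloc(1, sizeof(struct Node)))) {
    free(mmu);
    mmu = NULL;
  }
  return mmu;
}

/* Backs the guest page containing `vaddr` with a zeroed host page.
   Returns 0 on success, -1 if out of memory. Nothing is ever freed;
   this is a sketch. */
int MapPage(struct Mmu *mmu, uint64_t vaddr) {
  struct Node *node = mmu->root;
  for (int level = LEVELS - 1; level > 0; --level) {
    void **slot = &node->slot[Index(vaddr, level)];
    if (!*slot && !(*slot = calloc(1, sizeof(struct Node)))) return -1;
    node = *slot;
  }
  void **leaf = &node->slot[Index(vaddr, 0)];
  if (!*leaf && !(*leaf = calloc(1, (size_t)1 << PAGE_BITS))) return -1;
  return 0;
}

/* Translates a guest address to a host pointer, or NULL if unmapped. */
uint8_t *Translate(struct Mmu *mmu, uint64_t vaddr) {
  uint64_t page = vaddr >> PAGE_BITS;
  uint64_t offset = vaddr & (((uint64_t)1 << PAGE_BITS) - 1);
  struct Tlb *entry = &mmu->tlb[page % TLB_SIZE];
  if (entry->host && entry->page == page) {
    return entry->host + offset; /* fast path: TLB hit */
  }
  mmu->walks++; /* slow path: up to four dependent pointer loads */
  struct Node *node = mmu->root;
  for (int level = LEVELS - 1; level > 0; --level) {
    node = node->slot[Index(vaddr, level)];
    if (!node) return NULL;
  }
  uint8_t *host = node->slot[Index(vaddr, 0)];
  if (!host) return NULL;
  entry->page = page; /* remember the translation for next time */
  entry->host = host;
  return host + offset;
}
```

Even on a TLB hit, each guest memory access pays for an index computation, a compare, and a branch before the real load or store happens, which is the kind of overhead that shows up in a profile like the one above.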

The best bet for you would probably be to find some way to get the linear memory optimization working for you. For example, we could find some other formula for mapping guest addresses onto host addresses.

  1. Are you using Apple Silicon? Because if so, Apple doesn't let us mmap() addresses below 4 GB.
  2. Is your program a static binary? Which fixed addresses was it compiled to use?

jart commented 1 year ago

You're also invited to join our Discord https://discord.gg/Hb4QHYj2