seL4 / sel4bench

sel4 benchmarking applications and support library.

VM fault fastpath new benchmarks #18

Open alwin-joshy opened 2 years ago

alwin-joshy commented 2 years ago

New benchmarks for testing VM fault performance: some of the tests from seL4test were converted into benchmarks (this can later be extended to other kinds of benchmarks), and a new round-trip + mapping benchmark was added (AARCH64 only).

Three new benchmarks were added. The first two measure either direction of the fault path, and the third is a memory-management scenario intended to show the improvement for a real use case.

Both of the first two benchmarks are adapted from seL4test. They work as follows: the faulter performs an invalid memory access, which causes a VM fault. The fault handler then updates the faulting thread's register set so that, upon reply, the thread restarts at a different instruction (rather than reattempting the same faulting instruction) and proceeds normally.
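
For illustration, a minimal sketch of what such a handler loop might look like (not the exact code in this PR): `fault_ep`, `faulter_tcb` and `restart_addr` are placeholders, the non-MCS signatures of `seL4_Recv`/`seL4_ReplyRecv` are shown (the MCS API additionally takes a reply object), and `regs.pc` is the AArch64 field name.

```c
#include <assert.h>
#include <sel4/sel4.h>

/* fault_ep, faulter_tcb and restart_addr are placeholders provided by the
 * benchmark setup code. */
void fault_handler(seL4_CPtr fault_ep, seL4_CPtr faulter_tcb,
                   seL4_Word restart_addr)
{
    seL4_Word badge;
    seL4_MessageInfo_t info = seL4_Recv(fault_ep, &badge);

    while (1) {
        /* The kernel delivers the fault as an IPC; check it is a VM fault. */
        assert(seL4_MessageInfo_get_label(info) == seL4_Fault_VMFault);

        /* Redirect the faulter to a different instruction instead of letting
         * it retry the faulting access. */
        seL4_UserContext regs = {0};
        seL4_TCB_ReadRegisters(faulter_tcb, 0, 0,
                               sizeof(regs) / sizeof(seL4_Word), &regs);
        regs.pc = restart_addr;          /* 'pc' is the AArch64 field name */
        seL4_TCB_WriteRegisters(faulter_tcb, 0, 0,
                                sizeof(regs) / sizeof(seL4_Word), &regs);

        /* Replying to the fault resumes the faulter at the new pc. */
        info = seL4_ReplyRecv(fault_ep, seL4_MessageInfo_new(0, 0, 0, 0),
                              &badge);
    }
}
```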

These benchmarks are all set up to satisfy the conditions required for the VM fault fastpath to be taken, such as the fault handler being a passive thread, the faulter having a lower priority than the handler, and so on.
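
Roughly, that setup might look like the following on an MCS configuration (cap names and priority values are placeholders, not the PR's actual code):

```c
/* handler_tcb, faulter_tcb, handler_sc and auth_tcb are placeholder caps. */
seL4_TCB_SetPriority(faulter_tcb, auth_tcb, 100);
seL4_TCB_SetPriority(handler_tcb, auth_tcb, 101);  /* handler strictly higher */

/* Once the handler has signalled that it is initialised and is blocked on the
 * fault endpoint, removing its scheduling context makes it a passive thread:
 * it then runs on the faulter's scheduling context when a fault is delivered. */
seL4_SchedContext_Unbind(handler_sc);
```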

lsf37 commented 2 years ago

Something seems funky with the commits, the PR contains some that are already in master. Can you try rebasing your branch over master?

alwin-joshy commented 2 years ago

Something seems funky with the commits, the PR contains some that are already in master. Can you try rebasing your branch over master?

Think that's fixed now, something weird happened when I was trying to fix an overly long commit message.

lsf37 commented 2 years ago

Think that's fixed now, something weird happened when I was trying to fix an overly long commit message.

Thanks, that is better. The commits now unfortunately all have two sign-off lines (both from you, but with different addresses). Please fix to one. Also, the commits should have a body message that explains what is going on and gives background/reasoning for the change. The style fixups to your own changes should be squashed into the commit that makes the changes, similarly any other minor fixups. As far as I can see, this PR should probably be one or two commits in total.

It's fine for reviewing for now, but should be fixed up before we can merge.

I'm not the right person to review the content on this one, so will need to leave this to someone else, but more and better benchmarks sounds like a good idea to me. Commented on minor style things inline.

lsf37 commented 2 years ago

The files with only empty lines added/removed should be a separate commit. Usually we don't do these unless we also touch the code, but I think these specific ones are not a problem. For those commits it's OK to leave out a body message and use a subject like `trivial: formatting`.

axel-h commented 2 years ago

Could you add a description for the scenario that this implements?

gernotheiser commented 2 years ago

On 10 Feb 2022, at 10:02, Gerwin Klein wrote:

Happy with the waiting as such, just wanted to check on the way it's done. @kent-mcleod might know more about the standard way this is done in sel4bench/sel4test.

The methodology presently employed by sel4bench is unsound: excessive writes lead to stalls that interfere with the operation being benchmarked. Adding a pause between iterations to allow the writes to drain is a workaround that is OK for now, but eventually the overall approach needs to be fixed.

alwin-joshy commented 2 years ago

Could you add a description for the scenario that this implements?

Edited the PR description to add more details. Is there anything else you would like me to include?

kent-mcleod commented 2 years ago

The methodology presently employed by sel4bench is unsound: excessive writes lead to stalls that interfere with the operation being benchmarked. Adding a pause between iterations to allow the writes to drain is a workaround that is OK for now, but eventually the overall approach needs to be fixed.

@gernotheiser can you be more specific? My understanding is that sel4bench has always used tight loops and lots of warm-up iterations to try to minimize cache misses during measurements. In this particular case it appears that most of the writes would come from the kernel writing TCB register context into TCB objects, rather than from overhead writes by the benchmark itself. In any case, pausing for a certain number of cycles would allow things like store buffers to clear, but the pause time should then at least become a controlled variable of the benchmark, perhaps varied to show the effect of stalls on latency?

gernotheiser commented 2 years ago

Can’t look at the code right now (very poor connection) but from memory a lot of data is written by the benchmarking rig to a buffer to collect statistics, which, on some processors, exceeds the memory bandwidth leading to stalls. This results in large standard deviations due to a bimodal distribution of execution times, whereas the kernel operations being measured should be highly deterministic; properly measured, the standard deviations should be a few cycles.

I don’t see a need to vary the pause amount, as it has nothing to do with what’s supposed to be measured; the stalls are a pure artefact of the measurement approach. A more appropriate methodology would just accumulate data in a few variables and compute the core statistics after the timed loops. I believe @malus-brandywine is working on properly fixing this issue.
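
For illustration, a minimal sketch of that accumulate-in-variables approach using Welford's online mean/variance; the names are placeholders and this is not the existing sel4bench API:

```c
#include <stdint.h>

/* Running statistics kept in a handful of variables instead of a
 * per-iteration results buffer (Welford's online algorithm). */
static uint64_t n;
static double mean, m2;

static inline void record_sample(uint64_t cycles)
{
    n++;
    double delta = (double)cycles - mean;
    mean += delta / (double)n;
    m2 += delta * ((double)cycles - mean);
}

/* After the timed loop: variance = m2 / (n - 1), stddev = sqrt(variance). */
```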

kent-mcleod commented 2 years ago

Can’t look at the code right now (very poor connection) but from memory a lot of data is written by the benchmarking rig to a buffer to collect statistics, which, on some processors, exceeds the memory bandwidth leading to stalls.

For this fault benchmark, the only statistic written out per iteration is the number of cycles taken for that iteration, as a 64-bit value. An additional 64-bit variable is used to hold the start time during the operation. I don't see how this would dominate the storage bandwidth when many more registers get saved during a context switch.
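
For context, the shape of the timed loop under discussion is roughly as follows; `sel4bench_init()`, `sel4bench_get_cycle_count()` and `sel4bench_destroy()` are the libsel4bench calls, while `N_RUNS`, `PAUSE_ITERS`, `results[]` and `do_fault_round_trip()` are placeholders rather than the PR's actual names:

```c
#include <sel4bench/sel4bench.h>

#define N_RUNS      1000   /* placeholder iteration count */
#define PAUSE_ITERS 1000   /* placeholder drain-pause length */

/* Placeholder sample buffer and fault-trigger routine. */
static ccnt_t results[N_RUNS];
extern void do_fault_round_trip(void);

static void run_fault_benchmark(void)
{
    sel4bench_init();                        /* set up the cycle counter */
    for (int i = 0; i < N_RUNS; i++) {
        ccnt_t start = sel4bench_get_cycle_count();
        do_fault_round_trip();               /* faulting access + handler reply */
        ccnt_t end = sel4bench_get_cycle_count();
        results[i] = end - start;            /* one 64-bit store per iteration */

        /* Optional pause so outstanding writes can drain before the next run. */
        for (volatile int p = 0; p < PAUSE_ITERS; p++);
    }
    sel4bench_destroy();
}
```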

alwin-joshy commented 2 years ago

Can’t look at the code right now (very poor connection) but from memory a lot of data is written by the benchmarking rig to a buffer to collect statistics, which, on some processors, exceeds the memory bandwidth leading to stalls.

For this fault benchmark, the only statistic written out per iteration is the number of cycles taken for that iteration, as a 64-bit value. An additional 64-bit variable is used to hold the start time during the operation. I don't see how this would dominate the storage bandwidth when many more registers get saved during a context switch.

I have run the benchmarks again (odroidc2, MCS on and off; this was the platform where I observed the irregular results) with and without the pause, and there appears to be no significant difference in the results. Maybe the irregular results I observed earlier were due to a mistake in the benchmarks that I later fixed without double-checking whether the pause was still necessary.

lsf37 commented 1 year ago

Hey @alwin-joshy where are we with this PR? Would you be able to rebase and resolve conflicts? Are there any open issues?

alwin-joshy commented 1 year ago

I'll get it cleaned up by tomorrow and let you know

alwin-joshy commented 2 weeks ago

@lsf37 Finally had some time to look into this and have addressed the issues.

The PR adds the VM fault benchmark for aarch32/64, x86_64 and RISC-V. There is also a separate commit that properly closes #20, as the fault and hardware benchmarks now seem to be supported for RISC-V (I can move this to a separate PR if that is more appropriate).