seL4 / sel4bench

sel4 benchmarking applications and support library.

bench/signal: early processing methodology #26

Closed malus-brandywine closed 1 year ago

malus-brandywine commented 2 years ago

Implementation of "Early Processing" methodology for "Signal to High Prio Thread" benchmark.

Benchmarks have to collect a considerably large sequence of measurements and process them in a later phase. Processing comprises calculating the distribution parameters.

The originally introduced methodology sequentially saves the measurements in an array. In some cases this may cause D-cache line evictions, and subsequent D-cache line refills, that occur inside the measurement period.

The early-processing methodology accumulates, during the collection phase, the four parameters necessary to calculate the mean, variance and standard deviation, and drops the individual measurements.
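To illustrate the idea (a minimal sketch, not the PR's actual code), the collection loop below keeps only running accumulators; with a fixed sample count these are enough to derive the mean, variance and standard deviation afterwards. `measure_once()` and `N_RUNS` are hypothetical placeholders:

```c
#include <math.h>
#include <stdint.h>

#define N_RUNS 512  /* hypothetical number of measurements per run */

/* Hypothetical helper: one signal round-trip, returning its latency in cycles. */
extern uint64_t measure_once(void);

static void early_process(double *mean, double *variance, double *stddev)
{
    uint64_t sum = 0, sum_sq = 0;   /* accumulated during the collection phase */

    for (unsigned i = 0; i < N_RUNS; i++) {
        uint64_t t = measure_once();
        sum    += t;
        sum_sq += t * t;            /* individual measurements are dropped */
    }

    /* Post-collection: derive the distribution parameters from the sums. */
    *mean     = (double)sum / N_RUNS;
    *variance = (double)sum_sq / N_RUNS - (*mean) * (*mean);
    *stddev   = sqrt(*variance);
}
```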

Signed-off-by: Nataliya Korovkina malus.brandywine@gmail.com

gernotheiser commented 2 years ago

This should really be a configuration option for all benchmarks, not just one, as it addresses a general issue

pingerino commented 2 years ago

IIRC the page that collects results was mapped write through. The idea was we'd only append data, never read, and use up 1 TLB slot. Was this measured to be incorrect?

malus-brandywine commented 2 years ago

@pingerino, thanks for checking. Short answer is "not quite", detailed answer is the following.

Point 1

The benchmark suite uses the sel4_libs libraries to allocate memory for the results. The allocation function accepts only a "cacheability" flag from the benchmark application.

Going deeper into the library code, the memory attributes that are passed to kernel API functions are set to "Arch Default" (defined in sel4_libs code as well).

"Arch Default" attributes for aarch32 and aarch64 concern only cacheability. "Arch Default" attributes for x86 are set to "zeros".

We might have more flexibility if we skipped the library and invoked the API directly; however, see Point 2.

Point 2

Looking into the kernel.

For Arm, the page mapping function checks only the "cacheable" and "executable" flags coming in the attr parameter. For "cacheable" allocations it always sets the "write-back" write policy and the "read and write" allocation policy. There's no point in switching to calling the kernel API directly.

For x86, the page mapping function sets the write policy depending on the parameter passed. "Arch Default" in this case is 0, which selects the "write-back" policy. Switching to direct API calls might work here.
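For reference, a direct mapping call on x86 could look roughly like the sketch below; the capability names are placeholders and the attribute constants should be checked against the libsel4 headers:

```c
#include <sel4/sel4.h>

/* Hypothetical caps: a frame holding the results buffer and the VSpace root. */
extern seL4_CPtr results_frame_cap;
extern seL4_CPtr vspace_root_cap;

static seL4_Error map_results_page_write_through(seL4_Word vaddr)
{
    /* Ask for write-through explicitly instead of
     * seL4_X86_Default_VMAttributes (0, i.e. write-back). */
    return seL4_X86_Page_Map(results_frame_cap,
                             vspace_root_cap,
                             vaddr,
                             seL4_AllRights,
                             seL4_X86_WriteThrough);
}
```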

RISC-V is not covered here; I'm not familiar with the architecture (yet). I'd like to ask the community for help with it.

The solution definitely makes sense for the Arm architecture.

lsf37 commented 2 years ago

Can you post result differences for the benchmarks with and without early processing? The reasoning sounds consistent, but early processing also introduces its own measuring error (maybe more consistently), so it'd be good to see what the difference in reality is.

If it does make a strong difference, we probably really should consider doing this for all of them.

malus-brandywine commented 2 years ago

Sure, @lsf37. I have the tables of comparison.

For clarification, "late processing with delay" refers to measurements taken with the empty loop introduced last year as a remedy for the "1-in-8" pattern. For the moment, it can be ignored.

Regarding the results for Arm, I did see problems caused by the increased code size. However, I made the code "flat", so we have decent measurements now. On Armv8 there are some runs with a higher standard deviation; I take that as grounds for further improvement.

For the other architectures we currently see poorer results than with "late processing". Two scenarios are possible: 1) introduce the methodology for x86 & RISC-V with the expectation that the code will be improved for them later (the choice of methodology is a config option anyway); 2) introduce the methodology for Arm only.

Arm Tables

Risc-V Tables

x86 Tables

pingerino commented 2 years ago

@pingerino, thanks for checking. Short answer is "not quite", detailed answer is the following.

Point 1

The benchmark suite uses the sel4_libs libraries to allocate memory for the results. The allocation function accepts only a "cacheability" flag from the benchmark application.

Going deeper into the library code, the memory attributes that are passed to kernel API functions are set to "Arch Default" (defined in sel4_libs code as well).

"Arch Default" attributes for aarch32 and aarch64 concern only cacheability. "Arch Default" attributes for x86 are set to "zeros".

We might have more flexibility if we skipped the library and invoked the API directly; however, see Point 2.

Point 2

Looking into the kernel.

For Arm, the page mapping function checks only the "cacheable" and "executable" flags coming in the attr parameter. For "cacheable" allocations it always sets the "write-back" write policy and the "read and write" allocation policy. There's no point in switching to calling the kernel API directly.

For x86, the page mapping function sets the write policy depending on the parameter passed. "Arch Default" in this case is 0, which selects the "write-back" policy. Switching to direct API calls might work here.

RISC-V is not covered here; I'm not familiar with the architecture (yet). I'd like to ask the community for help with it.

The solution definitely makes sense for the Arm architecture.

It has long been an issue that we cannot set these cache attributes from user level. The kernel benchmark memory area should be mapped write-through; ideally we could hack in another benchmark function to allow this in the short term.

That would be less work than rewriting every test. And as @lsf37 noted, the reasoning for keeping the processing out of the specific benchmark apps was to keep them as small as possible and also allow for the raw measurements to be exported easily for analysis when summary stats aren't sufficient. I thought about this carefully when I designed it this way.

Iff we cannot truly set appropriate cache attributes for the shared page on an architecture, then a config option is a decent solution.

malus-brandywine commented 2 years ago

My next comment is for @gernotheiser.

If we introduce the new methodology for RISC-V and x86, there will be more changes to the code. So, before extending the methodology to other benchmarks, I would wait for the code improvements to the current one.

malus-brandywine commented 2 years ago

@pingerino, thank you for the details!

malus-brandywine commented 2 years ago


I have further comments on the topic.

1. Write-through policy with "late processing" methodology

A reply to Anna's comment:

It has long been an issue that we cannot set these cache attributes from user level. The kernel benchmark memory area should be mapped write-through; ideally we could hack in another benchmark function to allow this in the short term.

That would be less work than rewriting every test.

I made a small fix in the sel4_libs/libsel4utils code to check how setting the "write-through" attribute changes the measurements for x86 (for which the kernel reads the write-policy attributes from user space). The answer: nothing changes, for both MCS and non-MCS kernels.

So, for now I wouldn't bother allowing users of sel4_libs/libsel4vspace and sel4_libs/libsel4utils to pass the attributes for mapping operations. However, it could be an enhancement opportunity for the future.

2. "Late processing": improvements of results

I'd like to clarify the output of the current "late processing" methodology. There are 3 types of abnormal patterns, 2 of them being increased values in the first N and the last 2-3 measurements: the "head pattern" and the "tail pattern".

The first one is caused by a prolonged warm-up process. The second one is very likely caused by Arm's value prediction feature (spotted on Armv8). At the moment I have no means to prove it, but the tail does not depend on the number of measured samples.

So, to counteract those two patterns we can just cut the warm-up and tail samples off and record measurements only in the stable phase. I'm going to implement this after we finish with the current commit.
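A minimal sketch of that trimming, assuming the samples are already in the late-processing array; the cut-off counts are hypothetical and would need tuning per platform:

```c
#include <stdint.h>

#define N_RUNS    512   /* hypothetical total number of samples */
#define HEAD_CUT   16   /* hypothetical number of warm-up samples to discard */
#define TAIL_CUT    3   /* hypothetical number of tail samples to discard */

/* Accumulate statistics only over the stable phase of the run. */
static void accumulate_stable(const uint64_t samples[N_RUNS],
                              uint64_t *sum, uint64_t *sum_sq, unsigned *count)
{
    *sum = 0;
    *sum_sq = 0;
    *count = N_RUNS - HEAD_CUT - TAIL_CUT;
    for (unsigned i = HEAD_CUT; i < N_RUNS - TAIL_CUT; i++) {
        *sum    += samples[i];
        *sum_sq += samples[i] * samples[i];
    }
}
```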

A report with a description of the raw data is presented here

3. "Early processing": integration

The third pattern we see in the current methodology's output is "1-in-8": every 8th sample is considerably higher. The pattern was spotted only on the Cortex-A9 bench. The "early processing" methodology effectively solves the problem.

Generally, it is a good idea to get rid of the sequential writes and use 4 register variables for collecting statistics, but we have to deal with compiler optimizations and micro-architectural processes, so the code will be processor-dependent. The presented version of the code was tuned for Armv7 & Armv8, so we see improved measurements for both of them. I assume the "early processing" code will look different for the other two architectures.


- per architecture

My suggestion is to allow the presented code for Arm architectures (I will add conditional compilation).

When we have implementations for x86 & RISC-V, we will add the "early processing" option for them as well.


- per benchmark

I would certainly stay away from a "bamboo flowering" event of adding the new code across all the benchmarks.

We can expect improved measurements in other benchmarks whether or not the 1-in-8 pattern is seen, but a series of measurements with the new algorithm should be made first.

The presented code can be a good starting point for further developments.


- participation

I have spent quite an amount of time on this task, so I personally would not go beyond what I have presented so far.

(Short of curing the "head" and "tail" patterns, of course, as I promised)

~

axel-h commented 2 years ago

Adding some notes from the recent hangout session:

gernotheiser commented 2 years ago

General comments

Any sort of profiling interferes with what you're trying to measure, and a good methodology aims to minimise this interference.

The present setup of seL4bench writes a lot of data which interferes with the measurements by adding memory stalls on architectures with low memory bandwidth. This is the "1-in-8" pattern on the Sabre @malus-brandywine refers to. We've also seen significantly increased latency in every second run (1-in-2) on the HiFive (RISC-V). These issues manifest themselves in high standard deviations.

Any standard deviation larger than about 5 is highly suspicious and indicates that the mean cannot be trusted. We have that on a fair number of sel4bench marks across different architectures.

Incidentally, on a platform with really low memory bandwidth you may experience a stall on each iteration, leading to a consistently increased latency, so you'd get the wrong result without even being warned by a large standard deviation.

Clearly the present methodology is poor and needs improving.

Incidentally, it is not clear whether mapping the log buffer uncached helps. That approach trades reduced interference with the L1 cache for increased memory traffic. This will only help if there is sufficient memory bandwidth and if the unit of writes is an integer multiple of the unit of updates in RAM (possibly by the RAM being write-combining), which is definitely not universal. So this may help in some cases, but is an architecture-specific workaround instead of a general solution.

So, in summary, the present methodology is not appropriate for measuring performance. It works in some cases, but gives misleading results in others. The log buffer is still useful for analysing performance, as long as you understand the limitations; specifically, look at the data in the context of the real results.

How do you get the "real" results? The answer is the "early processing" implemented by @malus-brandywine. Unsurprisingly, this is the approach you get taught in Physics labs – for good reason. It avoids producing large amounts of data to extract and thus being affected by memory bandwidth.

I don't see where early processing, done correctly, would "introduce its own measuring errors". The benchmarking rig needs at least one D-cache line (for the counter), unless the compiler keeps it consistently in registers. (Whether or not this happens is irrelevant, as loop overheads will be deducted.) Early processing adds 2 more variables, sum(n) and sum(n^2), 3 overall, which fits in a cache line on all contemporary architectures I'm aware of (as long as n and sum(n) are 32-bit ints). As long as care is taken to keep these cache-aligned (probably best done by putting them into an aligned struct) there is no increased cache interference. It's a clean, generally applicable approach that makes minimal assumptions on architectural details. It's the one that should be used for calculating the core performance metric across all marks and platforms.
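A sketch of keeping those accumulators in a single, cache-line-aligned struct (64-byte lines and the field widths below are assumptions, not the benchmark's actual layout):

```c
#include <stdint.h>

/* All running statistics in one D-cache line, aligned so the accumulators
 * never straddle a line boundary. */
struct run_stats {
    uint32_t n;       /* number of samples taken so far */
    uint64_t sum;     /* running sum of the measured latencies t */
    uint64_t sum_sq;  /* running sum of t^2 */
} __attribute__((aligned(64)));
```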

As I said, having the log buffer is still useful for analysis (with caveats). So both should be supported, selectable by a build option, but the early-processing variant is what should be used for performance reporting.

pingerino commented 2 years ago

Ok, it's now clear to me that the early processing does show a difference in measurement and has been measured - this was not apparent with only the context of the initial PR. This is a welcome improvement and yeah, we should keep both options but default to early processing.

"Clearly the present methodology is poor and needs improving."

I don't appreciate this kind of tone and IMO it brings down the quality of discussion and does not align with the values we want as a community. Let's not degrade prior work but appreciate it, as well as the fact that most things can do with improvement and none of us are perfect. We are all resource constrained and things are built within a context that is hard to see when only viewing the code after the fact. Calling things poor discourages future contributions and makes our community seem unfriendly, judgemental and cruel.

This benchmark suite is the product of much blood, sweat and tears by many team members. Yes, the current implementation can be improved and this PR should be merged.

I'll now take a look at the code in a bit more detail to give feedback to get this merged.

malus-brandywine commented 2 years ago

Thank you for the feedback, everyone; I will fix it in the following few days.

gernotheiser commented 2 years ago

"Clearly the present methodology is poor and needs improving."

I don't appreciate this kind of tone and IMO it brings down the quality of discussion and does not align with the values we want as a community. Let's not degrade prior work but appreciate it, as well as the fact that most things can do with improvement and none of us are perfect. We are all resource constrained and things are built within a context that is hard to see when only viewing the code after the fact. Calling things poor discourages future contributions and makes our community seem unfriendly, judgemental and cruel.

Hi @pingerino, rather than "judgmental and cruel", what I made is a factual statement. The existing methodology produces objectively wrong results in many situations (for reasons I explained), and is therefore objectively a poor one (I could justifiably have called it "wrong"). And as it came out of my lab, I am as responsible for it as anyone.

Stating the facts is not degrading/judgmental, it's the way science works. I would have used the same terminology in a scientific publication. Refusing to call a spade a spade is exactly what leads to the recent trend of giving pseudo-science "equal airtime" with science. So please, let's just stay factual.

mbrcknl commented 2 years ago

@gernotheiser The language we use to talk about people's work really does make (and indeed has already made) a big difference to whether they come back to do more, and can also affect whether people come to do work in the first place. You are the chair of the board of the seL4 Foundation, so I really hope you can understand this. I would be happy to discuss this further, but not on this pull request.

Thank you @malus-brandywine for your very fine contribution, and thank you @pingerino for your previous work on the benchmarking framework.

lsf37 commented 2 years ago

@gernotheiser The language we use to talk about people's work really does make (and indeed has already made) a big difference to whether they come back to do more, and can also affect whether people come to do work in the first place. You are the chair of the board of the seL4 Foundation, so I really hope you can understand this. I would be happy to discuss this further, but not on this pull request.

Not to drag this out on this PR (I would also be happy to continue this elsewhere if needed), this is a situation that is not exactly uncommon in technical communication in general, so I do want to point out some things, because others might find them useful (not necessarily all for Gernot, I'm sure he knows this, and I hope he'll be OK with me dissecting some of it -- take it as an example that knowing it is not always enough):

gernotheiser commented 2 years ago

Early processing adds 2 more variables, sum(n) and sum(n^2), 3 overall, which fits in a cache line on all contemporary architectures I'm aware of ...

I spoke nonsense here. Of course, this is meant to be [n, sum(t), sum(t^2)], where t is the measured latency.

malus-brandywine commented 2 years ago

Early processing adds 2 more variables, sum(n) and sum(n^2), 3 overall, which fits in a cache line on all contemporary architectures I'm aware of ...

I spoke nonsense here. Of course, this is meant to be [n, sum(t), sum(t^2)], where t is the measured latency.

Technically, there are 2 more variables the algorithm collects: "min" and "max".

I have double-checked all the architectures' asm code today; all the variables are in registers (!). But I found a potential point to improve the code. I will get back with the results in the next couple of days.

pingerino commented 2 years ago

Another thought: do we want to use a config option or run both variants of the benchmarks? If the individual results are insightful even with more perturbations, it's better to run them all.

Mainly because an explosion of config options often leads to code bit-rotting, or to way too many configuration combinations that need to be run in CI/CD.

lsf37 commented 2 years ago

Neither run looks particularly expensive, so for now it would be perfectly fine to run both in CI.

Would need a bit of a refactor, but it would mean the config option could be used additively instead of if/else as it is now, which would be slightly more flexible, at least if there is also one that can independently switch on/off the old run.

Not sure it's worth it, I don't have strong feelings either way. Slight preference for running both as long as it's cheap and useful, but we might as well merge this first and then refactor for config options in a second PR.

gernotheiser commented 2 years ago

Technically, there are 2 more variables the algorithm collects: "min" and "max".

I have double-checked all the architectures' asm code today, all the variables are "register" (!). But I found potential point to improve the code. Will get back with the results the next couple of days.

I would leave the min/max off for the early-processing variant, as it avoids cache pollution and will add nothing you can't find out with the late-processing variant. Furthermore, as long as the standard deviation is tiny they'll be virtually identical (and if stdev isn't tiny then you know there's something wrong).

malus-brandywine commented 2 years ago

I'm gradually making the requested changes.

The latest commit improved the measuring loop: "min", "max" and the excessive bitmask solution were removed. That made the loop free of memory accesses except for the mandatory one: a read of a volatile variable updated by the other thread. Only register variables are at play now. The code also got smaller.

For those who are curious, I posted the asm code of the measuring loop for the 4 processors here: LoopCodeImproved-05.09.2022.md

The tables with the new measurements are coming.

malus-brandywine commented 2 years ago

I collected the measurements for the 4 processors (20 runs with MCS kernel).

The table is in BenchmarkData-05.10.2022.pdf

The latency for Armv7 & Armv8 got larger after the extra "load" operation was removed. We see the same phenomenon we saw with the empty loop in the late-processing code (which I mentioned in my April report). The empty loop with a volatile counter generated a number of loads/stores which facilitated cache operations; the latency improvement from the empty loop had nothing to do with the clock cycles elapsed between individual measurements.

So, for now, the code seems to be the least intrusive option while staying CPU-agnostic.

gernotheiser commented 2 years ago

Something is not right here (if I understand correctly that this uses the early-processing approach).

Standard deviations are generally around 10%, except for the HiFive, where they are generally small but occasionally also big.

This should be a completely deterministic scenario that should have single-digit standard deviations. The only source of (apparent) non-determinism I can imagine are a significant number of budget expiries.

What are the parameter settings of the signaller's scheduling context? I assume the high thread is passive?

malus-brandywine commented 2 years ago

The only source of (apparent) non-determinism I can imagine are a significant number of budget expiries. What are the parameter settings of the signaller's scheduling context?

Both threads are Round Robin with Budget and Period 5 ms.
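For context, on the MCS kernel that corresponds roughly to the sketch below (capability names are placeholders; budget equal to period gives a full, round-robin scheduling context, and newer kernels use seL4_SchedControl_ConfigureFlags with an extra flags argument):

```c
#include <sel4/sel4.h>

/* Hypothetical caps: the core's SchedControl cap and the thread's scheduling context. */
extern seL4_CPtr sched_control_cap;
extern seL4_CPtr thread_sc_cap;

static seL4_Error configure_round_robin_5ms(void)
{
    /* Budget == period == 5 ms (values in microseconds). */
    return seL4_SchedControl_Configure(sched_control_cap,
                                       thread_sc_cap,
                                       5 * 1000,   /* budget */
                                       5 * 1000,   /* period */
                                       0,          /* extra refills */
                                       0);         /* badge */
}
```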

I assume the high thread is passive?

Not really. Changing to a passive thread would require significant changes to the current code. I will close this PR.

I'll take into account the reviews received here.

pingerino commented 2 years ago

I don't see how signalling a passive thread will make a difference?

gernotheiser commented 2 years ago

I’d assume that it makes a difference whether the signalled thread is runnable or not, so it should increase (apparent) non-determinism.

pingerino commented 2 years ago

Not really, if the signalled thread is lower prio we won't switch to it and we will continue on the fastpath.

gernotheiser commented 2 years ago

the benchmark was signalling to high

pingerino commented 2 years ago

Right - switching it to passive would make the benchmark equivalent to signalling to low (no change in thread). I'm not sure I see the point there. IIRC there was a signal to low prio already somewhere in there.

gernotheiser commented 2 years ago

My understanding is that presently it is passive

malus-brandywine commented 2 years ago

Like I mentioned in the previous comment, the destination high prio thread is NOT passive.

malus-brandywine commented 2 years ago

At the moment I'm stepping away from the PR for several weeks. Anyone willing to continue the research/redesign is welcome to join!

malus-brandywine commented 2 years ago

I supported the idea of having both processing methods of the benchmark in the same run, not mutually exclusively. This is now implemented.

Also, I didn't change the caller and callee threads; both remain active.

Other change requests were addressed.

kent-mcleod commented 2 years ago

I propose we merge this PR as it's currently implemented: it appears to be a working implementation of the early-processing method and adds it as an additional benchmark, so it won't remove any existing benchmarking results. We can later choose whether to update all other benchmarks that measure a mean and stddev in a separate PR?

imx8mm-evk 64-bit: Early processing method:

        "Mean": 556,
        "Stddev": 5.96284793999944,
        "Variance": 35.2,

Late processing method:

        "Min": 543,
        "Max": 2726,
        "Mean": 816.75,
        "Stddev": 701.7155672875851,
        "Variance": 487480.69,
        "Mode": 543,
        "Median": 555,
        "1st quantile": 543,
        "3rd quantile": 592,
gernotheiser commented 2 years ago

I have a student looking into this. There's no harm in merging now, but it seems better to do the complete job

lsf37 commented 1 year ago

Closing this PR, because the work has been included in #33 -- please re-open if I missed something.