raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

Change to 250hz and voluntary preemption #1216

Closed robingroppe closed 7 years ago

robingroppe commented 8 years ago

Please change the kernel to 250hz and voluntary preemption.

In everyday use, my Pi with the modified kernel can take on more work while still staying responsive to all other running tasks. For example, with the stock kernel, when I run the MumbleRubyPluginbot, which needs to be fed with data every 20ms or so, and then start an apt upgrade of some packages, it quickly starts to lag. With the modified kernel everything is fine. You said you would need some evidence, so I ran UnixBench on both kernels. I can't tell much about starting graphical apps because my Pi is running headless, but I guess you guys have cross-compilers set up and can easily build a modified kernel. By the way, these two settings are also used in almost every stock kernel of the major distributions (Debian, Ubuntu...), and in my opinion there is a reason for that.

UnixBench stock kernel: https://robingroppe.de/media/rpi2/orig.txt
UnixBench modified kernel: https://robingroppe.de/media/rpi2/mod.txt
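
For anyone wanting to reproduce the numbers: UnixBench can be run roughly like this (a sketch; the byte-unixbench GitHub mirror is assumed, and it needs gcc, make and perl installed):

$ git clone https://github.com/kdlucas/byte-unixbench
$ cd byte-unixbench/UnixBench
$ ./Run          # runs the full suite; per-run reports end up in the results/ directory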

P33M commented 8 years ago

Preemption is enabled by default.

https://github.com/raspberrypi/linux/blob/rpi-4.1.y/arch/arm/configs/bcmrpi_defconfig#L42

We don't set CONFIG_HZ in the defconfig, so the default of 100Hz is used - in theory this gives a 10ms timeslice. Can you try increasing the priority of the critical process to realtime and see whether you get the same improvement?
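
For reference, raising an already-running process to realtime priority can be done roughly like this (just a sketch; the process name and the priority value 50 are placeholders):

$ PID=$(pidof mumble-ruby-pluginbot)   # placeholder process name
$ sudo chrt --fifo -p 50 $PID          # switch it to SCHED_FIFO, priority 50
$ chrt -p $PID                         # confirm the new policy and priority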

robingroppe commented 8 years ago

Yes, the low-latency (full) preemption model is enabled, while the rest of the kernel is not tuned for realtime at all. Check the difference between PREEMPT_VOLUNTARY and PREEMPT. For example, Ubuntu always uses PREEMPT_VOLUNTARY for desktops and servers; only the low-latency kernel uses PREEMPT combined with 1000Hz and a few other tweaks. Full preemption causes kernel overhead and reduces throughput.

robingroppe commented 8 years ago

Here is my modified kernel in case you want to run a few tests. http://robingroppe.de/media/rpi2/kernel7.xz

I followed this guide: https://www.raspberrypi.org/documentation/linux/kernel/building.md

So all you have to do is jump into the linux directory and...

$ sudo cp arch/arm/boot/dts/*.dtb /boot/
$ sudo cp arch/arm/boot/dts/overlays/*.dtb* /boot/overlays/
$ sudo cp arch/arm/boot/dts/overlays/README /boot/overlays/
$ sudo scripts/mkknlimg arch/arm/boot/zImage /boot/kernel7.img
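
The config changes themselves are not spelled out above; a rough sketch of how they could be applied on top of the Pi 2 defconfig (assuming the scripts/config helper shipped in the kernel tree; the defconfig name depends on the board):

$ make ARCH=arm bcm2709_defconfig                                     # stock Pi 2 config
$ ./scripts/config --disable PREEMPT --enable PREEMPT_VOLUNTARY       # preemption model
$ ./scripts/config --disable HZ_100 --enable HZ_250 --set-val HZ 250  # timer frequency
$ make ARCH=arm olddefconfig                                          # resolve dependent options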

robingroppe commented 8 years ago

Just found a discussion about it on the Arch Board: http://archlinuxarm.org/forum/viewtopic.php?f=23&t=7907

popcornmix commented 8 years ago

@robingroppe Can you test with CONFIG_PREEMPT_VOLUNTARY and with CONFIG_HZ_250 separately and report on the behaviour with each?

Both these settings will increase overhead and so reduce throughput, so we need to be very sure of exactly what the benefits are before enabling them.

robingroppe commented 8 years ago

In this situation, changing to voluntary preemption will actually improve throughput. But I can do that. Do you have any tests you want me to rely on?

popcornmix commented 8 years ago

No, it may help latency but it won't improve throughput.

These new preemption points have been selected to reduce the maximum latency of rescheduling, providing faster application reactions, at the cost of slightly lower throughput.

http://cateee.net/lkddb/web-lkddb/PREEMPT_VOLUNTARY.html

robingroppe commented 8 years ago

Okay. Voluntary preemption causes throughput to drop slightly, but full preemption, which you are using right now, causes an even bigger drop in throughput plus additional kernel overhead.

Ferroin commented 8 years ago

@popcornmix That's in comparison to PREEMPT_NONE; PREEMPT_VOLUNTARY came before PREEMPT_FULL (which is what the RPi is using). PREEMPT_NONE is the highest-throughput option, but it gives noticeable latency for many things that require interactive usage (which is why almost nobody who isn't doing exclusively HPC workloads uses it anymore). PREEMPT_FULL provides the lowest latency, but at the cost of significant throughput (and potential stability issues). PREEMPT_VOLUNTARY was originally a precursor to PREEMPT_FULL, but is now kept as a compromise between PREEMPT_FULL and PREEMPT_NONE.

robingroppe commented 8 years ago

I am compiling two more kernels: one with only 250Hz and full preemption, and one with 100Hz and voluntary preemption. This will take a while. Has anyone tested the modified kernel already?

popcornmix commented 8 years ago

@Ferroin Okay, so CONFIG_PREEMPT_VOLUNTARY will make latency worse compared to current. Possibly an issue for some of the hardware drivers (LIRC/I2C/I2S/SPI etc.).

Ferroin commented 8 years ago

Unless the drivers are in user-space, this should actually make things better for them. The latency impact is entirely in user-space, and technically, reducing preemption should make timing-critical stuff work more reliably (assuming they have critical sections properly wrapped in preempt_disable()/preempt_enable() calls).

robingroppe commented 8 years ago

But the 250Hz kernel timer, i.e. a 4ms tick, will reduce latency by more than you lose by switching to the other preemption model.

Ferroin commented 8 years ago

That's pretty dependent on how the drivers are handling preemption, as well as what you are doing in general on the system. I think for most use cases, if it doesn't fully offset the latency increase from PREEMPT_VOLUNTARY, it should come very close. Some people trying to do odd timing-specific things with their hardware from userspace (the DHT11 temperature sensor immediately comes to mind) may have issues, but if they really need low latency, they should probably be building their own kernel with HZ=1000 anyway.

It's worth keeping in mind that the higher timer frequency will also increase power consumption (although I doubt that it will be more than a few micro-amps difference between HZ=100 and HZ=250), so it may be worth testing that as well.

clivem commented 8 years ago

I was going to stay out of this as I cannot provide any hard evidence. I don't know whether years of experience count? LOL

IMHO, the best balance between latency and throughput for general usage is achieved with PREEMPT_VOLUNTARY and 250Hz for headless, and PREEMPT_VOLUNTARY and 1000Hz for desktop/GUI. But I wouldn't change the current RPi default configs. If people want to change the defaults for their specific use cases, they can compile their own kernels...

robingroppe commented 8 years ago

I am not asking to change values for some specific use case. I want a usable general-purpose kernel right out of the box. Look at what the big distros are using... There is a reason why they do it that way.

Ferroin commented 8 years ago

Might as well add what I use on various systems as well.

In general, I use one of four different configurations, depending on what the system is for:

  1. HZ=100 PREEMPT_NONE (I use this for stuff that is solely for number crunching or other similarly processor-bound things; it provides the best overall throughput, but latency is horrible. The only systems I run this way are usually dedicated BOINC clients, and on occasion VMs for testing particular things).
  2. HZ=250 PREEMPT_VOLUNTARY (I use this for most systems when I don't have some particular reason to use anything else. When I use a custom kernel on the Pi, I usually use this configuration).
  3. HZ=300 PREEMPT_FULL (I use this on systems that I need to do multimedia work on, but don't need true real-time performance. The particular reasoning is that 300 is exactly divisible by both the PAL and NTSC frame rates, so it's a bit better for live video editing).
  4. HZ=1000 PREEMPT_FULL (I use this only on stuff that needs absolutely minimal latency, usually when I'm doing something that requires real-time guarantees. It's horribly energy-inefficient (about 20-30W greater power consumption compared to HZ=250 PREEMPT_VOLUNTARY on an AMD FX-8320), and it really trashes computational throughput; most benchmarks are noticeably lower with this than with HZ=250 and PREEMPT_VOLUNTARY).

Overall, the biggest impact from both of these options is how many mandatory context switches they cause. Context switches are expensive, even on really well-designed hardware, and are a large part of what hurts throughput in number-crunching workloads. A higher frequency on the global timer interrupt (a higher HZ value) increases the number of required context switches in direct proportion to its value (each time it fires, you get at minimum two context switches: one from the running task to the scheduler, and one from the scheduler to the new task it selects to run). It's harder to quantify what impact the PREEMPT options have, but the more preemption points are available, the more likely a context switch will happen.

Ferroin commented 8 years ago

@robingroppe 'All the big distros are doing it this way' is not a valid argument in general, and especially not for embedded systems.

Take, for example, Ubuntu's choice of which kernel to ship with their releases. Almost always, it's not a version that is tagged upstream for long-term support.

On top of the numerous poor decisions that get made by big distros, you need to remember that, other than OpenWRT, Angstrom, and their friends, the big distros are targeted at desktop or server systems, which have very different requirements from embedded systems.

Arguably, HZ=100's only advantages over HZ=250 are throughput (which is insignificant when you have horrible latency) and energy efficiency (which shouldn't be a primary consideration when using something with a 5-10W nominal power draw).

As for PREEMPT_VOLUNTARY, that has a lot more potential to impact existing user code, but it is a much less significant change than switching to PREEMPT_NONE.

robingroppe commented 8 years ago

Isn't the Pi meant to be a desktop or a server?

popcornmix commented 8 years ago

My Ubuntu install does have CONFIG_PREEMPT_VOLUNTARY and CONFIG_HZ_250. OpenELEC on Pi has CONFIG_PREEMPT_VOLUNTARY and CONFIG_HZ_300. OSMC on Pi has CONFIG_PREEMPT and CONFIG_HZ_100. Might be worth checking what Arch Linux on Pi uses.
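
A quick way to check that on any running image (assuming the kernel exposes its config via IKCONFIG, or the distro ships the build config under /boot):

$ sudo modprobe configs 2>/dev/null                         # needed where IKCONFIG is built as a module
$ zcat /proc/config.gz | grep -E 'CONFIG_(HZ|PREEMPT)'      # timer frequency and preemption model
$ grep -E 'CONFIG_(HZ|PREEMPT)' /boot/config-$(uname -r)    # fallback if /proc/config.gz is absent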

Do we have anyone who objects to these settings? Comments @pelwell @P33M @notro ?

Ferroin commented 8 years ago

@robingroppe The original intent was to be an educational tool, for teaching basic programming skills, as well as basic electrical design skills. It's obviously evolved far beyond that (because at the time of release, it was the absolute cheapest SBC available that was actually usable beyond IoT type applications), but that doesn't mean that it's not an embedded system by nature.

clivem commented 8 years ago

@popcornmix IIRC, Fedora defaults to CONFIG_PREEMPT_VOLUNTARY and 200Hz for ARM kernel builds. (Just another useless piece of data...)

Ferroin commented 8 years ago

@popcornmix I'd say that given the particularly heavy usage of the Pi for media center type things, HZ=300 is probably slightly preferred to HZ=250. The way that people seem to be using it, we should almost certainly be prioritizing minimal latency over maximal throughput, so I'd say it's still a tossup whether we really want PREEMPT_VOLUNTARY over PREEMPT_FULL.

pelwell commented 8 years ago

I'm curious to see whether the difference in flat-out number crunching performance between HZ=100 and HZ=250/300 is measurable on an otherwise idle system. But in general, given that OpenELEC seems OK with VOLUNTARY/300 I think we could give it a try.

robingroppe commented 8 years ago

I can say that my Pi runs "time echo 'scale=5000; 4*a(1)' | bc -l" in 2m07s on the stock kernel and in 2m10s on a 1000Hz RT kernel.

robingroppe commented 8 years ago

So 250 or 300hz should be nothing to worry about.

popcornmix commented 8 years ago

127 seconds versus 130 seconds is a 2.3% difference, which I wouldn't say was nothing to worry about. We have spent a lot of time and effort on optimisations smaller than that.

Of course this is 1000Hz and whatever changes "rt" implies, so the actual difference is likely smaller. But in general we wouldn't accept a 1% performance loss without a very compelling reason.

Would be good to do the same test with just CONFIG_HZ_300 and with just CONFIG_PREEMPT_VOLUNTARY and report the changes.
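
Something along these lines, run on each kernel, would make the numbers easier to compare (just a sketch, not an agreed benchmark):

#!/bin/bash
# Repeat the pi-digits bc benchmark a few times so outliers are visible
for i in 1 2 3 4 5; do
    echo "run $i:"
    time sh -c 'echo "scale=5000; 4*a(1)" | bc -l > /dev/null'
done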

robingroppe commented 8 years ago

Just checked: Arch Linux ARM uses 200Hz and voluntary preemption. I will check the difference against a 250Hz kernel.

robingroppe commented 8 years ago

Stock:

  1. Attempt real 2m22.995s user 2m8.580s sys 0m0.050s
  2. Attempt real 2m8.763s user 2m8.760s sys 0m0.000s
  3. Attempt real 2m8.713s user 2m8.690s sys 0m0.000s

250Hz-Voluntary:

  1. Attempt real 2m9.216s user 2m9.184s sys 0m0.012s
  2. Attempt real 2m9.370s user 2m9.360s sys 0m0.008s
  3. Attempt real 2m9.252s user 2m9.244s sys 0m0.008s

But don't forget that this is just one specific benchmark. The UnixBench results told another story.

pelwell commented 8 years ago

Which was...?

robingroppe commented 8 years ago

So there is roughly a 0.4% performance loss in this particular test.

robingroppe commented 8 years ago

Check the first post for the UnixBench results.

pelwell commented 8 years ago

1.7% performance loss for 1 instance, 2.0% loss for 4 instances.

popcornmix commented 8 years ago

I think separate results for CONFIG_HZ_300 and CONFIG_PREEMPT_VOLUNTARY would be useful. My understanding is that CONFIG_HZ_300 will lower performance and CONFIG_PREEMPT_VOLUNTARY will improve it so seeing the individual effects is more useful.

robingroppe commented 8 years ago

Overall the 250Hz kernel was faster in UnixBench. And a loss of a fraction of a percent in pure number-crunching performance in exchange for better latency is not so bad, imo. I don't have numbers for a 300Hz kernel, as that is not what I was asking for. If someone has one compiled and installed, they could run that test.

Ferroin commented 8 years ago

@popcornmix That really depends on what exactly you are doing.

Performance is hard to quantify without knowing the workload. For raw number-crunching applications (in other words, stuff that people are probably only running on the Pi because it's so energy efficient compared to an x86 system of equivalent performance), it will be a performance hit, but only if the workload sits in userspace most of the time instead of making syscalls. For something that's memory bound (like BOINC or other distributed computing applications), it's not as noticeable. For a desktop or media center, responsiveness is usually the preferred measure of performance (along the lines of 'when I hit this button to do X, how long does it actually take before X is done?'), in which case lowering latency is usually more important than maximizing raw processing power.

In general, I hear of people using the Pi in these specific ways:

  1. Education (the original intent)
  2. Inexpensive desktops.
  3. Media centers.
  4. Embedded systems.
  5. Beowulf clustering.

Case 1 should be focused on usability over all else. If you're trying to teach someone something, the less they have to learn about the system being used to teach the lesson, the better. Being efficient is important too, but a slower system doesn't break the ability to teach, and having a lower-performance system encourages writing efficient user code, which is a good skill to have.

Cases 2 and 3 are things that you should be focusing on minimal latency for, but you need to remember usability. These are the cases that are going to benefit the most from increasing HZ from 100 to either 250 or 300. Most media center products that I know of that run Linux use CONFIG_PREEMPT_VOLUNTARY or CONFIG_PREEMPT_FULL and run with HZ=300. With the state of things on the Pi right now, trying to actually multi-task on the desktop can be noticeably laggy and downright annoying.

Case 4 is usually going to focus on latency, and may benefit from a higher timer interrupt frequency. When it doesn't, then the users should be smart enough to build their own kernels.

Case 5 is really an edge case (I know exactly 3 people using the Pi for cluster computing: one is running Hadoop over 1000 nodes, and the other two are using it with USB disks and GlusterFS over around 200 nodes, which is amazingly fast considering the 100M Ethernet interface and the bottleneck of using USB for both the network and the disks). This is really the only case that truly benefits from maximizing computing throughput in an actually quantifiable way. Clustering on something like the Pi also means you should be building your own kernel (and probably your own userspace as well), so it really shouldn't be a major target for the general distributions.

robingroppe commented 8 years ago

I also think the standard kernel should fit Cases 2 and 3. Most people I know are using it as a headless server for network applications or as a small desktop. The headless server use will mostly also benefit from these changes. So what do you think?

Ferroin commented 8 years ago

So, to explain slightly further: the primary advantage of HZ=100 is minimizing the number of timer interrupts occurring on the system. On a single-CPU system, or even something with only 4 or 8 cores, this is usually not going to have a very measurable impact on performance. Where it really starts to improve performance is on really big systems (think along the lines of dozens of cores per CPU, with multiple CPUs), and that's because the timer interrupt fires on all the processors in the system at the same time (unless you use skew_tick=1, which introduces a deterministic amount of jitter on each CPU so the timer interrupt happens at a different time on each; this is not something that has any real benefit for people not doing timing-sensitive workloads, and in fact it even increases power consumption on most systems). The reason you can run such systems with a lower timer interrupt frequency is that they can do real multi-tasking, and therefore it's more likely that a core will be idle to respond to whatever user input occurs with minimal latency.
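
For completeness, skew_tick is an ordinary kernel boot parameter, so on a Pi it would just be appended to the single line in /boot/cmdline.txt (shown purely to illustrate the mechanism; as said above, it is of little benefit here):

$ sudo sed -i 's/$/ skew_tick=1/' /boot/cmdline.txt   # append to the existing one-line cmdline
$ sudo reboot                                         # takes effect on the next boot
$ cat /proc/cmdline                                   # verify the parameter is active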

JamesH65 commented 8 years ago

Sorry to butt in; I've been following the thread with interest.

The standard kernel needs to cater to educational needs, so it needs to have a responsive desktop. Headless devices are not a major use case for education.

What is noticeable from the posts in this thread: no actual figures. Someone needs to determine a set of tests to run that are representative of the major use cases, and try them at the different settings proposed. That way the Foundation can make a valid assessment of the right values to use.

robingroppe commented 8 years ago

I don't know much about scientific benchmarking. If you do, please give it a go. A download link to the modified kernel has been posted. I can say that even on the terminal the system feels snappier.

Ferroin commented 8 years ago

We can't really benchmark desktop responsiveness scientifically; it's far too subjective to measure properly. The closest we can get is probably latencytop, which needs its own config option turned on in the kernel to be usable (and that will affect the results somewhat). unixbench may be worthwhile as a way to determine throughput, but that's still not great.

In general, I'd say that the particular things to look at are:

  1. How long it takes to get to a login/desktop from the time power is applied (this is hard to measure properly; the best option is probably to use timestamps in the logs).
  2. What numbers we get from something like bonnie++ (the big limiting factor for most usage on the Pi is usually storage, so this has a significant impact on latency).
  3. How long it takes to render a reference web page (the reference page doesn't have to be really complex; something like acid3 should be fine). There's a frontend for WebKit that renders directly to a PNG image, which might be useful for testing this.
  4. How it impacts things that are memory bound as opposed to CPU or I/O bound (a good test for this would be timing a long effect chain in SoX processing some reference audio; I'll look at throwing together a script to test this, see the rough sketch after this list).

Ideally, we should get as many samples as reasonably possible, as these are things that may be impacted.
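
As a starting point for item 4, a rough sketch of such a script (the effect chain and the reference file are placeholders, not an agreed test):

#!/bin/bash
# Time a fixed SoX effect chain over a reference file; mostly CPU/memory bound, very little I/O
REF=reference.wav                       # placeholder reference audio file
time sox "$REF" -n rate 96k bass +3 treble -2 reverb 50 stat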

popcornmix commented 8 years ago

nbench is very simple for testing integer/floating point and memory operations. A more advanced GUI benchmark would be something like cairo-traces. There seems to be a cut-down version more suitable for embedded platforms here: https://github.com/ssvb/trimmed-cairo-traces

Ferroin commented 8 years ago

Benchmarks are by nature synthetic workloads, thus no one benchmark is going to give us a full picture of the system performance.

Based on this, the wider the variety of benchmarks we test, the better. I do think that we should at a minimum do something to test disk performance, as that is usually one of the biggest bottlenecks on most SBC-type systems. I think that cairo-traces is definitely worth testing with. I still think that benchmarking HTML rendering is worth doing as well; most people I personally know who use the Pi as a desktop primarily use it for web browsing, and rendering HTML is complex enough that it should show performance variance pretty well.

mr-berndt commented 8 years ago

Thanks to this discussion I was able to solve a major issue I was having with an embedded system here: when building my kernels I had always used the very common recommendation for audio workstations (mostly x86 with some power behind them), which is 1000Hz and PREEMPT.

Running squeezelite with sox resampling, jack and brutefir at 96kHz on the Pi (quite a few tasks) led to short crackling when the playlist changed from 96kHz to 44.1kHz.

Now I changed my kernel to PREEMPT_VOLUNTARY and 250 Hz and the issue is gone.

So again no benchmark but a hint.. ;) And thx from my side for bringing this up!

pelwell commented 8 years ago

Yours is an easier decision, since as you have seen it reduces the overhead. Going from 100Hz to 250Hz adds some overhead, but possibly not enough to worry about - we won't know without some proper benchmarking.

Ferroin commented 8 years ago

This may be worth pointing out:

On the Pi, running with no overclocking and HZ=100, we get 7 million cycles between timer interrupts. With HZ=250, it's approximately 2.8 million. Ignoring the scheduler overhead (because it's not deterministic, and I don't know what the minimum overhead for it is on ARM), that cuts the processor time per time slice to 40% of what it was. This also ignores two specific things, however:

  1. For a compute heavy workload with one task per cpu, the only hit to performance is the added scheduler overhead, as there are still 700 million cycles in one second, no matter what the scheduling interrupt frequency is.
  2. For a realistic desktop workload (mostly idle, but with lots of processes and threads), the more frequent scheduling directly improves responsiveness, which improves the net productivity of the user themselves (you obviously get more work done the less you are sitting around waiting for your computer to respond).

pelwell commented 8 years ago

HZ=1000, we get 7 million cycles

Surely that's a typo? 7GHz would be an aggressive overclock.

Ferroin commented 8 years ago

@pelwell yes, I meant at HZ=100

robingroppe commented 8 years ago

Now think about this: every 10ms the kernel looks for something to do. What if a process finishes after, let's say, 1.5 million cycles? The scheduler does not know that until the full 10ms are over, so for the rest of the slice the CPU just sits there waiting for the scheduler to fire again. This causes wait loops for the rest of the processes, which might already have been served if the scheduler had been aware of the situation. I would not call this overhead like the Arch guys did, but it is surely inefficiency. But that's the situation on a desktop or server: a lot of processes, not one that is crunching numbers all day.

Ferroin commented 8 years ago

Actually, assuming that the process does something that causes it to go into one of the sleep states (S or D in top and similar tools), the reschedule should happen almost immediately. The issue is when you have more runnable tasks than CPU cores to run them on. The lower the scheduler interrupt frequency, the more a task can do before it gets forcibly preempted by some other runnable task. (It's actually more complicated than this, because the Linux scheduler uses the nice value of a task not to prioritize its position in the queue of runnable tasks, but to adjust its share of CPU time - tasks with lower nice values get a larger share and thus get to run more than tasks with higher nice values. This nicely solves many of the issues present in the traditional priority-queue schedulers found in older UNIX systems like SVR4, but it makes it much harder to truly determine the impact of adjusting the scheduler interrupt frequency.)

This is why increasing the frequency reduces raw computational throughput, but improves latency. Unlike most x86 desktops, the Pi has a slow enough processor that many things done on a desktop (for example, rendering this webpage, or fetching an e-mail) take more than one scheduling period to do, which means that they will tend to block other tasks from running for longer periods when the scheduling period is longer, thus hurting latency and responsiveness.