robingroppe closed this issue 7 years ago
Preemption is enabled by default.
https://github.com/raspberrypi/linux/blob/rpi-4.1.y/arch/arm/configs/bcmrpi_defconfig#L42
We don't set CONFIG_HZ in the defconfig so the default of 100Hz is used - this in theory gives a 10ms timeslice. Can you increase the priority of the critical process to realtime with the same results?
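For the realtime-priority experiment suggested above, a minimal sketch using chrt from util-linux; the process name and priority are made-up placeholders:

```shell
# Show which scheduling policies and realtime priority ranges the
# kernel supports (works unprivileged):
chrt -m
# Then, as root, run the latency-critical process under SCHED_FIFO.
# "mycriticalapp" and priority 80 are hypothetical examples:
#   sudo chrt -f 80 ./mycriticalapp   # start it at RT priority 80
#   sudo chrt -f -p 80 <pid>          # or retag an already-running process
```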
Yes, low-latency preemption is enabled, while the rest of the kernel is not otherwise tuned for realtime. Check the difference between PREEMPT_VOLUNTARY and PREEMPT. For example, Ubuntu always uses PREEMPT_VOLUNTARY for desktops and servers; only the low-latency kernel uses PREEMPT combined with 1000Hz and a few other tweaks. Full preemption causes kernel overhead and reduces throughput.
Here is my modified kernel in case you want to run a few tests. http://robingroppe.de/media/rpi2/kernel7.xz
I followed this guide: https://www.raspberrypi.org/documentation/linux/kernel/building.md
So all you have to do is jump into the linux directory and...
$ sudo cp arch/arm/boot/dts/*.dtb /boot/
$ sudo cp arch/arm/boot/dts/overlays/*.dtb* /boot/overlays/
$ sudo cp arch/arm/boot/dts/overlays/README /boot/overlays/
$ sudo scripts/mkknlimg arch/arm/boot/zImage /boot/kernel7.img
Just found a discussion about it on the Arch Board: http://archlinuxarm.org/forum/viewtopic.php?f=23&t=7907
@robingroppe Can you test with CONFIG_PREEMPT_VOLUNTARY and with CONFIG_HZ_250 separately and report on the behaviour with each?
Both these settings will increase overhead and so reduce throughput, so we need to be very sure of exactly what the benefits are before enabling.
In this situation changing to voluntary will actually improve throughput. But I can do that. Do you have any tests you want me to rely on?
No it may help latency but it won't improve throughput.
These new preemption points have been selected to reduce the maximum latency of rescheduling, providing faster application reactions, at the cost of slightly lower throughput.
Okay. Voluntary causes throughput to drop slightly, but full preemption like you are using right now causes an even bigger drop in throughput, plus additional kernel overhead.
http://cateee.net/lkddb/web-lkddb/PREEMPT_VOLUNTARY.html
@popcornmix That's in comparison to PREEMPT_NONE, PREEMPT_VOLUNTARY came before PREEMPT_FULL (which is what the RPi is using). PREEMPT_NONE is the highest throughput option, but provides noticeable latency for many things that require interactive usage (which is why almost nobody who isn't doing exclusively HPC workloads uses it anymore), PREEMPT_FULL provides the lowest latency, but at the cost of significant throughput (and potential stability issues). PREEMPT_VOLUNTARY was originally a precursor to PREEMPT_FULL, but is now kept as a compromise between PREEMPT_FULL and PREEMPT_NONE.
I am compiling two more kernels. One with only 250hz and Full Preempt and one with 100hz and Voluntary Preempt. This will take a while. Has someone tested the modified kernel already?
@Ferroin Okay, so CONFIG_PREEMPT_VOLUNTARY will make latency worse compared to current. Possibly an issue for some of the hardware drivers (LIRC/I2C/I2S/SPI etc).
Unless the drivers are in user-space, this should actually make things better for them. The latency impact is entirely in user-space, and technically, reducing preemption should make timing-critical stuff work more reliably (assuming they have critical sections properly wrapped in preempt_disable()/preempt_enable() calls).
But the kernel timer at 250Hz (a 4ms tick) will reduce latency by more than you lose by switching to the other preemption model.
That's pretty dependent on how the drivers are handling preemption, as well as what you are doing in general on the system. I think for most use cases, if it doesn't fully offset the latency increase from PREEMPT_VOLUNTARY, it should come very close. Some people trying to do odd timing-specific things with their hardware from userspace (the DHT11 temperature sensor immediately comes to mind) may have issues, but if they really need low latency, they should probably be building their own kernel with HZ=1000 anyway.
It's worth keeping in mind that the higher timer frequency will also increase power consumption (although I doubt that it will be more than a few micro-amps difference between HZ=100 and HZ=250), so it may be worth testing that as well.
I was going to stay out of this as I cannot provide any hard evidence. I don't know whether years of experience count? LOL
IMHO, the best balance, for general usage, between latency and throughput is achieved with VOLUNTARY_PREEMPT and 250HZ for headless, and VOLUNTARY_PREEMPT and 1000HZ for desktop/GUI. But I wouldn't change the current RPI default configs. If people want to change the defaults for their specific use cases, they can compile their own kernels......
I am not asking to change values for some specific use case. I want a usable general-purpose kernel right out of the box. Look at what the big distros are using... There is a reason why they do it that way.
Might as well add what I use on various systems as well.
In general, I use one of four different configurations, depending on what the system is for:
Overall, the biggest impact from both of these options is how many mandatory context switches they cause. Context switches are expensive, even on really well-designed hardware, and are a large part of what hurts throughput in number-crunching stuff. A higher frequency on the global timer interrupt (higher HZ value) increases the number of required context switches in direct proportion to its value (each time it fires, you get at a minimum two context switches: one from the running task to the scheduler, and one from the scheduler to the new task it selected to run). It's harder to quantify what impact the PREEMPT options have, but the more preemption points are available, the more likely a context switch will happen.
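To put a rough number on context-switch pressure on a live system: the kernel keeps a cumulative counter in /proc/stat. A quick Linux-only sketch:

```shell
# The "ctxt" line in /proc/stat is the total number of context
# switches since boot; sampling it twice gives a rough per-second rate.
a=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
b=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "$((b - a)) context switches in the last second"
```

Comparing this rate between a HZ=100 and a HZ=250 kernel under the same workload would show part of the extra cost directly.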
@robingroppe 'All the big distros are doing it this way' is not a valid argument in general, and for embedded systems in particular.
Take for example Ubuntu's choice of what kernel to ship with their releases. Almost always, it's not a version that is tagged upstream for long-term-support.
On top of the numerous poor decisions that get made by big distros, you need to remember that, other than OpenWRT, Angstrom, and their friends, big distros are targeted at desktop or server systems, which have very different requirements from embedded systems.
Arguably, HZ=100's only advantages over HZ=250 are throughput (which is insignificant when you have horrible latency) and energy efficiency (which shouldn't be a primary consideration when using something with a 5-10W nominal power draw).
As far as PREEMPT_VOLUNTARY, that has a lot more potential to impact existing user code, but is much less significant than switching to PREEMPT_NONE.
Isn't the Pi meant to be a desktop or a server?
My Ubuntu install has CONFIG_PREEMPT_VOLUNTARY and CONFIG_HZ_250. OpenELEC on Pi has CONFIG_PREEMPT_VOLUNTARY and CONFIG_HZ_300. OSMC on Pi has CONFIG_PREEMPT and CONFIG_HZ_100. Might be worth checking what Arch Linux on Pi uses.
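For anyone wanting to check, the running kernel's config can usually be inspected directly; the paths below are the common conventions and may differ per image:

```shell
# Debian-style kernels ship their config under /boot:
grep -E '^CONFIG_(PREEMPT|HZ)' /boot/config-"$(uname -r)" 2>/dev/null || true
# Raspbian instead exposes it as /proc/config.gz once the
# "configs" module is loaded:
sudo modprobe configs 2>/dev/null || true
zcat /proc/config.gz 2>/dev/null | grep -E '^CONFIG_(PREEMPT|HZ)' || true
```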
Do we have anyone who objects to these settings? Comments @pelwell @P33M @notro ?
@robingroppe The original intent was to be an educational tool, for teaching basic programming skills, as well as basic electrical design skills. It's obviously evolved far beyond that (because at the time of release, it was the absolute cheapest SBC available that was actually usable beyond IoT type applications), but that doesn't mean that it's not an embedded system by nature.
@popcornmix IIRC, Fedora defaults to CONFIG_PREEMPT_VOLUNTARY and 200HZ for ARM kernel builds. (Just another useless piece of data.....)
@popcornmix I'd say that given the particularly heavy usage of the Pi for media center type things, HZ=300 is probably slightly preferred to HZ=250. The way that people seem to be using it, we should almost certainly be prioritizing minimal latency over maximal throughput, so I'd say it's still a tossup whether we really want PREEMPT_VOLUNTARY over PREEMPT_FULL.
I'm curious to see whether the difference in flat-out number crunching performance between HZ=100 and HZ=250/300 is measurable on an otherwise idle system. But in general, given that OpenELEC seems OK with VOLUNTARY/300 I think we could give it a try.
I can say that my Pi runs time echo "scale=5000; 4*a(1)" | bc -l in 2m07s on the stock kernel and 2m10s on a 1000Hz rt kernel.
So 250 or 300hz should be nothing to worry about.
127 seconds versus 130s is 2.3% which I wouldn't say was nothing to worry about. We have spent a lot of time and effort for optimisations smaller than that.
Of course this is 1000Hz and whatever changes "rt" implies, so the actual difference is likely smaller. But in general we wouldn't accept a 1% performance loss without a very compelling reason.
Would be good to do the same test with just CONFIG_HZ_300 and with just CONFIG_PREEMPT_VOLUNTARY and report the changes.
Just checked: Arch Linux ARM uses 200Hz and voluntary preemption. I will check the difference with a 250hz kernel.
Stock:
250hz-Voluntary:
1st attempt:
real 2m9.216s
user 2m9.184s
sys 0m0.012s
But don't forget that this is just one specific benchmark. The UnixBench results told another story.
Which was...?
So there is a 0.048% performance loss in this particular test.
Check the first post for the UnixBench results.
1.7% performance loss for 1 instance, 2.0% loss for 4 instances.
I think separate results for CONFIG_HZ_300 and CONFIG_PREEMPT_VOLUNTARY would be useful. My understanding is that CONFIG_HZ_300 will lower performance and CONFIG_PREEMPT_VOLUNTARY will improve it so seeing the individual effects is more useful.
Overall the 250hz kernel was faster in UnixBench. And a loss of 0.0x% pure number-crunching performance for better latency is not so bad imo. I don't know the numbers for a 300hz kernel as I was not asking for it. If someone has one compiled and installed they could run that test.
@popcornmix That really depends on what exactly you are doing.
Performance is hard to quantify without knowing the workload. For raw number-crunching applications (in other words, stuff that people are probably only running on the Pi because it's so energy efficient compared to an equivalent-performance x86 system), it will be a performance hit, but only if it's sitting in userspace most of the time instead of making syscalls. For something that's memory bound (like BOINC or other distributed computing applications), it's not as noticeable. For a desktop or media center, responsiveness is usually the preferred measure of performance (along the lines of 'when I hit this button to do X, how long does it actually take before X is done?'), in which case lowering latency is usually more important than maximizing raw processing power.
In general, I hear of people using the Pi in these specific ways:
Case 1 should be focused on usability over all else. If you're trying to teach someone something, the less they have to learn about the system being used to teach the lesson, the better. Being efficient is important too, but it really doesn't break the ability to teach, and having a lower performance system encourages writing efficient user code, which is a good skill to have.
Cases 2 and 3 are things that you should be focusing on minimal latency for, but you need to remember usability. These are the cases that are going to benefit the most from increasing HZ from 100 to either 250 or 300. Most media center products that I know of that run Linux use CONFIG_PREEMPT_VOLUNTARY or CONFIG_PREEMPT_FULL and run with HZ=300. With the state of things on the Pi right now, trying to actually multi-task on the desktop can be noticeably laggy and downright annoying.
Case 4 is usually going to focus on latency, and may benefit from a higher timer interrupt frequency. When it doesn't, then the users should be smart enough to build their own kernels.
Case 5 is really an edge case (I know exactly 3 people using the Pi for cluster computing: one is running Hadoop over 1000 nodes, the other two are using it with USB disks and GlusterFS over around 200 nodes, which is amazingly fast considering the 100M Ethernet interface and the bottleneck of using USB for both the network and disks). This is really the only case that truly benefits from maximizing computing throughput in an actually quantifiable way. Clustering on something like the Pi also means you should be building your own kernel (and probably your own userspace as well), so it really shouldn't be a major target for the general distributions.
I also think the standard kernel should fit Cases 2 and 3. Most people I know are using it as a headless server for network applications or as a small desktop. The headless server use will mostly also benefit from these changes. So what do you think?
So, to explain slightly further: the primary advantage of using HZ=100 is to minimize the number of timer interrupts occurring on the system. On a single-CPU system, or even something with only 4 or 8 cores, this is usually not going to have a very measurable impact on performance. Where it really starts to improve performance is on really big systems (think along the lines of dozens of cores per CPU, with multiple CPUs), and that's because the timer interrupt happens on all the processors on the system at the same time (unless you use skew_tick=1, which introduces a deterministic amount of jitter on each CPU so the timer interrupt happens at a different time on each; this has no real benefit for people not doing timing-sensitive workloads, and in fact increases power consumption on most systems). The reason you can run such systems with a lower timer interrupt frequency is that they can do real multi-tasking, and therefore it's more likely that a core will be idle to respond to whatever user input is occurring with minimal latency.
Sorry to butt in, been following thread with interest.
The standard kernel needs to cater to the educational needs, so needs to have a responsive desktop. Headless devices is not a major use case for education.
What is noticeable from the posts in this thread - no actual figures. Someone need to determine a set of tests to run that are representative for the major use case, and try them at the different settings proposed. That way the Foundation can make a valid assessment of the right figures to use.
I don't know much about scientific benchmarking. If you do, please give it a go. A download link to the modified kernel has been posted. I can say that even on the terminal the system feels snappier.
We can't really scientifically benchmark desktop responsiveness; it's way too subjective to properly measure. The closest we can get is probably latencytop, which will need its own config option turned on in the kernel to be usable (which will impact the results somewhat). unixbench may be worthwhile as a way to determine throughput, but that's still not great.
In general, I'd say that the particular things to look at are:
Ideally, we should get as many samples as reasonably possible, as these are things that may be impacted.
nbench is very simple for testing integer/floating point and memory operations. A more advanced GUI benchmark would be something like cairo-traces. There seems to be a cutdown version more suitable for embedded platforms here: https://github.com/ssvb/trimmed-cairo-traces
Benchmarks are by nature synthetic workloads, thus no one benchmark is going to give us a full picture of the system performance.
Based on this, the wider variety of benchmarks we test, the better. I do think that we should at a minimum do something to test disk performance, as that is usually one of the biggest bottlenecks on most SBC type systems. I think that cairo-traces is definitely worth testing with. I still think that benchmarking HTML rendering is worth doing as well, most people I personally know who use the Pi as a desktop primarily use it for web browsing, and rendering HTML is complex enough that it should show performance variance pretty well.
Thanks to this discussion I was able to solve a major issue I was having with an embedded system here: when building my kernels I always used the very common recommendation for audio workstations (mostly x86, with some power behind them), which is 1000Hz and PREEMPT.
Running squeezelite with sox-resampling, jack and brutefir with 96 kHz on the Pi (quite some tasks) lead to short crackling when the playlist changed from 96 kHz to 44.1 kHz.
Now I changed my kernel to PREEMPT_VOLUNTARY and 250 Hz and the issue is gone.
So again no benchmark but a hint.. ;) And thx from my side for bringing this up!
Yours is an easier decision, since as you have seen it reduces the overhead. Going from 100Hz to 250Hz adds some overhead, but possibly not enough to worry about - we won't know without some proper benchmarking.
This may be worth pointing out:
On the Pi, running with no overclocking and HZ=100, we get 7 million cycles between each timer interrupt. With HZ=250, it's approximately 2.8 million. Ignoring the scheduler overhead (because it's not deterministic, and I don't know what the minimum overhead for it is on ARM), that cuts the processor time per time slice down to 40% of what it was. This also ignores two specific things however:
HZ=1000, we get 7 million cycles
Surely that's a typo? 7GHz would be an aggressive overclock.
@pelwell yes, I meant at HZ=100
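The per-tick budget is simple arithmetic; a quick sketch assuming the Pi 1's 700MHz stock clock (which is what the 7-million figure implies):

```shell
# Cycles available between timer interrupts at various HZ settings,
# for a 700 MHz core (Pi 1 stock clock; Pi 2 runs at 900 MHz).
awk 'BEGIN {
  cpu_hz = 700e6
  n = split("100 250 300 1000", hz, " ")
  for (i = 1; i <= n; i++)
    printf "HZ=%s -> %.2f million cycles per tick\n", hz[i], cpu_hz / hz[i] / 1e6
}'
```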
Now think about this. Every 10ms the kernel looks for something to do. What if a process is finished after, let's say, 1.5 million cycles? The scheduler does not know that until the full 10ms are over. So for the rest of the slice the CPU just sits there, waiting for the scheduler to fire again. This causes wait loops for the rest of the processes, which might already have been done if the scheduler had been aware of the situation. I would not call this overhead like the Arch guys did, but it is surely inefficiency. But that's the situation on a desktop or server: a lot of processes, not one which is crunching numbers all day.
Actually, assuming that the process does something that causes it to go into one of the sleep states (S or D in top and similar tools), then the reschedule should happen almost immediately. The issue is when you have more tasks that can run than you have CPU cores to run them on. The lower the scheduler interrupt frequency, the more a task can do before it gets forcibly preempted by some other runnable task. (It's actually more complicated than this, because the Linux scheduler uses the nice value of a task not to prioritize its position in the queue of runnable tasks, but to adjust its scheduling time slice: lower nice values get longer time slices, and thus get to run more than higher nice values. This nicely solves many of the issues present in traditional priority-queue schedulers found in older UNIX systems like SVR4, but makes it much harder to truly determine the impact of adjusting the scheduler interrupt frequency.)
This is why increasing the frequency reduces raw computational throughput, but improves latency. Unlike most x86 desktops, the Pi has a slow enough processor that many things done on a desktop (for example, rendering this webpage, or fetching an e-mail) take more than one scheduling period to do, which means that they will tend to block other tasks from running for longer periods when the scheduling period is longer, thus hurting latency and responsiveness.
Please change the kernel to 250hz and voluntary preemption.
In everyday use my Pi with the modified kernel can take more work while still being responsive to all other running tasks. For example, when I run the MumbleRubyPluginbot (which needs to be fed with data every 20ms or so) with the stock kernel and start an apt upgrade of some packages, it quickly starts to lag. With the modified kernel everything is fine. You said you would need some evidence; I ran UnixBench on both kernels. I can't tell much about starting graphical apps because my Pi is running headless, but I guess you guys have cross-compilers set up and can easily build a modified kernel. By the way, these two settings are also used in almost every stock kernel in most distributions (Debian, Ubuntu...). And in my opinion there is a reason for that.
Unixbench Stock Kernel: https://robingroppe.de/media/rpi2/orig.txt Unixbench Modified Kernel: https://robingroppe.de/media/rpi2/mod.txt