Commented by: daschuer Date: 2014-12-19T22:53:13Z
Links from IRC: http://musicdsp.org/files/denormal.pdf http://ldesoras.free.fr/doc/articles/denormal-en.pdf
Commented by: daschuer Date: 2014-12-19T23:14:37Z
Possible solution: https://www.qt.gitorious.org/qt/qtwebengine-chromium/source/b68d7ce5e7e3466409c942c514064d45b31ee666:chromium/third_party/WebKit/Source/platform/audio/DenormalDisabler.h
Commented by: ywwg Date: 2014-12-20T00:29:13Z
I removed our denormal code earlier this year because it caused horrible audible spikes in the EQ filters as the wave approached zero, due to waveform discontinuity. Part of the problem was that our denormal code kicked in far outside the range where the CPU penalty actually applies. When RJ and I looked closer, we found that we already use a gcc flag to disable ultra-small values anyway, so the compiler is already handling denormals for us.
Commented by: ywwg Date: 2014-12-20T00:32:34Z
If you do want to pursue this change, I would require a battery of tests that show that none of our filters are thrown off by the change in the sound wave. But I would urge you to confirm that it's actually a problem by demonstrating the CPU impact first before writing a bunch of new code.
Commented by: daschuer Date: 2014-12-20T08:37:29Z
Can you recall which gcc flag it is? I cannot find it.
http://frozenfractal.com/blog/2010/3/11/optimization-story/:
" Luckily, there is an instruction to change the CPU’s behaviour: instead of storing denormalized values, these can simply be flushed to 0. Unfortunately, there is no standard library function for this. On Visual C++, we can do this:
_controlfp(_MCW_DN, _DN_FLUSH);
On gcc, we need some inline assembly. This was my first x86 assembly ever:
int mxcsr;
__asm__("stmxcsr %0" : "=m"(mxcsr) : :);
mxcsr |= (1 << 15); // set bit 15: flush-to-zero mode
__asm__("ldmxcsr %0" : : "m"(mxcsr) :);
"
Commented by: rryan Date: 2014-12-20T14:09:42Z
Hm, I had forgotten about that, Owen.
The flag is -ffast-math -- according to the GCC docs here it enables flush-to-zero on some platforms though it isn't specific about which ones: https://gcc.gnu.org/wiki/FloatingPointMath
Commented by: rryan Date: 2014-12-20T14:49:24Z Attachments: ftz_stats
On my MBP (x86_64) adding FTZ in an experiment didn't have much effect.
1) Turn off waveforms
2) Load song, wait for analysis to complete
3) Adjust EQs to non-neutral
4) Play song
5) Wait
6) Record 40 seconds of base
7) Record 40 seconds of experiment
Commented by: ywwg Date: 2014-12-20T16:06:58Z
Here's the other documentation I was using -- ffast-math + sse flags: http://carlh.net/plugins/denormals.php
Commented by: daschuer Date: 2014-12-22T21:38:17Z
The filter code does suffer from denormals. I checked this by adding:
if (!std::isnormal(buf[3])) {
qDebug() << "denormal";
}
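Note that `std::isnormal` also returns false for zero, infinity and NaN, so the check above fires on exact silence as well. A minimal sketch of a stricter check is given here, using `std::fpclassify`. The names `isSubnormal`, `OnePole` and `samplesUntilSubnormal` are hypothetical and only for illustration; none of this is Mixxx code. The sketch also shows how a recursive filter that is fed silence decays into the subnormal range.

```cpp
#include <cmath>

// Hypothetical helper: true only for subnormal (denormal) values,
// unlike !std::isnormal(), which is also true for 0, inf and NaN.
inline bool isSubnormal(float v) {
    return std::fpclassify(v) == FP_SUBNORMAL;
}

// Hypothetical one-pole recursive filter y[n] = a * y[n-1] + x[n].
// This is an illustration, not one of the Mixxx EQ filters.
struct OnePole {
    float a;  // feedback coefficient, |a| < 1
    float y;  // filter state
    float process(float x) {
        y = a * y + x;
        return y;
    }
};

// Feeds one impulse followed by silence and returns the number of
// silent samples processed until the state becomes subnormal,
// or -1 if that does not happen within maxSamples.
int samplesUntilSubnormal(float a, int maxSamples) {
    OnePole filter{a, 0.0f};
    filter.process(1.0f);  // impulse
    for (int n = 1; n <= maxSamples; ++n) {
        if (isSubnormal(filter.process(0.0f))) {
            return n;
        }
    }
    return -1;
}
```

With a = 0.5 and IEEE 754 single precision (no FTZ/DAZ active), the state is 2^-n after n silent samples, so `samplesUntilSubnormal(0.5f, 1000)` returns 127, because the smallest normal float is 2^-126.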
Commented by: daschuer Date: 2014-12-22T21:48:23Z
@RJ: Did you use an SSE3 build? It is possible that your code does nothing on default builds, see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21408
Commented by: ywwg Date: 2014-12-22T21:50:53Z
The last time someone tried to "fix" denormals, it caused major audio artifacts. If we're going to try to do this again, we're going to need:
This work should wait for post-release.
Commented by: daschuer Date: 2014-12-22T22:43:07Z
@Owen: What did you do last time to flush denormals?
I have just read that denormals are flushed by default in the Mac OS audio callback.
So it can't be that bad.
@RJ, can you verify that?
Commented by: ywwg Date: 2014-12-22T22:58:41Z
The previous fix involved checking whether the value was within abs(.000001), or some other threshold that was nowhere near small enough, and just setting the value to 0 if so. It was a really bad fix that was not tested well.
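To illustrate why such a threshold is a problem, here is a minimal sketch. `clampTiny` is a hypothetical reconstruction of the kind of check described above; the original Mixxx code is not shown in this thread.

```cpp
#include <cfloat>
#include <cmath>

// Hypothetical clamp: zeroes every value whose magnitude is below
// the given threshold and passes everything else through unchanged.
inline float clampTiny(float v, float threshold) {
    return std::fabs(v) < threshold ? 0.0f : v;
}
```

With a threshold of 1e-6f, `clampTiny(5e-7f, 1e-6f)` returns 0 although 5e-7f is a perfectly normal float, more than 30 orders of magnitude above FLT_MIN (about 1.18e-38), so audible signal is cut off abruptly. With FLT_MIN as the threshold, only subnormal values are zeroed.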
Commented by: daschuer Date: 2014-12-22T23:05:00Z
It looks like we can rely on this: http://carlh.net/plugins/denormals.php
@RJ, what does it mean for the default Mixxx optimization flags?
Commented by: daschuer Date: 2014-12-22T23:27:37Z
My results:
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
model name : Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz
subnormal 12.0642 times slower.
-ffast-math enabled: subnormal 1.14059 times slower.
SSE enabled FTZ=0 DAZ=0: subnormal 12.0405 times slower.
SSE enabled FTZ=0 DAZ=1: subnormal 0.999144 times slower.
SSE enabled FTZ=1 DAZ=0: subnormal 12.0839 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled: subnormal 1.00153 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled: subnormal 1.01593 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled: subnormal 1.0017 times slower.
Commented by: daschuer Date: 2014-12-23T06:59:07Z
Thread model: posix
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
subnormal 12.0694 times slower.
-ffast-math enabled: subnormal 1.01273 times slower.
SSE enabled FTZ=0 DAZ=0: subnormal 11.8485 times slower.
SSE enabled FTZ=0 DAZ=1: subnormal 0.991461 times slower.
SSE enabled FTZ=1 DAZ=0: subnormal 13.0207 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled: subnormal 0.96152 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled: subnormal 1.00095 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled: subnormal 1.04673 times slower.
Commented by: daschuer Date: 2014-12-23T13:45:18Z
On the same device, but in a virtual 32-bit OS:
Thread model: posix
gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux)
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
subnormal 23.8905 times slower.
-ffast-math enabled: subnormal 24.4369 times slower.
SSE enabled FTZ=0 DAZ=0: subnormal 10.5012 times slower.
SSE enabled FTZ=0 DAZ=1: subnormal 0.958384 times slower.
SSE enabled FTZ=1 DAZ=0: subnormal 11.4803 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled: subnormal 0.998909 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled: subnormal 0.990018 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled: subnormal 1.00707 times slower.
Commented by: ywwg Date: 2014-12-23T13:54:04Z
I'm not quite sure what all the acronyms mean. Would you suggest a change in our build flags?
Commented by: daschuer Date: 2014-12-23T14:20:52Z
Probably yes. I am still missing a test on my 32-bit Atom netbook to prove it, but I think we already have enough data for a conclusion.
1.) Since our filters are infinite impulse response (IIR) filters, they will produce denormals. I have proved this with a test.
2.) Because of the -ffast-math flag, 64-bit Mixxx builds suffer no penalty from denormals. I have proved this on my devices, and the column at http://carlh.net/plugins/denormals.php is green for 64-bit CPUs with -ffast-math only.
3.) There is a performance penalty on 32-bit Mixxx builds even though the -ffast-math flag is set. We need to enable SSE and set the DAZ flag to get the same benefit as the 64-bit build.
I do not know how large this is relative to the entire CPU time in the audio callback (I will test it later), but since we have a solution for 64-bit, we should solve the issue for 32-bit as well.
Commented by: daschuer Date: 2014-12-23T19:09:08Z
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
subnormal 45.0023 times slower.
-ffast-math enabled: subnormal 37.7372 times slower.
SSE enabled FTZ=0 DAZ=0: subnormal 19.1181 times slower.
SSE enabled FTZ=0 DAZ=1: subnormal 0.998239 times slower.
SSE enabled FTZ=1 DAZ=0: subnormal 18.8671 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled: subnormal 0.999303 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled: subnormal 0.999885 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled: subnormal 0.998566 times slower.
Commented by: daschuer Date: 2014-12-24T10:05:31Z
Test results from my Atom Notebook
Build with default settings
Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=2393,sum=7.27838e+09ns,average=3.04153e+06ns,min=910102ns,max=5.17381e+07ns,variance=5.35149e+13ns^2,stddev=7.31539e+06ns")
Debug [Main]: Stat("EngineMaster::process_duration","count=2678,average=3.44708e+06ns,min=207010ns,max=6.8616e+07ns,variance=5.4347e+13ns^2,stddev=7.37204e+06ns")
Build with scons -j2 optimize=2
Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=2200,sum=3.39283e+09ns,average=1.5422e+06ns,min=699111ns,max=2.511e+07ns,variance=7.1921e+12ns^2,stddev=2.68181e+06ns")
Debug [Main]: Stat("EngineMaster::process_duration","count=2448,average=2.35673e+06ns,min=188362ns,max=2.64581e+07ns,variance=8.66118e+12ns^2,stddev=2.94299e+06ns")
The average time for the EQ is nearly half that of the non-SSE version. Interestingly, the max value also differs only by a factor of two, not by the factor of 20 we might expect from the denormal calculations.
Conclusion: SSE brings a BIG benefit to 32-bit builds. This should be the default for source builds.
For binary distributions, we should strongly consider dropping Pentium 3 support ... or offering SSE and non-SSE builds.
It might be a problem for the Linux distros to drop Pentium 3 :-/
Commented by: ywwg Date: 2014-12-24T15:20:02Z
I would have no problem with dropping pentium 3. Even "old" netbooks are still going to have an Atom or Celeron or more modern CPU than a pentium 3.
Commented by: daschuer Date: 2014-12-25T18:50:41Z
It looks like DAZ is standard on armhf builds ...
https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
" If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=‘neon’), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision. "
Commented by: daschuer Date: 2014-12-26T00:16:50Z
This is the result on the same hardware as above using "optimize=2" and just -O2
Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=4418,sum=2.54514e+10ns,average=5.76084e+06ns,min=743041ns,max=3.95725e+07ns,variance=8.41122e+13ns^2,stddev=9.17127e+06ns")
Debug [Main]: Stat("EngineMaster::process_duration","count=4727,average=6.27221e+06ns,min=201701ns,max=4.46743e+07ns,variance=8.15258e+13ns^2,stddev=9.02916e+06ns")
The filter and engine code takes ~3 times longer. Conclusion: it is a good idea to use -O3 + -funroll-loops.
Commented by: daschuer Date: 2014-12-26T01:19:33Z
And yes, we need RJ's patch.
I get heavy load if I play a track in one deck using the Linkwitz-Riley EQ and turn the gain to zero. With the patch, there is no load change when turning it to zero. I can see similar results on an i5 notebook (x64) with small audio buffers.
Enabling DAZ helps. I do not have a clue why this happens on an SSE2 build. According to the test above it should not happen ... so there seems to be another issue.
Debug [Main]: =====================================
Debug [Main]: BASE STATS
Debug [Main]: =====================================
Debug [Main]: Stat("AnalyserQueue process","count=1")
Debug [Main]: Stat("CachingReaderWorker [Channel1]","count=574")
Debug [Main]: Stat("CachingReaderWorker [Channel2]","count=1")
Debug [Main]: Stat("CachingReaderWorker [PreviewDeck1]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler1]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler2]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler3]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler4]","count=1")
Debug [Main]: Stat("EngineBuffer::process_pauselock","count=2481,sum=1.97225e+08ns,average=79494.2ns,min=30311ns,max=1.09239e+06ns,variance=3.04507e+09ns^2,stddev=55182.2ns")
Debug [Main]: Stat("EngineMaster::mixChannels_0active","count=7443,sum=4.85571e+07ns,average=6523.86ns,min=3282ns,max=730540ns,variance=1.4778e+08ns^2,stddev=12156.5ns")
Debug [Main]: Stat("EngineMaster::mixChannels_1active","count=2481,sum=3.467e+07ns,average=13974.2ns,min=7333ns,max=833346ns,variance=6.11274e+08ns^2,stddev=24724ns")
Debug [Main]: Stat("EngineMaster::process","count=4962")
Debug [Main]: Stat("EngineMaster::processChannels","count=2480,sum=2.2165e+10ns,average=8.9375e+06ns,min=960248ns,max=5.28434e+07ns,variance=1.24014e+14ns^2,stddev=1.11361e+07ns")
Debug [Main]: Stat("EngineSideChain","count=191")
Debug [Main]: Stat("EngineSideChain::process","count=192")
Debug [Main]: Stat("EngineSideChain::writeSamples","count=4962")
Debug [Main]: Stat("EngineSideChain::writeSamples wake up","count=190")
Debug [Main]: Stat("EngineWorkerScheduler","count=572")
Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=2480,sum=2.11762e+10ns,average=8.5388e+06ns,min=743949ns,max=5.25722e+07ns,variance=1.24491e+14ns^2,stddev=1.11575e+07ns")
Debug [Main]: Stat("MixxxMainWindow::~MixxxMainWindow","count=1,sum=1.13623e+09ns,average=1.13623e+09ns,min=1.13623e+09ns,max=1.13623e+09ns,variance=0ns^2,stddev=0ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcess output 0, HDA Intel: ALC269 Analog (hw:0,0)","count=2481,sum=1.89776e+08ns,average=76491.7ns,min=22628ns,max=1.06907e+07ns,variance=2.21524e+11ns^2,stddev=470664ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcess prepare 0, HDA Intel: ALC269 Analog (hw:0,0)","count=2480,sum=2.34364e+10ns,average=9.45015e+06ns,min=1.14386e+06ns,max=5.31427e+07ns,variance=1.24151e+14ns^2,stddev=1.11423e+07ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcessClkRef 0, HDA Intel: ALC269 Analog (hw:0,0)","count=4962")
Debug [Main]: Stat("VsyncThread real time error","count=17,sum=17,average=1,min=1,max=1,variance=0^2,stddev=0")
Debug [Main]: Stat("VsyncThread usleep for VSync","count=3990")
Debug [Main]: Stat("VsyncThread vsync render","count=4024")
Debug [Main]: Stat("VsyncThread vsync swap","count=4025")
Debug [Main]: Stat("WOverview::paintEvent","count=92,sum=4.11161e+07ns,average=446914ns,min=22768ns,max=885378ns,variance=3.54065e+10ns^2,stddev=188166ns")
Debug [Main]: Stat("WVuMeter::paintEvent","count=3565,sum=2.34213e+08ns,average=65697.9ns,min=37365ns,max=1.74631e+06ns,variance=1.82733e+09ns^2,stddev=42747.3ns")
Debug [Main]: Stat("WaveformWidgetFactory::render() 2waveforms","count=2012,sum=2.01004e+10ns,average=9.99028e+06ns,min=5.18285e+06ns,max=4.00311e+07ns,variance=8.70301e+12ns^2,stddev=2.95009e+06ns")
Debug [Main]: Stat("WaveformWidgetFactory::swap() 2waveforms","count=2012,sum=2.95026e+09ns,average=1.46633e+06ns,min=880349ns,max=1.21566e+07ns,variance=5.5925e+11ns^2,stddev=747830ns")
Debug [Main]: =====================================
Debug [Main]: EXPERIMENT STATS
Debug [Main]: =====================================
Debug [Main]: Stat("CachingReaderWorker [Channel1]","count=191")
Debug [Main]: Stat("EngineBuffer::process_pauselock","count=762,sum=6.66809e+07ns,average=87507.8ns,min=50635ns,max=1.92692e+06ns,variance=8.42527e+09ns^2,stddev=91789.2ns")
Debug [Main]: Stat("EngineMaster::mixChannels_0active","count=2289,sum=1.52183e+07ns,average=6648.44ns,min=3841ns,max=331048ns,variance=9.07444e+07ns^2,stddev=9525.98ns")
Debug [Main]: Stat("EngineMaster::mixChannels_1active","count=763,sum=1.0052e+07ns,average=13174.3ns,min=7333ns,max=181657ns,variance=5.79651e+07ns^2,stddev=7613.48ns")
Debug [Main]: Stat("EngineMaster::process","count=1525")
Debug [Main]: Stat("EngineMaster::processChannels","count=762,sum=1.24482e+09ns,average=1.63363e+06ns,min=954591ns,max=5.73739e+06ns,variance=4.66768e+11ns^2,stddev=683204ns")
Debug [Main]: Stat("EngineSideChain","count=58")
Debug [Main]: Stat("EngineSideChain::process","count=58")
Debug [Main]: Stat("EngineSideChain::writeSamples","count=1526")
Debug [Main]: Stat("EngineSideChain::writeSamples wake up","count=58")
Debug [Main]: Stat("EngineWorkerScheduler","count=192")
Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=762,sum=9.44911e+08ns,average=1.24004e+06ns,min=744438ns,max=4.48667e+06ns,variance=3.10929e+11ns^2,stddev=557610ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcess output 0, HDA Intel: ALC269 Analog (hw:0,0)","count=763,sum=9.56296e+07ns,average=125334ns,min=22769ns,max=9.20683e+06ns,variance=4.19177e+11ns^2,stddev=647439ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcess prepare 0, HDA Intel: ALC269 Analog (hw:0,0)","count=762,sum=1.61254e+09ns,average=2.11619e+06ns,min=1.15643e+06ns,max=1.59874e+07ns,variance=1.52761e+12ns^2,stddev=1.23596e+06ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcessClkRef 0, HDA Intel: ALC269 Analog (hw:0,0)","count=1525")
Debug [Main]: Stat("VsyncThread real time error","count=3,sum=3,average=1,min=1,max=1,variance=0^2,stddev=0")
Debug [Main]: Stat("VsyncThread usleep for VSync","count=1056")
Debug [Main]: Stat("VsyncThread vsync render","count=1062")
Debug [Main]: Stat("VsyncThread vsync swap","count=1062")
Debug [Main]: Stat("WOverview::paintEvent","count=28,sum=1.42161e+07ns,average=507719ns,min=21930ns,max=1.01961e+06ns,variance=6.31043e+10ns^2,stddev=251206ns")
Debug [Main]: Stat("WVuMeter::paintEvent","count=698,sum=4.3623e+07ns,average=62497.1ns,min=38273ns,max=285162ns,variance=5.41059e+08ns^2,stddev=23260.7ns")
Debug [Main]: Stat("WaveformWidgetFactory::render() 2waveforms","count=531,sum=5.20469e+09ns,average=9.80168e+06ns,min=5.27043e+06ns,max=2.03392e+07ns,variance=5.95697e+12ns^2,stddev=2.44069e+06ns")
Debug [Main]: Stat("WaveformWidgetFactory::swap() 2waveforms","count=531,sum=7.55607e+08ns,average=1.42299e+06ns,min=884540ns,max=7.48203e+06ns,variance=4.76785e+11ns^2,stddev=690496ns")
Debug [Main]: =====================================
Debug [Main]: Mixxx shutdown complete with code 0
Commented by: daschuer Date: 2014-12-26T15:02:06Z
There seems to be a mess around SSE2 / SSE3 and Pentium 4.
See:
http://sourceforge.net/p/lmms/mailman/message/32988535/ "
https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz " Initial steppings of Pentium® 4 processors did not support DAZ "
Commented by: ywwg Date: 2014-12-26T15:31:59Z
Are there any pentium 4 laptops out there? Realistically, how many people might this affect? I would venture to guess ~zero
Commented by: ywwg Date: 2014-12-26T16:02:30Z
I couldn't find a good hardware survey that showed breakdown by processor model. I also googled around to find out which steppings do or do not support DAZ. The pentium4 was in production from 2000-2008, but I'd guess that it's the really really old models that would have trouble with this mode.
This document does show how to detect if the mode is supported, if it comes to that: http://datasheets.chipdb.org/Intel/x86/CPUID/24161817.pdf
We should have at least one build of Mixxx that is super-safe: 32-bit, no special flags, just in case. But I'm still leaning toward the default build using DAZ mode.
Commented by: daschuer Date: 2014-12-26T22:36:49Z
Cool, this doc verifies the DAZ issue :-/ It could be a lot of fun to port this dazdetect.asm to gcc and MSVC, but I doubt it is worth the time.
For now we have the "portable" build for SSE2 CPUs, which either support the DAZ flag or lack it but do not crash when it is enabled, and the "legacy" build for all older CPUs.
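For reference, the detection described in the Intel document could be sketched without hand-written assembly roughly as follows. This is only a sketch under assumptions: it targets GCC or Clang on x86 with FXSR enabled (the `__FXSR__` macro), and `cpuSupportsDaz` is a hypothetical name, not a port of dazdetect.asm. It reads the MXCSR_MASK field at byte offset 28 of the FXSAVE area and tests bit 6.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__FXSR__)
#include <x86intrin.h>  // _fxsave
#endif

// Hypothetical helper: reports whether MXCSR bit 6 (DAZ) may be set.
// Returns false when the build cannot use FXSAVE.
bool cpuSupportsDaz() {
#if defined(__FXSR__)
    // FXSAVE needs a 512-byte, 16-byte-aligned buffer.
    alignas(16) unsigned char area[512];
    std::memset(area, 0, sizeof(area));
    _fxsave(area);

    // MXCSR_MASK lives at byte offset 28 of the FXSAVE image.
    std::uint32_t mxcsrMask = 0;
    std::memcpy(&mxcsrMask, area + 28, sizeof(mxcsrMask));
    if (mxcsrMask == 0) {
        // A zero mask means the documented default mask applies,
        // in which bit 6 is clear (no DAZ support).
        mxcsrMask = 0x0000FFBFu;
    }
    return (mxcsrMask & (1u << 6)) != 0;
#else
    return false;
#endif
}
```

A build could call such a check once at startup and set the DAZ bit only when it returns true.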
Issue closed with status Fix Released.
Reported by: daschuer Date: 2014-12-19T22:52:12Z Status: Fix Released Importance: Medium Launchpad Issue: lp1404401 Attachments: ftz_stats, ftz.patch
It looks like we need to add it, because processing denormals may cost 100 times more CPU.