ilia3101 / MLV-App

All in one MLV processing app.
https://mlv.app/
GNU General Public License v3.0
272 stars 29 forks source link

RCD demosaic #223

Closed heckflosse closed 2 years ago

heckflosse commented 3 years ago

Hi, I stumbled upon MLV-App by a PR @SimonSegerblomRex made for RawTherapee.

Now I wonder, why you don't use RCD demosaic. It's on par with Amaze quality wise while being much faster.

It's available in RawTherapee, darktable (dev), siril, Filmulator and ART.

The easiest way to use it would be to use https://github.com/CarVac/librtprocess which also includes the ca_correct code which allows more than one iteration.

Just an idea ;-)

Ingo

masc4ii commented 3 years ago

Hi, thanks for such ideas... that's very welcome. I did a quick add https://github.com/ilia3101/MLV-App/commit/9c31863850f305988c70af61281bd867b43f3307

But unfortunately it isn't so easy to use your lib as is... maybe I should create a C wrapper for it, because our processing code is mainly in C. But I really like the interface!

What about the result of ca_correct - does it look different than earlier versions? In our project I don't use it very much, because the result looks mostly worse than without using it.

Yes, RCD seems to be very close to AMaZE quality wise. But here RCD is a little slower. But my computer isn't the fastest anyway and probably not made for benchmarking such algorithms ;)

Anyway. Thanks again for the idea!

heckflosse commented 3 years ago

What about the result of ca_correct - does it look different than earlier versions?

For 99% of the cases two iterations are better than one. There are very rare cases where up to 5 iterations are needed and also very rare cases where two iterations are not better than one.

But here RCD is a little slower

That's strange. Here RCD is about twice as fast than amaze

Edit: btw: I also detected, that the librtprocess rcd code is slower than the RawTherapee rcd code (though I have no idea why this is the case). Maybe try the rcd code from RT...

I can also provide comparisons between RT amaze code and rcd code concerning performance (if needed)

heckflosse commented 3 years ago

Edit; the rcd code needs to be compiled using -O3 to vectorize...

heckflosse commented 3 years ago

maybe I should create a C wrapper for it, because our processing code is mainly in C. But I really like the interface!

Feel free to create a PR for librtpocess with a C wrapper :+1:

masc4ii commented 3 years ago

Feel free to create a PR for librtpocess with a C wrapper 👍

I tried to start that, but I don't get librtprocess compiled on my system. Maybe macOS Mavericks is too old. But it is the currently minimum supported by MLVApp. The compiler don't knows < algorithm > and other includes...

heckflosse commented 3 years ago

I tried to start that, but I don't get librtprocess compiled on my system. Maybe macOS Mavericks is too old. But it is the currently minimum supported by MLVApp. The compiler don't knows < algorithm > and other includes...

You could also try the rt version

heckflosse commented 3 years ago

The compiler don't knows < algorithm > and other includes...

Update your requirements ;-)

heckflosse commented 3 years ago

And, don't give up. Especially for video, demosaic speed is important in my opinion, as there a lot of frames have to be demosaiced. I will help on this :+1:

masc4ii commented 3 years ago

Yes, that's absolutely right. Speed is so important! Feel so stupid to ask... but what is required to make the compiler happy in terms of those C++ libs? Does it need C++20? Then I am lost on this old platform 😄

masc4ii commented 3 years ago

Found it... "sleefsseavx.c" must have a cpp ending... then the compiler is happy. 😄 ... progress...

DeafEyeJedi commented 3 years ago

Oh yeah baby I am feeling this one....

Sent from my iPhone Xr

On Mar 29, 2021, at 11:51 AM, masc4ii @.***> wrote:

Found it... "sleefsseavx.c" must have a cpp ending... then the compiler is happy. 😄 ... progress...

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

heckflosse commented 3 years ago

Hmm, I wonder why you need sleefsseavx.c for rcd demosaic ...

heckflosse commented 3 years ago

btw: the darktable implementation is plain C

https://github.com/darktable-org/darktable/blob/master/src/iop/rcd_demosaic.c

masc4ii commented 3 years ago

I yesterday did my own changings on top of your librtprocess rcd version. This works nicely in MLVApp now.

Now - next thing I would like to try: using librtprocess instead of all our own code. For compiling the entire MLVApp in one step automatically, I would like to integrate librtprocess as is plus wrapper into MLVApp. And the librtprocess code doesn't compile on macOS with llvm-clang compiler, until you rename your included sleefsseavx.c to .cpp. With that strategy I succeeded last night to use librtprocess lmmse, amaze, rcd and igv. Amaze looks a little strange: there are many vertical (mostly pink) artifacts in the picture - but not in all clips I tested. No such problem with our Amaze version. The other debayers work as expected so far.

heckflosse commented 3 years ago

I pushed a change to librtprocess which should fix your compilation issue

https://github.com/CarVac/librtprocess/commit/838f25c4a16afd0ead69e496615c960f7c115c22

heckflosse commented 3 years ago

Amaze looks a little strange: there are many vertical (mostly pink) artifacts in the picture - but not in all clips I tested

can you show a screenshot?

heckflosse commented 3 years ago

About the speed difference between Amaze and RCD

I measured a 45.6 MP file on my old AMD FX8350

Amaze: ~550 ms RCD: ~260 ms

That was using gcc 10.2 on Windows/msys2

I know that using clang RCD is slower. Maybe we can find the reason with your help ;-)

masc4ii commented 3 years ago

I can do a screenshot later.

On Win10, mingw64, i7-4770, 1856x1044 I get in MLVApp (most features disabled):

On macOS 10.14, llvm clang, MBPR 2013 13", i5, same clip, same settings:

On macOS 10.15, llvm clang, MBPR 2013 15", i7, same clip, same settings:

Suer, I don't just measure the time for debayering. But the entire processing is identical. I just switched the debayer.

heckflosse commented 3 years ago

Hmm, strange... was rcd compiled using -O3 in your Win10/mingw test?

Edit: for good speed of rcd demosaic (at least when using gcc) you need to use -O3 or -ftree-vectorize

As an alternative you can also use (that's what darktable does)

#ifdef __GNUC__
  #pragma GCC push_options
  #pragma GCC optimize ("fast-math", "fp-contract=fast") 
#endif

at the beginning of the rcd source and

// revert rcd specific aggressive optimizing 
#ifdef __GNUC__
  #pragma GCC pop_options
#endif

at the end

masc4ii commented 3 years ago

Here comes the screenshot of the faulty AMaZE (zoom in on and around the white boat, there are vertical artifacts, which are not there normally): M23-1205_frame_1

And this is the difference to our AMaZE algorithm (I added some contrast to make it more visible): M23-1205_frame_1

Btw: maybe some parameters are wrong and cause this. What should I do with initGain, inputScale and outputScale? I set this all to 1.0 for testing.

I tried all what you wrote (-O3/-ftree-vectorize/the pragmas) on your/my changed RCD code only for now. It makes no difference at all. Maybe I did something wrong when adapting it to our C and MLVApp concept and it is broken for being accelerated. CPU is at 35% only when using RCD. For AMaZE it is up to 85%. Both % values for the Windows version. On macOS it is always at 100%. --> Maybe I can test the entire librtprocess in the next days. We'll see if it works there.

heckflosse commented 3 years ago

CPU is at 35% only when using RCD

That sounds really wrong. What is the chunkSize value you used?

masc4ii commented 3 years ago

I set the chunkSize to the number of threads I can drive. But also here: if I set it to 1 or 2 or 8 doesn't change anything.

heckflosse commented 3 years ago

I set the chunkSize to the number of threads I can drive.

Usually the best value for chunkSize is 2. For really large raw files a value up to 6 can be better. The chunkSize is just a cache thing. I will explain it later.

heckflosse commented 3 years ago

Anyway, as you tried out several values, it must be something else...

heckflosse commented 3 years ago

which gcc version are you using on Windows?

masc4ii commented 3 years ago

which gcc version are you using on Windows?

I think it is Mingw64 7.3.0.

heckflosse commented 3 years ago

ping @jenshannoschwalm

Maybe he has an idea. We collaborated to improve rdc speed...

heckflosse commented 3 years ago

I think it is Mingw64 7.3.0.

what does gcc --version spit out?

masc4ii commented 3 years ago

It is: gcc (x86_64-posix-seh-rev0, Built by MinGW-W64-project) 7.3.0

masc4ii commented 3 years ago

Oh nice... new benchmark: On macOS 10.14, llvm clang, MBPR 2013 13", i5, same clip, same settings as above:

Old MLVApp from yesterday, our old AMaZE and the adapted RCD:

MLVApp with librtprocess and C wrapper, so both librtprocess' AMaZE and RCD:

heckflosse commented 3 years ago

It is: gcc (x86_64-posix-seh-rev0, Built by MinGW-W64-project) 7.3.0

Why such an old gcc version? msys2 is on 10.2 already

masc4ii commented 3 years ago

It is the standard which comes with Qt. We compile the entire App with the help of Qt.

heckflosse commented 3 years ago

For Windows builds we (RawTherapee) always rely on what pacman -Syuu provides us. Currently it's gcc 10.2

jenshannoschwalm commented 3 years ago

mmm,

  1. about the AMaZE artefacts, we saw something like this in dt when using fastmath
  2. the dt clang rcd code is much slower here too - not understodd why so far. While looking at the gaussian 9x9 blurs there are these pragmas for the loop
  for(int row = 4; row < height - 4; row++)
  {
#if defined(__clang__)
        #pragma clang loop vectorize(assume_safety)
#elif defined(__GNUC__)
        #pragma GCC ivdep
#endif
    for(int col = 4; col < width - 4; col++)
    {
      const int i = row * width + col;
.......
masc4ii commented 3 years ago

The AMaZE problem is influenced by the tilesize: If I set ts=992, all artifacts are gone (and it is slow). ts=96 produces even more artifacts 😄

heckflosse commented 3 years ago

In RawTherapee we use ts = 160 for years without problems

masc4ii commented 3 years ago

Correct. But I just can tell you what I see. Our old AMaZE algorithm work just fine; same for other algorithms from your library. Who knows why...

In AHD I also get some very small artifacts which I don't get with our old AHD. It mostly looks identical, but look the blue and yellow dots on the water (in our old AHD implementation this is just smooth, same for AMaZE, RCD and others): Bildschirmfoto 2021-03-31 um 18 59 42

On more question: what must I do to use the auto feature in CA_correct? And what are those fitParams? All I can get to work (at least I see any difference) is the cared and cablue setting - but it is very hard to make CAs invisible with that. A cyan edge will always look a little cyan.

Uncorrected: Bildschirmfoto 2021-03-31 um 19 05 16

Corrected: Bildschirmfoto 2021-03-31 um 19 05 28

A different CA postprocessing algorithm: Bildschirmfoto 2021-03-31 um 19 06 30

masc4ii commented 3 years ago

The file renaming fix you did: it works fine now. Many thanks!

heckflosse commented 3 years ago

what must I do to use the auto feature in CA_correct?

set autoCA to true and autoIterations to a value > 0, preferrably 2

heckflosse commented 3 years ago

And what are those fitParams?

You don't need to set them. They will be passed back from the auto ca correction in case you want to use the same values for several frames (for example all frames of a pixel shift raw). Just set fitParamsIn to false for non pixelshift files.

masc4ii commented 3 years ago

set autoCA to true and autoIterations to a value > 0, preferrably 2

Okay... that also is what I already thought and tried. Unfortunately all edges get extremely green. Bildschirmfoto 2021-03-31 um 19 36 25

Measure gives this output: CA correcting 1488x1900 image with 2 tiles per thread numblox = 0 CA correction took 136 ms

masc4ii commented 3 years ago

In the end, this should be the call for testing (anamorphic picture, unsquezzed in post):

CA_correct( 0, 0, 1488, 1900, true, 2, 0.0, 0.0, false, rawData, rawData, cfa, sPC, fitParams, false, 65535.0f, 65535.0f, 2, true );

cfa = {{0,1},{1,2}}

Is it a problem, that in=out? I just saw the comment: "for CA_correct rawDataIn and rawDataOut may point to the same buffer."

jenshannoschwalm commented 3 years ago

back to the demosaicers for a moment. May be it's not the point but the artefacts you observe are related to the tiling borders.

While we did the rcd pull request it became obvious that other code had such problems that i missed first. The problem was fast-math working fine with rcd but not with amaze.

I dont think that the tilesize shuld matter at all. It's just about performance and cache fitting. BUT fast-math makes assumptions which are at least not met for amaze.

masc4ii commented 3 years ago

If someone else like to play with the current state... here a commit for testing: https://github.com/ilia3101/MLV-App/commit/2604ccae55205ca3d816249edc4154eab538748f

Description of what works so far can be found in the commit message.