mm2 / Little-CMS

A free, open source, CMM engine. It provides fast transforms between ICC profiles.
https://www.littlecms.com
MIT License
549 stars 174 forks source link

Eval4Inputs optimisation #422

Closed glennwilton closed 8 months ago

glennwilton commented 9 months ago

Hi

I might be wrong, I don't really know C well but I noticed something , in Eval4Inputs since the CMYK lut is arranged so that K0 and K1 are the last items, they are right next to each other in the LUT (or offset by a constant K1), You don't need to recalculate the position again, K1 its basically the next 4 channels in the LUT array

since rx,ry,rz are not dependent on rk don't need to redo the nested ifs

you can just

if (rx >= ry && ry >= rz) {

c1 = DENS(X1, Y0, Z0) - c0; c2 = DENS(X1, Y1, Z0) - DENS(X1, Y0, Z0); c3 = DENS(X1, Y1, Z1) - DENS(X1, Y1, Z0);

LutTable += K1;

d1 = DENS(X1, Y0, Z0) - c0; d2 = DENS(X1, Y1, Z0) - DENS(X1, Y0, Z0); d3 = DENS(X1, Y1, Z1) - DENS(X1, Y1, Z0);

then

Rest1 = c1 rx + c2 ry + c3 rz; Rest2 = d1 rx + d2 ry + d3 rz; Output[i] = LinearInterp(rk, Rest1, Rest2);

Also, you don't need to do DENS twice as K is linear in memory

p = X1+Y0+Z0+OutChan (Position in memory) c1 = LutTable[p] (get K0) p1 += TotalOut (the number of bytes to next data ... should be nChannels * number of bits? It's a constant I think.) d1 = LutTable[p] (Get K1)

However in this case you will need to check if K1 > K0, K0 == K1 or rk==0 then you can just do the first loop of the same function and no need to interpolate Rest1 / Rest2 as values will be identical of rk=0 second value is multiplied by 0 and makes no change.

It saves just a few is and some additions but since CMYK > RGB is common it might help even if a few percent faster

mm2 commented 9 months ago

Thanks @glennwilton The current code is such because it has been tested for speed in many optimizing compilers and this variant runs fast. The two loops are vectorized in some cases. I choose to keep this code for clarity sake. Also, I found the whole tetrahedra partition has to be recomputed because sometimes is different when K advances. Assembly code produced by optimizer is quite different from C, so those additions probably are not there.

If you want to provide a PR, I will be glad to run it across an extensive throughput test bed I keep for such cases.

glennwilton commented 9 months ago

Thanks, I did not consider compiler optimisations, which negate the complexity; I'm using a similar function in Javascript on 3D/4D Luts and noticed the optimisations that make a big difference in JS, as I need all the optimisations I can get!!

mm2 commented 9 months ago

Yep, JS is a quite different thing. You can try what modern optimizers does in C here:

https://godbolt.org/

I tried this code with -Ofast as options:

int test()
{
    int a = 0;

    for (int i=0; i < 10; i++)
    {   
        a += i*2;
    }

    for (int j=0; j < 10; j++)
    {   
        a -= j/3;
    }

return a;
}

And the compiler generated this assembly:


test():
        mov     eax, 78
        ret
glennwilton commented 9 months ago

That is interesting. It has been a long time since I did anything In C.

Out of interest are there any utils for testing the throughput performance of LittleCMS, I'm curious to see the speed difference between LCMS and what I've done in JS. My C is very rusty, but I suspect I could modify transicc.

On my Ryzen7 3700X in Javascript (all single Core/ No threads) using a prebuilt RGB->CMYK LUT 33x33x33 with 64bit float values , and 8bit input/out arrays , Chrome is pushing through 39 million pixels per second, and Firefox 36 million pixels per second. Fast enough for what I need but interesting to see what C code can pump though.

mm2 commented 8 months ago

No defect was described here, so I close the issue