Speed up resampling - Githubissues

riaanvddool commented 13 years ago

I have some c++ code for cubic convolution (CC) resampler. Will try my hand at adding a c++ CC resampler using ctypes.

nfaggian commented 13 years ago

I am keen to help with this - I think that the nearest neighbour sampler is probably a good starting point.

riaanvddool commented 13 years ago

This is the code I will be using as a reference:

int QgsImageAligner::cubicConvolution(double *array, QgsComplex point) { double xShift = point.real(); double yShift = point.imag();

double _xArray = new double[4]; double *yArray = new double[4]; xArray[0] = -(1/2.0)_pow(xShift, 3) + pow(xShift, 2) - (1/2.0)_xShift; xArray[1] = (3/2.0)_pow(xShift, 3) - (5/2.0)_pow(xShift, 2) + 1; xArray[2] = -(3/2.0)_pow(xShift, 3) + 2_pow(xShift, 2) + (1/2.0)_xShift; xArray[3] = (1/2.0)_pow(xShift, 3) - (1/2.0)_pow(xShift, 2);

yArray[0] = -(1/2.0)_pow(yShift, 3) + pow(yShift, 2) - (1/2.0)_yShift; yArray[1] = (3/2.0)_pow(yShift, 3) - (5/2.0)_pow(yShift, 2) + 1; yArray[2] = -(3/2.0)_pow(yShift, 3) + 2_pow(yShift, 2) + (1/2.0)_yShift; yArray[3] = (1/2.0)_pow(yShift, 3) - (1/2.0)*pow(yShift, 2);

int result = 0; for(int y = 0; y < 4; y++) { int rowContribution = 0; for(int x = 0; x < 4; x++) { rowContribution += xArray[x] * array[y4 + x]; } result += rowContribution \ yArray[y]; }

delete [] xArray; delete [] yArray;

return result; }

riaanvddool commented 13 years ago

Could probably be sped up by unrolling the for loops and getting rid of the memory allocation...

stefanv commented 13 years ago

Note that C++ compilers obfuscate function names to the extent that they cannot be used with ctypes, so you'll have to place the code in "extern C" clauses. Otherwise, the same code can also be wrapped cleanly using Cython.

stefanv commented 13 years ago

Another thing about the code given above is that it may cause more memory churning than is necessary. So, if speed becomes a problem, it may be good to load sub-sections of the image into a local buffer, work on that, and then proceed to load another. If this is a straight-forward convolution (I wasn't sure), it may be best to use a buffered implementation already written, such as the convolution in ndimage.

nfaggian commented 13 years ago

Interesting code snippet.

We should have a test (tests/test_samplers.py) that uses cprofile to really identify how bad things are.

This is a qualitative but nice test:

for each sampler: for n in 100:1024 1) form an n by n image 2) form a random n by n warp field 3) store the time

Plot sampler time vs image size - would be really nice to see.

riaanvddool commented 13 years ago

Here is my (optimized) CC code:

        di = (int)floor(warp[0][i][j]);
        dj = (int)floor(warp[1][i][j]);

        if ( ( di < rows-2 && di >= 2 ) &&
             ( dj < cols-2 && dj >= 2 ) )
        {
            xShift = warp[1][i][j] - dj;
            yShift = warp[0][i][j] - di; 
            xArray0 = -(1/2.0)*pow(xShift, 3) + pow(xShift, 2) - (1/2.0)*xShift;
            xArray1 = (3/2.0)*pow(xShift, 3) - (5/2.0)*pow(xShift, 2) + 1;
            xArray2 = -(3/2.0)*pow(xShift, 3) + 2*pow(xShift, 2) + (1/2.0)*xShift;
            xArray3 = (1/2.0)*pow(xShift, 3) - (1/2.0)*pow(xShift, 2);
            yArray0 = -(1/2.0)*pow(yShift, 3) + pow(yShift, 2) - (1/2.0)*yShift;
            yArray1 = (3/2.0)*pow(yShift, 3) - (5/2.0)*pow(yShift, 2) + 1;
            yArray2 = -(3/2.0)*pow(yShift, 3) + 2*pow(yShift, 2) + (1/2.0)*yShift;
            yArray3 = (1/2.0)*pow(yShift, 3) - (1/2.0)*pow(yShift, 2);                
            c0 = xArray0 * image[di-1][dj-1] + xArray1 * image[di-1][dj+0] + xArray2 * image[di-1][dj+1] + xArray3 * image[di-1][dj+2];
            c1 = xArray0 * image[di+0][dj-1] + xArray1 * image[di+0][dj+0] + xArray2 * image[di+0][dj+1] + xArray3 * image[di+0][dj+2];
            c2 = xArray0 * image[di+1][dj-1] + xArray1 * image[di+1][dj+0] + xArray2 * image[di+1][dj+1] + xArray3 * image[di+1][dj+2];
            c3 = xArray0 * image[di+2][dj-1] + xArray1 * image[di+2][dj+0] + xArray2 * image[di+2][dj+1] + xArray3 * image[di+2][dj+2];
            result[i][j] = c0 * yArray0 + c1 * yArray1 + c2 * yArray2 + c3 * yArray3;
        }
        else
        {
            result[i][j] = 0.0;
        }

riaanvddool commented 13 years ago

I agree with Stefan's observation about memory use. The memory usage is quite high already especially when you dont downscale the lenna image. Not sure if we have to bother with a tiling scheme for now, but lets just keep memory usage in mind...

riaanvddool commented 13 years ago

I am ready to commit and push the CC sampler, but I forgot to branch before starting. Would I be forgiven for pushing this to master for now? Shouldnt break anything else.

riaanvddool commented 13 years ago

Speed seems to be quite good! Worst case is about 2x NN but still much faster than spline:

Nearest : 1920x1920 image - 118.870 ms Cubic : 1920x1920 image - 251.761 ms Spline : 1920x1920 image - 2085.687 ms

nfaggian commented 13 years ago

This is great Riaan. If we had a branch we could do a code review - but this time I think its safe as long as there is a test ;-)

riaanvddool commented 13 years ago

Ok, i have pushed to master for now...

pyimreg / python-register

Speed up resampling #13