scikit-image / scikit-image

Image processing in Python
https://scikit-image.org

Color conversion code is very slow #1133

Open stefanv opened 10 years ago

stefanv commented 10 years ago

As reported by @holtzhau, the skimage color conversion code is approx. 25x slower than that in OpenCV:

import time
import cv2
from skimage import data, color

image = data.chelsea()

t = time.time()
lab_image = cv2.cvtColor(image, cv2.COLOR_BGR2Lab)
t_cv = time.time() - t

t = time.time()
lab = color.rgb2lab(image)
t_sk = time.time() - t

print('cv:', t_cv)
print('skimage:', t_sk)
print('factor:', t_sk / t_cv)

Results:

cv: 0.00397992134094
skimage: 0.0965170860291
factor: 24.2510034146
juliantaylor commented 10 years ago

The major runtime cost is in the power calls; cv2 does not seem to use them.

 26.46%  ipython  libm-2.19.so                  [.] __ieee754_pow_sse2
 17.45%  ipython  libm-2.19.so                  [.] __exp1
  5.65%  ipython  libm-2.19.so                  [.] __pow
  4.85%  ipython  libc-2.19.so                  [.] _wordcopy_fwd_aligned
  3.99%  ipython  multiarray.so                 [.] _aligned_strided_to_contig_size8
  3.38%  ipython  [kernel.kallsyms]             [k] clear_page_c
  1.84%  ipython  umath.so                      [.] sse2_binary_scalar2_divide_DOUBLE
  1.71%  ipython  umath.so                      [.] pow@plt
  1.68%  ipython  multiarray.so                 [.] PyArray_MatrixProduct2

numpy should probably add a cbrt specialization for power(x, 1./3.)
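
For context, a quick micro-benchmark sketch of that difference (the array size is arbitrary and timings are machine-dependent; np.cbrt is available since NumPy 1.10):

import timeit

import numpy as np

x = np.random.rand(512, 512)

# generic power call, as currently used for the 1/3 exponent
t_pow = timeit.timeit(lambda: np.power(x, 1.0 / 3.0), number=100)

# dedicated cube root
t_cbrt = timeit.timeit(lambda: np.cbrt(x), number=100)

print('np.power(x, 1/3):', t_pow)
print('np.cbrt(x)      :', t_cbrt)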

blink1073 commented 9 years ago

I'm going to try this Cython function and see what difference it makes (mental note to self):

#cython: cdivision=True
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False
cimport numpy as cnp
from libc.math cimport cbrt, pow, sqrt


def power(cnp.float64_t[::1] arr, double exp):
    """
    Raise elements to a given power, element-wise and in place.

    Parameters
    ----------
    arr : ndarray of float
        Input array (1-D, contiguous).
    exp : float
        The exponent to which the array elements are raised.

    Returns
    -------
    out : ndarray of float
        Array raised to the given power.
    """
    cdef Py_ssize_t i

    if exp == 1 / 3.:
        # dedicated cube root from libm
        for i in range(arr.shape[0]):
            arr[i] = cbrt(arr[i])
    elif exp == 1 / 2.:
        # dedicated square root from libm
        for i in range(arr.shape[0]):
            arr[i] = sqrt(arr[i])
    else:
        # generic pow fallback
        for i in range(arr.shape[0]):
            arr[i] = pow(arr[i], exp)

    return arr
juliantaylor commented 9 years ago

As long as you don't remove the power calls, Cython won't help you at all. FWIW, numpy 1.10 has the cbrt function now, but it's at most 2 times faster. Your power function will be slower than numpy for sqrt, since numpy's sqrt is vectorized.

blink1073 commented 9 years ago

Well then color me confused, because opencv uses pow as well.

juliantaylor commented 9 years ago

It seems OpenCV builds a lookup table to avoid pow calls later. I didn't see pow when I profiled the OpenCV variant; since I used timeit, the lookup-table generation is not significant in the timing.
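
The idea, roughly (a sketch of the general LUT technique, not OpenCV's actual initLabTabs code): evaluate the expensive nonlinearity once per possible input level rather than once per pixel, so applying it to a uint8 image later is just gamma_lut[image]:

import numpy as np

# the 256 possible uint8 input levels, scaled to [0, 1]
levels = np.arange(256) / 255.0

# sRGB gamma expansion tabulated once; per-pixel pow calls become indexing
gamma_lut = np.where(levels > 0.04045,
                     ((levels + 0.055) / 1.055) ** 2.4,
                     levels / 12.92)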

blink1073 commented 9 years ago

Now we're getting somewhere!

blink1073 commented 9 years ago

I created a gist with the adapted initLabTabs method.

ahojnnes commented 9 years ago

These LUTs could even be stored as an npy file for further speedup.
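
A sketch of that idea (the cache-file name is hypothetical); note that a 256-entry table is nearly free to build, so on-disk caching would mainly pay off for larger, e.g. 16-bit, tables:

import os

import numpy as np

LUT_PATH = 'lab_luts.npy'  # hypothetical cache-file name

if os.path.exists(LUT_PATH):
    gamma_lut = np.load(LUT_PATH)
else:
    # rebuild and cache the sRGB gamma table from the sketch above
    levels = np.arange(256) / 255.0
    gamma_lut = np.where(levels > 0.04045,
                         ((levels + 0.055) / 1.055) ** 2.4,
                         levels / 12.92)
    np.save(LUT_PATH, gamma_lut)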

blink1073 commented 9 years ago

Here is a fairer comparison:

import time
import cv2
import numpy as np
from skimage import data, color, img_as_float

image = img_as_float(data.chelsea()).astype(np.float32)

t = time.time()
lab_image = cv2.cvtColor(image, cv2.COLOR_RGB2Lab)
t_cv = time.time() - t

t = time.time()
lab = color.rgb2lab(image)
t_sk = time.time() - t

print('cv:', t_cv)
print('skimage:', t_sk)
print('factor:', t_sk / t_cv)

Results:
cv: 0.0429999828339
skimage: 0.0980000495911
factor: 2.27907183056

Which leads me to believe we should use img_as_ubyte and then use LUTs as OpenCV does.

stefanv commented 9 years ago

I guess we lose quite a bit of accuracy that way. Can we not interpolate the integer tables?
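
A rough way to quantify both points on the dominant nonlinearity alone (a sketch using the cube root as a stand-in for the tabulated function; the 256-entry table mirrors an 8-bit LUT):

import numpy as np

# reference: exact cube root on densely sampled inputs in [0, 1]
x = np.linspace(0, 1, 100001)
exact = np.cbrt(x)

# 8-bit style table: the cube root tabulated at 256 equally spaced points
levels = np.arange(256) / 255.0
table = np.cbrt(levels)

# nearest-entry lookup, i.e. what a plain integer LUT gives
nearest = table[np.round(x * 255).astype(np.intp)]
print('max abs error, nearest lookup     :', np.abs(exact - nearest).max())

# the same table with linear interpolation between entries
interp = np.interp(x, levels, table)
print('max abs error, interpolated lookup:', np.abs(exact - interp).max())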

blink1073 commented 9 years ago

I would counter by asking what the true precision is when changing color spaces, but yes, I could use 16 bits instead.

blink1073 commented 9 years ago

I'm going cross-eyed trying to figure out why my images aren't coming back the same as the OpenCV ones; I'm going to table this for now (pun intended).

jni commented 9 years ago

I'm going to table this for now (pun intended).

:+1: !

blink1073 commented 9 years ago

Duh, I was not applying the gamma correction. That is why we don't write C++, kids; it's so hard to follow...

blink1073 commented 9 years ago

Benchmark using the latest gist with an 8-bit LUT:

import time
from skimage import data, color
import cv2

image = data.chelsea()

t = time.time()
lab = color.rgb2lab(image.copy())
print(time.time() - t)

t = time.time()
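# rgb2lab here is the LUT-based implementation from the gist above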
lab2 = rgb2lab(image.copy())
print(time.time() - t)

t = time.time()
lab3 = cv2.cvtColor(image.copy(), cv2.COLOR_BGR2Lab)
print(time.time() - t)

0.0789999961853
0.0230000019073
0.0020000934600

~4x faster, but still an order of magnitude slower than OpenCV. The next steps would be to switch to 16-bit LUTs, verify the result, and then switch to Cython.

juliantaylor commented 9 years ago

the rgb descale in your gist is just:

descale(np.dot(tab[image], C.reshape(3,3, order='F')), LAB_SHIFT)

though it's not faster. It could be worthwhile to test einsum instead of dot.
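
For reference, the two forms are equivalent for an (H, W, 3) array and a 3x3 matrix (a sketch with stand-in arrays; the gist's tab, C, and LAB_SHIFT are not reproduced here):

import numpy as np

rgb = np.random.rand(300, 451, 3)   # stand-in for tab[image]
M = np.random.rand(3, 3)            # stand-in for the 3x3 conversion matrix

out_dot = np.dot(rgb, M)
out_einsum = np.einsum('...i,ij->...j', rgb, M)

print(np.allclose(out_dot, out_einsum))   # True; worth timing both forms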

blink1073 commented 9 years ago

I tried einsum, here are the latest results:

0.0800001621246
0.0169999599457
0.00200009346008
jmetz commented 8 years ago

I'm getting a factor of less than 3x slower than cv2 with your gist code (https://gist.github.com/blink1073/b9a22d7110acad78b625). @blink1073, did some other dependency get improved in the meantime?

0.0774719715118
0.0082700252533
0.00317692756653
blink1073 commented 8 years ago

It could be that newer versions of numpy are faster at finding the einsum.

jmetz commented 8 years ago

I'm using Numpy 1.9.2... so a fairly recent version, but not bleeding edge. Do you think that's close enough to merit adding your einsum version and closing this issue?

blink1073 commented 8 years ago

I ran it again using Numpy 1.10 on Python 2.7 and got the following:

0.0372121334076
0.00720596313477
0.00090503692627

I still think it is not worth complicating the existing code base.

jni commented 8 years ago

I still think it is not worth complicating the existing code base.

@blink1073 can you elaborate on this? Lately I'm getting a bit annoyed at Python performance. I'm ok with staying at say 2x the optimum. But ~10x starts to seem sloppy.

blink1073 commented 8 years ago

I just mean my implementation was not nearly as straightforward as what we currently have. I really do think Cython is the right answer here, I just haven't had a chance to try it.

JDWarner commented 8 years ago

I'm fine with @blink1073's Gist approach. It narrows the gap significantly.

Reality check: is the difference between 0.001 and 0.007 seconds actually going to matter for any real-world use case?

I'm all for enhancing performance, but I also strongly believe in the mantra that we should spend our effort wisely. If this is essentially never going to be a limiting factor in a computation, we should focus our time on riper targets in the package instead of getting bogged down here.

jni commented 8 years ago

@JDWarner I don't know whether it's a bottleneck for anyone but since someone reported this, I presume it was at least a minor bottleneck for at least one person. In the scientific domain we often work with enormous images, or millions of images, where that 7x difference does start to add up.

I agree though that effort should go to the biggest bottlenecks, but we don't have a good way to define them.

a-tsukanov commented 5 years ago

This is a major bottleneck for me. I'm writing an app for color correction with real-time previewing. In practice, instead of real time, I wait for the color conversion: it takes at least 80% of the image-correction time (correction is triggered by mouseReleaseEvent). [screenshot from 2018-11-28 00-14-55] Switching to OpenCV for now.

stefanv commented 5 years ago

The current situation is quite untenable; we have an approximately 100x slowdown compared to OpenCV.

We don't usually perform optimizations at the instruction-set level (which OpenCV does, by using SSE2, etc.), so we won't come out on top. But a factor of 5 is a vast improvement over a factor of 100.

We can probably use lookup tables plus linear interpolation to lose very little accuracy while still supporting floating-point images.
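
A minimal sketch of that idea, using the cube root as a stand-in for the tabulated nonlinearity (the table size of 1024 is an arbitrary accuracy knob, and whether this actually beats a direct np.cbrt call would need measuring):

import numpy as np
from skimage import data, img_as_float

image = img_as_float(data.chelsea())

# tabulate the expensive nonlinearity once
xp = np.linspace(0, 1, 1024)
fp = np.cbrt(xp)

# per-pixel linear interpolation into the table instead of a per-pixel cube root
approx = np.interp(image.ravel(), xp, fp).reshape(image.shape)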

Anyone interested in taking the lead on improving this feature?

jmetz commented 5 years ago

@stefanv - what happened to the idea of incorporating @blink1073 's gist (https://gist.github.com/blink1073/b9a22d7110acad78b625)? I probably can't commit to doing this myself at the moment... not until maybe over the upcoming holiday season. If no-one else takes it on before then I'll work on it then.

stefanv commented 5 years ago

It's an option. I think we would need a detailed write-up in the code of exactly how it functions, where all the magic values come from, etc. and a study to see whether linear interpolation of table values gets us much higher accuracy or not. This is almost on the scale of a GSoC project to do properly, I think, since it will also require a bunch of testing.

Thomas Mansencal (@KelSolaar) may have comments. Thomas, how fast is colour-science in this regard?

KelSolaar commented 5 years ago

Hi @stefanv,

Colour is very barebones and not optimized for speed at all; I would not expect much difference from any classically vectorised functions. Nobody has complained about speed so far on our end, so we have had no incentive to do any kind of optimization in that regard. That does not mean we will not do it, it is actually high on my radar, but I'm hoping that having NumPy run on the GPU down the road will solve that "automagically" at some point. I'm looking eagerly at Bohrium and equivalent initiatives.

Cheers,

Thomas

a-tsukanov commented 5 years ago

I just wonder whether using XYZ as an intermediate space between RGB and Lab slows things down by a factor of two. That's the case in the source:

def rgb2lab(rgb, illuminant="D65", observer="2"):
    return xyz2lab(rgb2xyz(rgb), illuminant, observer)

stefanv commented 5 years ago

Perhaps, but that does not explain the other 10x to 20x slowdown!

RGB to XYZ is a linear transformation, but XYZ to Lab is not, so I suspect the latter is more expensive.
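
For reference, the nonlinear step looks roughly like the following (a sketch of the standard CIE formulation with a D65 white point, not skimage's implementation); the cube root applied per channel is where the pow/cbrt cost comes from:

import numpy as np

def f(t):
    # CIE companding function: cube root above a small threshold,
    # linear ramp below it
    delta = 6.0 / 29.0
    return np.where(t > delta ** 3,
                    np.cbrt(t),
                    t / (3 * delta ** 2) + 4.0 / 29.0)

def xyz_to_lab_sketch(xyz, whitepoint=(0.95047, 1.0, 1.08883)):
    # xyz: (..., 3) array scaled so the white point maps to ~1.0 (D65 shown)
    x, y, z = (xyz[..., i] / whitepoint[i] for i in range(3))
    L = 116.0 * f(y) - 16.0
    a = 500.0 * (f(x) - f(y))
    b = 200.0 * (f(y) - f(z))
    return np.stack([L, a, b], axis=-1)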

See also:

If you want a perceptually uniform space, CAM02UCS is better than Lab, but probably quite a bit more expensive to compute. See @njsmith's library, Colorspacious, at https://colorspacious.readthedocs.io/en/latest/tutorial.html

jmetz commented 5 years ago

Just to point out, I re-ran the gist version, and while it is faster than skimage (v0.14.1), my current opencv is slow.

Timings compared with opencv:
opencv               : 1.0
blink1073's gist     : 0.04939976019070094
Current skimage      : 0.5489196638848151

Current skimage is   : 0.14.1
Current OpenCV is    : 3.4.3

NB: lower is better; this is time relative to how long opencv took, admittedly using very basic timing code (i.e. no timeit).

Did numpy's optimisations in the meantime make it much faster than opencv??

jmetz commented 5 years ago

Also the first line of blink1073's gist is:

    image = img_as_ubyte(image)

which means it's probably not a fair comparison anyway.

jmetz commented 5 years ago

I found that just switching from np.power(..., 1.0/3.0) to np.cbrt(...) doubled the speed of xyz2lab and thereby rgb2lab; it seems like a very simple, easy-to-understand fix for such a performance gain.
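
A sketch of the kind of measurement behind that observation, timing just the operation being swapped rather than xyz2lab itself (timings are machine-dependent):

import timeit

import numpy as np
from skimage import color, data, img_as_float

# XYZ values comparable to what xyz2lab operates on for the test image
xyz = color.rgb2xyz(img_as_float(data.chelsea()))

t_pow = timeit.timeit(lambda: np.power(xyz, 1.0 / 3.0), number=100)
t_cbrt = timeit.timeit(lambda: np.cbrt(xyz), number=100)

print('np.power(xyz, 1/3):', t_pow)
print('np.cbrt(xyz)      :', t_cbrt)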