JeremyAnsel opened this issue 5 years ago
I got similar results, though I was much less thorough.
But there is some risk that the lookup results are affected by not having realistic content. To be really certain, it might make sense to compare a source image of all identical values against one initialized with random values, to see the effect of the cache hit rate.
As for the SIMD case, I suspect that I chose the constants poorly: SSE2 has an instruction (_mm_mulhi_epu16) that computes a*b >> 16, which has double the throughput of the a*b that widens from 16 to 32 bit.
Which means that, for example, (((red * 7967 * 2) >> 16) + 1) >> 1 might be significantly faster (unless the compiler already figures that out).
Though I am not sure there is actually a point in optimizing this. I admit I mostly changed the code because MSVC had warnings about the LUT, and in general runtime-initialized global variables always come with the risk of falling into the undefined-initialization-order trap.
Here's an alternative SSE2 version that converts only a single pixel at a time. It is a bit wasteful though, since it really could do 2 pixels for almost the same price as doing 1:
unsigned short convert32to16_sse2(unsigned int color32)
{
    // Widen the four 8-bit channels to 16 bits each.
    __m128i in = _mm_setr_epi32(color32, 0, 0, 0);
    __m128i in16 = _mm_unpacklo_epi8(in, _mm_setzero_si128());
    // (x * c) >> 16 per channel: 7967*2 scales 8->5 bits, 16191*2 scales 8->6 bits.
    __m128i scaled = _mm_mulhi_epu16(in16, _mm_setr_epi16(7967 * 2, 16191 * 2, 7967 * 2, 0, 0, 0, 0, 0));
    // avg with zero computes (v + 1) >> 1, i.e. the rounding half-shift.
    __m128i rounded = _mm_avg_epu16(scaled, _mm_setzero_si128());
    // madd assembles the fields: b*1 + g*32 in lane 0, r*(32*64) in lane 1.
    __m128i shifted = _mm_madd_epi16(rounded, _mm_setr_epi16(1, 32, 32 * 64, 0, 0, 0, 0, 0));
    return _mm_extract_epi16(shifted, 0) | _mm_extract_epi16(shifted, 2);
}
With that function I get:
compare 5 OK
compare 6 OK
bench_lookup: 22.834 ms 4286578691 41943043 4286578688 41943043
bench_compute: 37.7886 ms 4286578691 41943043 4286578688 41943043
bench_compute_simd: 20.393 ms 4286578691 41943043 4286578688 41943043
bench_compute_sse2: 22.9526 ms 4286578691 41943043 4286578688 41943043
bench_colorkey: 13.9993 ms 4286578691 41943043 4286578688 4267704323
bench_colorkey_simd: 10.1302 ms 4286578691 41943043 4286578688 4267704323
bench_colorkey_skip_lookup: 25.5961 ms 4286578691 41943043 4286578688 4267704323
bench_colorkey_skip_compute: 43.5716 ms 4286578691 41943043 4286578688 4267704323
bench_colorkey_skip_sse2: 25.8467 ms 4286578691 41943043 4286578688 4267704323
But the results are not very stable.
EDIT: sorry, forget everything written below; it's buggy. The fixed version is still faster, but not by as much. Unfortunately I am not getting consistent results for bench_compute_simd at all, anything from 15 to 25 ms... EDIT END
Another variant:
unsigned short convert32to16_sse2_double(unsigned int color32, unsigned int color32_2)
{
    // Same scheme as above, but with both low 64-bit halves populated so
    // two pixels are converted at once. Note: as pointed out below, the
    // return type should really be unsigned int, since two 16-bit results
    // are packed into the return value.
    __m128i in = _mm_setr_epi32(color32, color32_2, 0, 0);
    __m128i in16 = _mm_unpacklo_epi8(in, _mm_setzero_si128());
    __m128i scaled = _mm_mulhi_epu16(in16, _mm_setr_epi16(7967 * 2, 16191 * 2, 7967 * 2, 0, 7967 * 2, 16191 * 2, 7967 * 2, 0));
    __m128i rounded = _mm_avg_epu16(scaled, _mm_setzero_si128());
    __m128i shifted = _mm_madd_epi16(rounded, _mm_setr_epi16(1, 32, 32 * 64, 0, 1, 32, 32 * 64, 0));
    return _mm_extract_epi16(shifted, 0) | _mm_extract_epi16(shifted, 2) | ((_mm_extract_epi16(shifted, 4) | _mm_extract_epi16(shifted, 6)) << 16);
}
Which gives a good speedup when it can be used:
compare 5 OK
compare 6 OK
bench_lookup: 22.8618 ms 4286578691 41943043 4286578688 41943043
bench_compute: 37.8927 ms 4286578691 41943043 4286578688 41943043
bench_compute_simd: 24.5239 ms 4286578691 41943043 4286578688 41943043
bench_compute_sse2: 22.9949 ms 4286578691 41943043 4286578688 41943043
bench_compute_sse2_double: 13.0858 ms 4286578691 41943043 4290248704 41943043
bench_colorkey: 13.8535 ms 4286578691 41943043 4290248704 4267704323
bench_colorkey_simd: 10.665 ms 4286578691 41943043 4290248704 4267704323
bench_colorkey_skip_lookup: 25.4881 ms 4286578691 41943043 4286578688 4267704323
bench_colorkey_skip_compute: 43.5247 ms 4286578691 41943043 4286578688 4267704323
bench_colorkey_skip_sse2: 30.657 ms 4286578691 41943043 4286578688 4267704323
I think the only really clear result is that the only thing that works well is writing the whole inner loop in SSE2. But as far as I know that is already done for anything that actually shows up as a hotspot, so in a way the performance of this function should not really matter.
I think the return type of convert32to16_sse2_double should be unsigned int instead of unsigned short.
// Variant 1: rounded scaling via integer division.
unsigned int convertColorB5G6R5toB8G8R8X8_1(unsigned short color16)
{
    unsigned int red = (color16 >> 11) & 0x1f;
    unsigned int green = (color16 >> 5) & 0x3f;
    unsigned int blue = color16 & 0x1f;
    red = (red * (0xFF * 2) + 0x1F) / (0x1F * 2);
    green = (green * (0xFF * 2) + 0x3F) / (0x3F * 2);
    blue = (blue * (0xFF * 2) + 0x1F) / (0x1F * 2);
    unsigned int color32 = (red << 16) | (green << 8) | blue;
    return color32;
}

// Variant 2: the same rounding done with fixed-point multiplies.
unsigned int convertColorB5G6R5toB8G8R8X8_2(unsigned short color16)
{
    unsigned int red = (color16 >> 11) & 0x1f;
    unsigned int green = (color16 >> 5) & 0x3f;
    unsigned int blue = color16 & 0x1f;
    red = (red * 539086 + 32768) >> 16;
    green = (green * 265264 + 32768) >> 16;
    blue = (blue * 539086 + 32768) >> 16;
    unsigned int color32 = (red << 16) | (green << 8) | blue;
    return color32;
}

// Variant 3 (Reimar): bit replication; cheap, but not bit-exact with the above.
unsigned int convertColorB5G6R5toB8G8R8X8_reimar(unsigned short color16)
{
    unsigned red = color16 >> 11;
    red |= red << 5;
    red >>= 2;
    unsigned green = (color16 >> 5) & 0x3f;
    green |= green << 6;
    green >>= 4;
    unsigned blue = color16 & 0x1f;
    blue |= blue << 5;
    blue >>= 2;
    return (red << 16) | (green << 8) | blue;
}
void compare_B5G6R5toB8G8R8X8()
{
    std::cout << "compare B5G6R5 to B8G8R8X8";
    bool error = false;
    for (int i = 0; i < 0x10000; i++)
    {
        unsigned short c = (unsigned short)i;
        unsigned int c1 = convertColorB5G6R5toB8G8R8X8_1(c);
        unsigned int c2 = convertColorB5G6R5toB8G8R8X8_2(c);
        if (c1 != c2)
        {
            error = true;
            break;
        }
    }
    std::cout << (error ? " ERROR" : " OK") << std::endl;
}

void compare_B5G6R5toB8G8R8X8_reimar()
{
    std::cout << "compare B5G6R5 to B8G8R8X8 Reimar";
    bool error = false;
    for (int i = 0; i < 0x10000; i++)
    {
        unsigned short c = (unsigned short)i;
        unsigned int c1 = convertColorB5G6R5toB8G8R8X8_1(c);
        unsigned int c2 = convertColorB5G6R5toB8G8R8X8_reimar(c);
        if (c1 != c2)
        {
            error = true;
            break;
        }
    }
    std::cout << (error ? " ERROR" : " OK") << std::endl;
}
I get:
compare B5G6R5 to B8G8R8X8 OK
compare B5G6R5 to B8G8R8X8 Reimar ERROR
void bench_B5G6R5toB8G8R8X8_lookup()
{
    unsigned int* dst = g_dstArray32;
    unsigned short* src = g_srcArray16;
    int size = g_loopCount;
    for (int i = 0; i < size; i++)
    {
        unsigned int color16 = src[i];
        dst[i] = g_colorConverterTables.B5G6R5toB8G8R8X8[color16];
    }
}

void bench_B5G6R5toB8G8R8X8_compute_1()
{
    unsigned int* dst = g_dstArray32;
    unsigned short* src = g_srcArray16;
    int size = g_loopCount;
    for (int i = 0; i < size; i++)
    {
        dst[i] = convertColorB5G6R5toB8G8R8X8_1(src[i]);
    }
}

void bench_B5G6R5toB8G8R8X8_compute_2()
{
    unsigned int* dst = g_dstArray32;
    unsigned short* src = g_srcArray16;
    int size = g_loopCount;
    for (int i = 0; i < size; i++)
    {
        dst[i] = convertColorB5G6R5toB8G8R8X8_2(src[i]);
    }
}

void bench_B5G6R5toB8G8R8X8_compute_reimar()
{
    unsigned int* dst = g_dstArray32;
    unsigned short* src = g_srcArray16;
    int size = g_loopCount;
    for (int i = 0; i < size; i++)
    {
        dst[i] = convertColorB5G6R5toB8G8R8X8_reimar(src[i]);
    }
}
I get:
bench_B5G6R5toB8G8R8X8_lookup: 15.5249 ms 4267165611 10566150 80301439 3598571874
bench_B5G6R5toB8G8R8X8_compute_1: 76.5755 ms 4267165611 10566150 80301439 3598571874
bench_B5G6R5toB8G8R8X8_compute_2: 15.4619 ms 4267165611 10566150 80301439 3598571874
bench_B5G6R5toB8G8R8X8_compute_reimar: 17.0034 ms 4267165611 10566150 80301439 3447264098
Yes, my replacement for the 565 to 8888 code was not meant to be bit-exact. It is the standard method other programs like FFmpeg etc. use, though. It has a < 1 ULP rounding error, which I didn't consider relevant.

It might also not be very impressive performance-wise in its C form, but it is relatively easy to convert into SIMD, and more specifically into SSE2 SIMD operating on 8 pixels at once. I do admit I have not checked whether the fixed-point multiplication algorithm could give similar SIMD performance, but at least the constants you selected would not allow the 16x16 -> high 16 bit multiplication to be used without dropping the rounding.

I also admit that when it comes to optimization I tend to be stuck in the past and assume CPUs that are much faster at shifts than at multiplies, which is not really true anymore. But the SSE2 code worked for eliminating that function from the profile at least, that much I actually tested :) (The SSE2 code to do colorkeying and conversion is approximately 21 instructions for converting 8 pixels, which even with some loop overhead comes out below 3 instructions per pixel.)

With the 64 kB LUT, I also think that an input array with values always incrementing by 1 shows it in an unrealistically positive light, since that LUT no longer fits in the L1 cache. But I admit that will depend a lot on the exact image content and CPU used.
Hello, I've run a few benchmarks to compare lookup vs compute color conversion.
Here are the results.
I've used this code:
It seems that compute is faster than lookup when SIMD is used, and lookup is faster than compute when SIMD is not used.