llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[SSE2] poor performance with emulated _mm_extract_epi8 #28111

Open llvmbot opened 8 years ago

llvmbot commented 8 years ago
Bugzilla Link 27737
Version 3.8
OS All
Reporter LLVM Bugzilla Contributor
CC @d0k,@RKSimon,@rotateright

Extended Description

I have some code that makes heavy use of _mm_extract_epi8(). When I build it with clang 3.8 and -msse2 (without -msse4.1), the program runs very slowly.

To build with GCC and clang 3.4, 3.6, and 3.7 I use this macro:

#ifndef _mm_extract_epi8 /* SSE4.1 required. */
#define _mm_extract_epi8(__xmm, __n) \
    ((_mm_extract_epi16(__xmm, ((__n) >> 1)) >> (8 * ((__n) & 1))) & 0xff)
#endif
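The emulation above can be sanity-checked against a plain byte read from memory. The following is a minimal sketch, not the reporter's code: `EXTRACT_EPI8_SSE2`, `CHECK_LANE`, and `count_extract_mismatches` are hypothetical names, the macro is renamed so it cannot clash with the real `_mm_extract_epi8`, and an x86 target with SSE2 is assumed.

```c
#include <emmintrin.h> /* SSE2 only: _mm_loadu_si128, _mm_extract_epi16 */
#include <stdint.h>

/* Hypothetical name; same expression as the macro in the report.
 * n must be a compile-time constant: _mm_extract_epi16 takes an immediate. */
#define EXTRACT_EPI8_SSE2(xmm, n) \
    ((_mm_extract_epi16((xmm), ((n) >> 1)) >> (8 * ((n) & 1))) & 0xff)

/* Evaluates to 1 if lane n disagrees with the byte stored in memory. */
#define CHECK_LANE(v, bytes, n) \
    ((unsigned)EXTRACT_EPI8_SSE2((v), (n)) != (unsigned)(bytes)[(n)])

/* Returns the number of lanes where the emulation and memory disagree. */
int count_extract_mismatches(void)
{
    uint8_t bytes[16];
    for (int i = 0; i < 16; i++)
        bytes[i] = (uint8_t)(0x80 + 7 * i); /* values >= 0x80 exercise the mask */

    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    int bad = 0;
    bad += CHECK_LANE(v, bytes, 0);  bad += CHECK_LANE(v, bytes, 1);
    bad += CHECK_LANE(v, bytes, 2);  bad += CHECK_LANE(v, bytes, 3);
    bad += CHECK_LANE(v, bytes, 4);  bad += CHECK_LANE(v, bytes, 5);
    bad += CHECK_LANE(v, bytes, 6);  bad += CHECK_LANE(v, bytes, 7);
    bad += CHECK_LANE(v, bytes, 8);  bad += CHECK_LANE(v, bytes, 9);
    bad += CHECK_LANE(v, bytes, 10); bad += CHECK_LANE(v, bytes, 11);
    bad += CHECK_LANE(v, bytes, 12); bad += CHECK_LANE(v, bytes, 13);
    bad += CHECK_LANE(v, bytes, 14); bad += CHECK_LANE(v, bytes, 15);
    return bad;
}
```

The lanes are spelled out one by one because the index has to be an immediate; a runtime loop variable would not compile with this macro.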

Test results:

AMD Athlon(tm) 5350 APU with Radeon(tm) R3 (2050.04-MHz K8-class CPU)
GCC: 20391006000 (SSE4.1) / 20116413000 (SSE2)
clang 3.8: 22329895000 (SSE4.1) / 117304135000 (SSE2) !!!
clang 3.7: 22367008000 (SSE4.1) / 25542571000 (SSE2)
clang 3.6: 22306648000 (SSE4.1) / 25914115000 (SSE2)
clang 3.4: 23684031000 (SSE4.1) / 25914115000 (SSE2)

Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (2999.72-MHz K8-class CPU)
GCC: 12031595000 (SSE4.1) / 12011303000 (SSE2)
clang 3.8: 12431116000 (SSE4.1) / 73035466000 (SSE2) !!!
clang 3.7: 12458839000 (SSE4.1) / 13317058000 (SSE2)
clang 3.6: 12462181000 (SSE4.1) / 14119683000 (SSE2)
clang 3.4: 13555167000 (SSE4.1) / 13178893000 (SSE2)

RKSimon commented 7 years ago

D29841/rL297568 improved v16i8 extraction if it's the only use of the vector - multiple extractions still go via a stack spill.

llvmbot commented 8 years ago

> Without additional info about the code that's actually being run this bug report isn't useful.

http://www.netlab.linkpc.net/download/software/SDK/core/include/gost3411-2012.h +

struct rusage ru_st, ru_end;
size_t i, cnt = 10;
gost3411_2012_ctx_t ctx;
uint8_t digest[GOST3411_2012_HASH_MAX_SIZE];
uint8_t *data;
size_t data_size = (100 * 1024 * 1024);

data = malloc((data_size + 32));
memset(data, 0xaa, (data_size + 32));
data += 1;

getrusage(RUSAGE_SELF, &ru_st);
for (i = 0; i < cnt; i ++) {
    gost3411_2012_init(512, &ctx);
    gost3411_2012_update(&ctx, data, data_size);
    gost3411_2012_final(&ctx, digest);
}
getrusage(RUSAGE_SELF, &ru_end);

...
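The snippet above does not show how the reported figures were derived from `ru_st` and `ru_end`, and their unit is not stated. As a rough sketch, assuming the measurement is based on user CPU time (`ru_utime`), the difference between the two snapshots could be computed like this; `rusage_user_usec` is a hypothetical helper, not part of the reporter's code.

```c
#include <stdint.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper: user CPU time elapsed between two getrusage()
 * snapshots, in microseconds. */
int64_t rusage_user_usec(const struct rusage *start, const struct rusage *end)
{
    int64_t sec  = (int64_t)end->ru_utime.tv_sec  - (int64_t)start->ru_utime.tv_sec;
    int64_t usec = (int64_t)end->ru_utime.tv_usec - (int64_t)start->ru_utime.tv_usec;
    /* usec may be negative when the microsecond field wrapped; the sum is still correct. */
    return sec * 1000000 + usec;
}
```

With the variables from the snippet, this would be called as `rusage_user_usec(&ru_st, &ru_end)` after the loop.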

But I made a fix in my code:

#ifndef __SSE4_1__ /* SSE4.1 required. */
#undef _mm_extract_epi8 /* Clang 3.8 performance fix. */
#define _mm_extract_epi8(__xmm, __n) \
    (0xff & (_mm_extract_epi16(__xmm, ((__n) >> 1)) >> (8 * ((__n) & 0x01))))
#endif

Comment it out or delete it to reproduce the bug.

RKSimon commented 8 years ago

Looking at this now.

The problem is that we're currently missing pre-SSE41 custom lowering for extractelement i8 - so we end up spilling the vector to stack and loading the byte (which can be particularly costly).
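In source terms, the two strategies described above look roughly like the following. This is an illustrative sketch only: the function names are hypothetical, byte 3 is an arbitrary choice, and the C code models the shape of each approach rather than reproducing what the compiler actually emits. An x86 target with SSE2 is assumed.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Models the slow path: store the whole vector to memory, then load one byte. */
uint8_t extract_byte3_via_memory(__m128i v)
{
    uint8_t tmp[16];
    _mm_storeu_si128((__m128i *)tmp, v);
    return tmp[3];
}

/* Models a register-only path: extract the 16-bit word that contains the
 * byte (word 1 holds bytes 2 and 3), then shift the high byte down. */
uint8_t extract_byte3_via_word(__m128i v)
{
    return (uint8_t)(_mm_extract_epi16(v, 1) >> 8);
}

/* Returns 1 if both paths yield the expected byte for a known input. */
int byte3_paths_agree(void)
{
    uint8_t bytes[16];
    for (int i = 0; i < 16; i++)
        bytes[i] = (uint8_t)(0xF0 + i);
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    return extract_byte3_via_memory(v) == 0xF3
        && extract_byte3_via_word(v) == 0xF3;
}
```

An optimizing compiler is free to turn either function into different instructions, so this only shows that the two approaches compute the same value, not how fast each one is.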

d0k commented 8 years ago

Without additional info about the code that's actually being run this bug report isn't useful.