Open llvmbot opened 8 years ago
D29841/rL297568 improved v16i8 extraction when it's the only use of the vector; multiple extractions still go via a stack spill.
Without additional info about the code that's actually being run this bug report isn't useful.
http://www.netlab.linkpc.net/download/software/SDK/core/include/gost3411-2012.h
+
struct rusage ru_st, ru_end;
size_t i, cnt = 10;
gost3411_2012_ctx_t ctx;
uint8_t digest[GOST3411_2012_HASH_MAX_SIZE];
uint8_t *data;
size_t data_size = (100 * 1024 * 1024);
data = malloc((data_size + 32));
memset(data, 0xaa, (data_size + 32));
data += 1;
getrusage(RUSAGE_SELF, &ru_st);
for (i = 0; i < cnt; i ++) {
gost3411_2012_init(512, &ctx);
gost3411_2012_update(&ctx, data, data_size);
gost3411_2012_final(&ctx, digest);
}
getrusage(RUSAGE_SELF, &ru_end);
...
But I added a workaround in my code:
(0xff & (_mm_extract_epi16(__xmm, ((__n) >> 1)) >> (8 * ((__n) & 0x01))))
Comment it out or delete it to reproduce the bug.
Looking at this now.
The problem is that we're currently missing pre-SSE41 custom lowering for extractelement i8 - so we end up spilling the vector to stack and loading the byte (which can be particularly costly).
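To illustrate the difference at the source level, here is a minimal sketch of the two strategies for reading byte 5 of a vector using only SSE2. The function names are hypothetical and this is not the LLVM lowering code itself; an optimizing compiler may also turn the first form into something cheaper, so treat it as a picture of the idea rather than a benchmark.

```c
#include <emmintrin.h>  /* SSE2 intrinsics only, no SSE4.1 */
#include <stdint.h>

/* Strategy 1: store the whole vector to memory, then load one byte back.
 * This mirrors the "spill to stack and reload" behaviour described above. */
uint8_t byte5_via_stack(__m128i v)
{
    uint8_t spill[16];
    _mm_storeu_si128((__m128i *)spill, v);
    return spill[5];
}

/* Strategy 2: extract 16-bit lane 2 (which holds bytes 4 and 5) with
 * PEXTRW, then shift the high byte down and mask it. */
uint8_t byte5_via_pextrw(__m128i v)
{
    int word = _mm_extract_epi16(v, 2);
    return (uint8_t)((word >> 8) & 0xff);
}

/* Hypothetical self-check: returns 1 when both strategies yield byte 5
 * of a vector whose byte k holds the value k. */
int byte5_paths_agree(void)
{
    __m128i v = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                              8, 9, 10, 11, 12, 13, 14, 15);
    return byte5_via_stack(v) == 5 && byte5_via_pextrw(v) == 5;
}
```

Both functions return the same value; the second one stays in registers, which is what a pre-SSE4.1 custom lowering would aim for.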
Extended Description
I have some code that heavily uses _mm_extract_epi8(). When I build it with clang 3.8 and -msse2 (without -msse4.1), the program runs very slowly.
To build with GCC and clang 3.4, 3.6, and 3.7 I use this macro:
#ifndef _mm_extract_epi8 /* SSE4.1 required. */
#define _mm_extract_epi8(__xmm, __n) \
    (0xff & (_mm_extract_epi16(__xmm, ((__n) >> 1)) >> (8 * ((__n) & 0x01))))
#endif
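The index arithmetic in the macro can be checked without any intrinsics. The following is a scalar sketch, assuming a little-endian layout in which 16-bit lane k holds bytes 2k (low half) and 2k+1 (high half); `extract_byte_model` and `fill_identity` are hypothetical helper names, and the `words` array only stands in for the eight 16-bit lanes of an XMM register.

```c
#include <stdint.h>

/* Scalar model of the macro: pick the 16-bit lane that contains byte n,
 * shift the wanted byte into the low 8 bits, and mask off the rest. */
uint8_t extract_byte_model(const uint16_t words[8], unsigned n)
{
    unsigned lane  = n >> 1;           /* two bytes per 16-bit lane   */
    unsigned shift = 8u * (n & 0x01u); /* 0 for even n, 8 for odd n   */
    return (uint8_t)(0xffu & (words[lane] >> shift));
}

/* Fill the lanes so that byte k of the modelled register holds k. */
void fill_identity(uint16_t words[8])
{
    for (unsigned k = 0; k < 8; k++)
        words[k] = (uint16_t)(((2u * k + 1u) << 8) | (2u * k));
}
```

With the lanes filled by `fill_identity`, `extract_byte_model(words, n)` returns n for every n from 0 to 15, which matches what _mm_extract_epi8 is expected to return for the same data.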
Test results:
AMD Athlon(tm) 5350 APU with Radeon(tm) R3 (2050.04-MHz K8-class CPU)
GCC: 20391006000 (SSE4.1) / 20116413000 (SSE2)
clang 3.8: 22329895000 (SSE4.1) / 117304135000 (SSE2) !!!
clang 3.7: 22367008000 (SSE4.1) / 25542571000 (SSE2)
clang 3.6: 22306648000 (SSE4.1) / 25914115000 (SSE2)
clang 3.4: 23684031000 (SSE4.1) / 25914115000 (SSE2)

Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz (2999.72-MHz K8-class CPU)
GCC: 12031595000 (SSE4.1) / 12011303000 (SSE2)
clang 3.8: 12431116000 (SSE4.1) / 73035466000 (SSE2) !!!
clang 3.7: 12458839000 (SSE4.1) / 13317058000 (SSE2)
clang 3.6: 12462181000 (SSE4.1) / 14119683000 (SSE2)
clang 3.4: 13555167000 (SSE4.1) / 13178893000 (SSE2)