google / pik

A new lossy/lossless image format for photos and the internet
MIT License
834 stars 51 forks source link

compilation with gcc 6.3 on cpu without avx2 /sse4.2 #47

Closed tudorsuciu closed 5 years ago

tudorsuciu commented 6 years ago

Hello,

I had a lot of trouble finding a version that works on a penryn processor. In the end this commit: 52f2d45cc8e35e45278da54615bb8b11b5066f16 works with this very small patch:

diff --git a/Makefile b/Makefile index 5e210ad..7cebd30 100644 --- a/Makefile +++ b/Makefile -override CXXFLAGS += -std=c++11 -Wall -O3 -fPIC -DSIMD_ENABLE=4 -msse4.2 -maes -I. -I../ -Ithird_party/brotli/c/include/ -Wno-sign-compare +override CXXFLAGS += -std=c++11 -Wall -O3 -fPIC -DSIMD_ENABLE=4 -msse4.1 -I. -I../ -Ithird_party/brotli/c/include/ -Wno-sign-compare

Between the work of dominikhlbg in #8 an this commit, it always compiles(sometimes with small patches) and executes correctly with Intel® Software Development Emulator, like this: ~/Downloads/sde-external-8.16.0-2018-01-30-lin/sde64 -- ./bin/cpik test.png test.pik

After this commit i always get invalid instruction. Could you look what is wrong with next commit? Is a newer gcc needed?

Regards,

jan-wassenberg commented 6 years ago

Hi, thanks for reporting this issue. The next patch is unfortunately very large but convolve.h seems to be the most likely cause. Would you kindly let us know which instruction is faulting (gdb) so we can narrow it down?

tudorsuciu commented 6 years ago

Hello,

Thank you for looking at the issue. I cannot get a proper core file, compilation doesn't work with msse4.1. Intel sde gives this for 98495235a9017b0f5fca7e9963da7f8ed1c2573d (no patches applied): ~/Downloads/sde-external-8.16.0-2018-01-30-lin/sde64 -- ./bin/cpik test3.png out3.pik Compressing with maximum Butteraugli distance 1.000000 Illegal instruction at address = 55e95da9ee9e: 0f 01 f9 48 c1 e2 20 48 09 d0 0f ae e8 48 89 Image name: /home/tudor/pik/bin/cpik Offset in image: 0x31e9e If you believe your application should attempt to execute this illegal instruction (and others that may be present), Then use this knob: -emit-illegal-insts 0 and this error message will be avoided.

If this is not helpful to you, i will try to attach gdb as described in sde documentation to get a proper backtrace(i hope).

Regards,

tudorsuciu commented 6 years ago

I could attach gdb to sde but symbols are not recognized. The backtrace is not usefull. in gdb info function gives: 0x0000000000031190 pik::ButteraugliDistance(pik::Image3 const&, pik::Image3 const&, pik::Image) 0x0000000000031830 pik::ButteraugliDistance(pik::Image3 const&, pik::Image3 const&, pik::Image) 0x0000000000031910 pik::ButteraugliDistance(pik::MetaImage const&, pik::MetaImage const&, pik::Image) 0x0000000000031e90 pik::Zone::~Zone() [clone .constprop.225] 0x00000000000321d0 pik::Image3 pik::Window(pik::Image3, unsigned long, unsigned long, unsigned long, unsigned long) [clone .constprop.219] 0x0000000000032860 pik::CenterOpsinValues(pik::Image3*)

So the invalid instruction happens here: pik::Zone::~Zone() [clone .constprop.225]

I got a 1,6Gb file from sde with all instructions executed. This is the last page: Write (UINT64)0x00007ffe4aca2090 = 0x7ffe4aca3200 INS 0x0000560e4f0d807d BASE nop dword ptr [rax], eax INS 0x0000560e4f0d8080 SSE movaps xmmword ptr [rsp], xmm4 Write (UINT128)0x00007ffe4aca2080 = 00000000_00000000_00000000_00000001 INS 0x0000560e4f0d8084 BASE lea rdi, ptr [rip+0xc6d11] | rdi = 0x560e4f19ed9c INS 0x0000560e4f0d808b BASE call 0x560e4f0d0ea0 | rsp = 0x7ffe4aca2078 Write (UINT64)0x00007ffe4aca2078 = 0x560e4f0d8090 INS 0x0000560e4f0d0ea0 BASE push r15 | rsp = 0x7ffe4aca2070 Write (UINT64)0x00007ffe4aca2070 = 0x50 INS 0x0000560e4f0d0ea2 BASE push r14 | rsp = 0x7ffe4aca2068 Write (UINT64)0x00007ffe4aca2068 = 0x7ffe4aca3670 INS 0x0000560e4f0d0ea4 BASE push r13 | rsp = 0x7ffe4aca2060 Write (UINT64)0x00007ffe4aca2060 = 0x560e4f423ea8 INS 0x0000560e4f0d0ea6 BASE push r12 | rsp = 0x7ffe4aca2058 Write (UINT64)0x00007ffe4aca2058 = 0x7ffe4aca2200 INS 0x0000560e4f0d0ea8 BASE push rbp | rsp = 0x7ffe4aca2050 Write (UINT64)0x00007ffe4aca2050 = 0x560e509a1d40 INS 0x0000560e4f0d0ea9 BASE push rbx | rsp = 0x7ffe4aca2048 Write (UINT64)0x00007ffe4aca2048 = 0x560e509a1880 INS 0x0000560e4f0d0eaa BASE mov rbx, rdi | rbx = 0x560e4f19ed9c INS 0x0000560e4f0d0ead BASE sub rsp, 0x18 | rsp = 0x7ffe4aca2030, rflags = 0x206 Read 0x7f923f7f9080 = (UINT64)0x00007f923f7f9080 INS 0x0000560e4f0d0eb1 BASE mov rax, qword ptr fs:[0x0] | rax = 0x7f923f7f9080 Read 0x560e509a1880 = (UINT64)0x00007f923f7f9078 INS 0x0000560e4f0d0ebd BASE mov rsi, qword ptr [rax-0x8] | rsi = 0x560e509a1880 INS 0x0000560e4f0d0ec4 BASE test rsi, rsi | rflags = 0x202 INS 0x0000560e4f0d0ec7 BASE jz 0x560e4f0d11c0 INS 0x0000560e4f0d0ecd SSE2 lfence INS 0x0000560e4f0d0ed0 BASE rdtsc | rax = 0x37307c1b, rdx = 0x9f7 INS 0x0000560e4f0d0ed2 BASE shl rdx, 0x20 | rdx = 0x9f700000000, rflags = 0x206 INS 0x0000560e4f0d0ed6 BASE or rax, rdx | rax = 0x9f737307c1b, rflags = 0x206 INS 0x0000560e4f0d0ed9 SSE2 lfence Read 0x560e4b19ed4b = (UINT64)0x0000560e509a18e0 INS 0x0000560e4f0d0edc BASE sub rbx, qword ptr [rsi+0x60] | rbx = 0x4000051, rflags = 0x202 INS 0x0000560e4f0d0ee0 BASE mov rbp, 0x1fffffffff | rbp = 0x1fffffffff INS 0x0000560e4f0d0eea BASE and rax, rbp | rax = 0x1737307c1b, rflags = 0x206 INS 0x0000560e4f0d0eed BASE shl rbx, 0x25 | rbx = 0x80000a2000000000, rflags = 0x286 INS 0x0000560e4f0d0ef1 BASE lea rdi, ptr [rbx+rax1] | rdi = 0x80000a3737307c1b Read 0 = (UINT64)0x0000560e509a18c0 INS 0x0000560e4f0d0ef5 BASE mov rax, qword ptr [rsi+0x40] | rax = 0 INS 0x0000560e4f0d0ef9 BASE cmp rax, 0x8 | rflags = 0x293 INS 0x0000560e4f0d0efd BASE jz 0x560e4f0d0f28 INS 0x0000560e4f0d0eff BASE lea rdx, ptr [rax8] | rdx = 0 INS 0x0000560e4f0d0f07 BASE add rax, 0x1 | rax = 0x1, rflags = 0x202 INS 0x0000560e4f0d0f0b BASE mov qword ptr [rsi+rdx1], rdi Write (UINT64)0x0000560e509a1880 = 0x80000a3737307c1b INS 0x0000560e4f0d0f0f BASE mov qword ptr [rsi+0x40], rax Write (UINT64)0x0000560e509a18c0 = 0x1 INS 0x0000560e4f0d0f13 BASE add rsp, 0x18 | rsp = 0x7ffe4aca2048, rflags = 0x206 Read 0x560e509a1880 = (UINT64)0x00007ffe4aca2048 INS 0x0000560e4f0d0f17 BASE pop rbx | rbx = 0x560e509a1880, rsp = 0x7ffe4aca2050 Read 0x560e509a1d40 = (UINT64)0x00007ffe4aca2050 INS 0x0000560e4f0d0f18 BASE pop rbp | rbp = 0x560e509a1d40, rsp = 0x7ffe4aca2058 Read 0x7ffe4aca2200 = (UINT64)0x00007ffe4aca2058 INS 0x0000560e4f0d0f19 BASE pop r12 | r12 = 0x7ffe4aca2200, rsp = 0x7ffe4aca2060 Read 0x560e4f423ea8 = (UINT64)0x00007ffe4aca2060 INS 0x0000560e4f0d0f1b BASE pop r13 | r13 = 0x560e4f423ea8, rsp = 0x7ffe4aca2068 Read 0x7ffe4aca3670 = (UINT64)0x00007ffe4aca2068 INS 0x0000560e4f0d0f1d BASE pop r14 | r14 = 0x7ffe4aca3670, rsp = 0x7ffe4aca2070 Read 0x50 = (UINT64)0x00007ffe4aca2070 INS 0x0000560e4f0d0f1f BASE pop r15 | r15 = 0x50, rsp = 0x7ffe4aca2078 Read 0x560e4f0d8090 = (UINT64)0x00007ffe4aca2078 INS 0x0000560e4f0d0f21 BASE ret | rsp = 0x7ffe4aca2080 INS 0x0000560e4f0d8090 BASE call 0x560e4f0cbe90 | rsp = 0x7ffe4aca2078 Write (UINT64)0x00007ffe4aca2078 = 0x560e4f0d8095 INS 0x0000560e4f0cbe90 BASE push r15 | rsp = 0x7ffe4aca2070 Write (UINT64)0x00007ffe4aca2070 = 0x50 INS 0x0000560e4f0cbe92 BASE push r14 | rsp = 0x7ffe4aca2068 Write (UINT64)0x00007ffe4aca2068 = 0x7ffe4aca3670 INS 0x0000560e4f0cbe94 BASE push r13 | rsp = 0x7ffe4aca2060 Write (UINT64)0x00007ffe4aca2060 = 0x560e4f423ea8 INS 0x0000560e4f0cbe96 BASE push r12 | rsp = 0x7ffe4aca2058 Write (UINT64)0x00007ffe4aca2058 = 0x7ffe4aca2200 INS 0x0000560e4f0cbe98 BASE push rbp | rsp = 0x7ffe4aca2050 Write (UINT64)0x00007ffe4aca2050 = 0x560e509a1d40 INS 0x0000560e4f0cbe99 BASE push rbx | rsp = 0x7ffe4aca2048 Write (UINT64*)0x00007ffe4aca2048 = 0x560e509a1880 INS 0x0000560e4f0cbe9a BASE sub rsp, 0x18 | rsp = 0x7ffe4aca2030, rflags = 0x206 SIG signal=0x4 on thread 0 at address 0x560e4f0cbe9e FATALSIG4

Do this pieces of information help or more is needed?

jan-wassenberg commented 6 years ago

Thank you for this information, it is very helpful and points us straight to the problem and a solution! The faulting instruction is RDTSCP, used for timing especially in profiler.h. I didn't realize this but Penryn indeed lacks that instruction.

A quick fix for the short term would be to change rdtscp to rdtsc in the inline asm in tsc_timer.h. BTW we no longer use the 32-bit version (wraps around too quickly) so I can simplify the interface. I'll also change cpik/dpik to use the OS timer instead. After that, you'll be able to change #if PROFILER_ENABLED in profiler.h to #if PROFILER_ENABLED && 0 and this problem should be resolved.

tudorsuciu commented 6 years ago

With this patch:

diff --git a/tsc_timer.h b/tsc_timer.h index b7783dd..36a2fee 100644 --- a/tsc_timer.h +++ b/tsc_timer.h @@ -125,13 +125,13 @@ inline uint64_t Stop<uint64_t>() { #elif PIK_ARCH_X64 && PIK_COMPILER_MSVC PIK_COMPILER_FENCE; unsigned aux; - t = __rdtscp(&aux); + t = __rdtsc(&aux); SIMD_NAMESPACE::load_fence(); PIK_COMPILER_FENCE; #elif PIK_ARCH_X64 && (PIK_COMPILER_CLANG || PIK_COMPILER_GCC) // Use inline asm because __rdtscp generates code to store TSC_AUX (ecx). asm volatile( - "rdtscp\n\t" + "rdtsc\n\t" "shl $32, %%rdx\n\t" "or %%rdx, %0\n\t" "lfence"

It compiles and executes correctly (at least on small test images). With this code: static inline void native_cpuid(unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx) { /* ecx is often an input as well as an output. */ asm volatile("cpuid" : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx) : "0" (*eax), "2" (*ecx)); } #include <stdio.h> int main(int argc, char **argv) { unsigned eax, ebx, ecx, edx; eax = 1; /* processor info and feature bits */ native_cpuid(&eax, &ebx, &ecx, &edx); printf("stepping %d\n", eax & 0xF); printf("model %d\n", (eax >> 4) & 0xF); printf("family %d\n", (eax >> 8) & 0xF); printf("processor type %d\n", (eax >> 12) & 0x3); printf("extended model %d\n", (eax >> 16) & 0xF); printf("extended family %d\n", (eax >> 20) & 0xFF); eax = 0x80000001; native_cpuid(&eax, &ebx, &ecx, &edx); printf("rdtscp %d\n", (edx >> 27) & 0x1); } I get: stepping 10 model 7 family 6 processor type 0 extended model 1 extended family 0 rdtscp 0

So rdtscp is not supported for penryn processor if this example code is ok. It means that Intel sde does not emulate/simulate correctly this instruction. It does seem to add the flag: ~/Downloads/sde-external-8.16.0-2018-01-30-lin/sde64 -- ./a.out stepping 0 model 6 family 6 processor type 0 extended model 6 extended family 0 rdtscp 1

Would it be possible for you to fix the compilation errors that appear with a patched makefile like this? diff --git a/Makefile b/Makefile index 53ef001..3c58028 100644 --- a/Makefile +++ b/Makefile @@ -1,4 +1,4 @@ -SIMD_FLAGS := -DSIMD_ENABLE=4 -msse4.2 -maes +SIMD_FLAGS := -DSIMD_ENABLE=4 -msse4.1 override CXXFLAGS += -std=c++11 -Wall -O3 -fPIC -I. -I../ -Ithird_party/brotli/c/include/ -Wno-sign-compare override LDFLAGS += -lpthread A solution to compile without sse would be even better for generic debug and to run the code on other architectures (arm64).

tudorsuciu commented 6 years ago

I wanted to see what sse4.2 instruction is used, so i deactivated the exit after "Cannot continue because CPU lacks SSE4 support". There is no instruction used in cpik/dpik that needs sse4.2. The only intrinsic that i found in simd library is: mm_cmpgt_epi64 In my case the problem is fixed, i was able (finally) to check/evaluate the quality of pik on big images. I don't know if the issue should be closed, it depends if there is somebody else impacted.

Regards,

jan-wassenberg commented 6 years ago

Congratulations, happy to hear you got it working! The timers are fixed and will be in the next push. Looks like reducing from SSE4.2 to SSE4.1 would cover about 1.4% more systems according to Steam Hardware survey. That's surprising and possibly worthwhile, I'll add it to the backlog. Thanks for suggesting and investigating this!

jan-wassenberg commented 6 years ago

Hello again, we've fixed this locally (removed cmpgt_epi64 and aes), so hopefully it would work in Penryn after the next push. Thanks again for suggesting and investigating this!