gnuradio / volk

The Vector Optimized Library of Kernels
http://libvolk.org
GNU Lesser General Public License v3.0

SSE>2 not used when lacking AVX? #562

Closed jmfriedt closed 2 years ago

jmfriedt commented 2 years ago

I have seen that list_cpu_features has been updated to detect SSE>2 SIMD extensions even when the processor lacks AVX support. The current issue concerns a single-board computer [1] built around an Intel(R) Celeron(R) CPU J1900, as reported by

$ ./cpu_features/list_cpu_features
arch            : x86
brand           :       Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
family          :   6 (0x06)
model           :  55 (0x37)
stepping        :   9 (0x09)
uarch           : INTEL_ATOM_SMT
flags           : clfsh,cx16,cx8,erms,fpu,mmx,movbe,pclmulqdq,popcnt,rdrnd,ss,sse,sse2,sse4_1,sse4_2,ssse3,tsc

Manually compiling VOLK on this platform and running volk_profile leads to e.g.

RUN_VOLK_TESTS: volk_32u_popcntpuppet_32u(131071,1987)
no architectures to test

which sounds suspicious since kernels/volk/volk_32u_popcntpuppet_32u.h indicates

#ifdef LV_HAVE_SSE4_2
static inline void volk_32u_popcntpuppet_32u_a_sse4_2(uint32_t* outVector,
...
#endif

Indeed, on a desktop computer fitted with an AVX-capable CPU, we observe that

RUN_VOLK_TESTS: volk_32u_popcntpuppet_32u(131071,1987)
generic completed in 550.131 ms
a_sse4_2 completed in 117.743 ms
Best aligned arch: a_sse4_2

Could it be that on an AVX-less CPU (in this case Atom on an embedded board) the SSE4 support is correctly detected but not used in volk_profile method selection?

The function is present in the compiled shared library

$ strings ./lib/libvolk.so.2.5.1 | grep se4 | grep pet_32
volk_32fc_s32fc_rotatorpuppet_32fc_u_sse4_1
volk_32fc_s32fc_rotatorpuppet_32fc_a_sse4_1
volk_32u_popcntpuppet_32u_a_sse4_2

but does not seem to be called when selecting the optimum method.

Thanks.

[1] http://advdownload.advantech.com/productfile/Downloadfile1/1-T36L2E/MIO-5251_USER_MANUAL_ED-1_FINAL.PDF (Celeron version)

jdemel commented 2 years ago

This is VOLK's interaction with cpu_features:

static int i_can_has_sse4_1 (void) {
#if defined(CPU_FEATURES_ARCH_X86)
    if (GetX86Info().features.sse4_1 == 0){ return 0; }
#endif
    return 1;
}

I would expect that the output of list_cpu_features contains: sse,sse2,sse3,sse4_1,sse4_2,ssse3 but in your example, it lacks sse3 (not to be confused with ssse3).

What's volk-config-info --avail-machines and --all-machines reporting? If VOLK determines SSE4 to be available, it should look like this:

generic;sse2_64_mmx;sse3_64_mmx;ssse3_64_mmx;sse4_1_64_mmx;sse4_2_64_mmx;

Otherwise, it might stop at sse2_*.

What's the result of cat /proc/cpuinfo | grep flags?

jmfriedt commented 2 years ago

Thank you for your reply. Here are the outputs

$ cat /proc/cpuinfo | grep flags
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer rdrand lahf_lm 3dnowprefetch epb pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat
vmx flags       : vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest           

Seeing sse4_1 and sse4_2 in this list made me believe these SIMD instructions would be supported. From VOLK:

$ ./apps/volk-config-info  --all-machines
generic;sse2_64_mmx;sse3_64_mmx;ssse3_64_mmx;sse4_a_64_mmx;sse4_1_64_mmx;sse4_2_64_mmx;avx_64_mmx;avx2_64_mmx;avx512f_64_mmx;avx512cd_64_mmx
$ ./apps/volk-config-info  --avail-machines
generic;sse2_64_mmx;

This CPU is advertised as supporting all SSE if I am to believe https://www.cpu-world.com/CPUs/Celeron/Intel-Celeron%20J1900.html

jdemel commented 2 years ago

It is quite surprising that ssse3 is reported as available while sse3 is reported as missing. VOLK requires SSE3 to be available before it considers SSE4. Since SSE3 is reported missing, all kernels beyond SSE2 are treated as unavailable. This is a very interesting peculiarity of that particular CPU. There might be an issue somewhere.
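To illustrate why one missing flag hides every later machine, here is a minimal sketch in C. It assumes each machine requires every feature of the machines before it, which matches the symptom seen here but is a simplification of how VOLK actually defines its machines. The struct `simd_features` and the helpers `count_available_machines` and `best_machine` are hypothetical names, not VOLK API. The machine names are those quoted earlier in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature record; field names are illustrative only. */
struct simd_features {
    bool sse2, sse3, ssse3, sse4_1, sse4_2;
};

/* Machine names as quoted in this thread, in increasing order of requirements. */
static const char *const machines[] = {
    "generic",      "sse2_64_mmx",   "sse3_64_mmx",
    "ssse3_64_mmx", "sse4_1_64_mmx", "sse4_2_64_mmx",
};

/* Count usable machines, assuming (simplification) that each machine needs
 * all features of the previous ones: the walk stops at the first gap. */
static size_t count_available_machines(const struct simd_features *f)
{
    const bool present[] = { true,     f->sse2,   f->sse3,
                             f->ssse3, f->sse4_1, f->sse4_2 };
    const size_t total = sizeof present / sizeof present[0];
    size_t n = 0;

    while (n < total && present[n]) {
        n++;
    }
    return n;
}

/* Name of the most capable usable machine; "generic" is always usable. */
static const char *best_machine(const struct simd_features *f)
{
    return machines[count_available_machines(f) - 1];
}
```

With sse3 reported as absent and all other fields set, this sketch stops at sse2_64_mmx, which is consistent with the --avail-machines output above.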

jmfriedt commented 2 years ago

/proc/cpuinfo does not list sse3 under that name; it reports SSE3 as pni (https://bugzilla.redhat.com/show_bug.cgi?id=491817). inxi, by contrast, lists sse3:

$ inxi -Fza | grep sse
  Flags: ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx

I fear this detection https://github.com/google/cpu_features/blob/main/src/impl_x86_linux_or_android.c#L47 is incorrect: it should look for pni rather than sse3 in /proc/cpuinfo, as explained above. A PR has been submitted accordingly to https://github.com/google/cpu_features.
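As a rough illustration of the naming problem, the following C sketch checks a /proc/cpuinfo flags line for SSE3 under either spelling. It is not the cpu_features code; `has_flag` and `cpuinfo_has_sse3` are hypothetical helpers written for this example. Matching whole tokens matters because a plain substring search for sse3 would also match ssse3.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `flag` occurs in `line` as a whole whitespace-separated
 * token, so that "sse3" does not match inside "ssse3". */
static bool has_flag(const char *line, const char *flag)
{
    const size_t len = strlen(flag);
    const char *p = line;

    if (len == 0) {
        return false;
    }
    while ((p = strstr(p, flag)) != NULL) {
        const bool starts_token = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        const bool ends_token = p[len] == '\0' || p[len] == ' ' ||
                                p[len] == '\t' || p[len] == '\n';
        if (starts_token && ends_token) {
            return true;
        }
        p++; /* keep searching after a partial match */
    }
    return false;
}

/* Hypothetical helper: Linux reports SSE3 as "pni"; accept "sse3" as well
 * in case some other source uses that spelling. */
static bool cpuinfo_has_sse3(const char *flags_line)
{
    return has_flag(flags_line, "pni") || has_flag(flags_line, "sse3");
}
```

Applied to the flags line posted above, which contains pni and ssse3 but no sse3 token, `cpuinfo_has_sse3` returns true while `has_flag(line, "sse3")` alone returns false.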

jmfriedt commented 2 years ago

Fixed in cpu_features with https://github.com/google/cpu_features/commit/40e1c7158ddfbdae477751948750e0121aba55a1

jdemel commented 2 years ago

Thanks for the info! I guess all we can do is update the submodule pointer and point people to this issue. After all, I expect others to hit the same issue with a system-installed cpu_features.