ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

Build failure with PSM3 #6622

Closed jsquyres closed 3 years ago

jsquyres commented 3 years ago

On master/HEAD with gcc 8.2.0:

$ make
make  all-am
make[1]: Entering directory `/home/jsquyres/git/libfabric'
  CC       prov/psm3/psm3/opa/libopa_la-opa_dwordcpy-x86_64.lo
/tmp/ccaiZl9L.s: Assembler messages:
/tmp/ccaiZl9L.s:267: Error: no such instruction: `vinserti128 $0x1,16(%rsi),%ymm3,%ymm1'
/tmp/ccaiZl9L.s:268: Error: no such instruction: `vinserti128 $0x1,48(%rsi),%ymm4,%ymm0'
make[1]: *** [prov/psm3/psm3/opa/libopa_la-opa_dwordcpy-x86_64.lo] Error 1
make[1]: Leaving directory `/home/jsquyres/git/libfabric'
make: *** [all] Error 2

Here's the output from configure:

configure: *** Configuring psm3 provider
checking sys/mman.h usability... yes
checking sys/mman.h presence... yes
checking for sys/mman.h... yes
looking for library without search path
checking for shm_open in -lrt... yes
checking numa.h usability... yes
checking numa.h presence... yes
checking for numa.h... yes
looking for library without search path
checking for numa_node_of_cpu in -lnuma... yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
looking for library without search path
checking for ibv_get_device_list in -libverbs... yes
checking for -msse4.2 support... yes
checking for -mavx support... yes
checking rdma/rv_user_ioctls.h usability... no
checking rdma/rv_user_ioctls.h presence... no
checking for rdma/rv_user_ioctls.h... no
checking rv/rv_user_ioctls.h usability... no
checking rv/rv_user_ioctls.h presence... no
checking for rv/rv_user_ioctls.h... no
configure: psm3 provider: include in libfabric

Let me know if you need additional information.

acgoldma commented 3 years ago

Can you give more info about the build system (OS, kernel, CPU, and whether gcc is pre-built or a custom build)?

I know RHEL 8.1 has gcc 8.3.1 which will build this just fine.

vinserti128 is an AVX2 instruction. Could this be another symptom of #6620?

jsquyres commented 3 years ago

Probably the most relevant fact is that this system is RHEL 6. I was using a manually-installed gcc 8.2.0 (which obviously supports flags like -mavx2 and friends), but I also got the exact same behavior from a manually-installed gcc 10.2.0 on the same system. So perhaps it has something to do with the RHEL 6 assembler or linker...? 🤷‍♂️

Open MPI just recently started shipping AVX support, and I recall that there were some similar issues. I don't recall all the details, but you might want to start poking around:

I know there was some hooplah around AVX compilation issues, particularly with downstream Open MPI packagers (leading to linker errors and the like); I'm afraid I only did some light testing of the fixes and wasn't deeply involved in figuring out what the fixes should be. Plus, it was a few months ago, and it's fallen out of my brain cache... ☹️

acgoldma commented 3 years ago

From what I can tell by comparing the configure check code, my patch in #6620 should fix this. I do not have an old enough machine to test this: all the machines I have available support at least avx2. Could you test this?

jsquyres commented 3 years ago

Sorry, #6620 does not fix the issue. See https://github.com/ofiwg/libfabric/pull/6620#issuecomment-796878858.

acgoldma commented 3 years ago

Your configuration seems really odd. What CPU do you have (does it support avx2)?

I believe RHEL 6 itself does not support avx2. It is very old and is now out of support and into its extended lifetime support. Red Hat also has its own gcc compilers/linkers for this distro. To my knowledge, using a custom compiler on top of an old distro is unsupported by Red Hat. This invites issues where the compiler supports a feature the CPU does not, and the compiler tries to compile for a CPU/feature it does not support.

Can you give more info about the build system?

 # uname -r
 # lscpu | grep -i flags

If you don't have lscpu for some reason the following should work.

# grep flags /proc/cpuinfo | sed -n 1p

jsquyres commented 3 years ago

$ uname -r
2.6.32-431.20.3.el6.x86_64

$ lscpu | grep -i flags
$

Here's the full /proc/cpuinfo for a single Linux virtual processor, including the flags:

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
stepping        : 7
cpu MHz         : 2900.153
cache size      : 20480 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 47
initial apicid  : 47
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5799.18
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
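
The flags line above lists `avx` but not `avx2`, which matches the assembler complaint about `vinserti128`. As a minimal sketch of checking this programmatically, the helper below looks for a flag as a whole space-separated token, so that `avx` does not accidentally match `avx2`. The function name `has_cpu_flag` and the shortened `sample_flags` string (an excerpt of the flags shown above) are illustrative only and are not part of libfabric.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Excerpt of the flags line quoted above (shortened for illustration). */
static const char sample_flags[] =
    "fpu vme de pse tsc msr sse sse2 ssse3 sse4_1 sse4_2 "
    "x2apic popcnt aes xsave avx lahf_lm";

/* Hypothetical helper: return true if `flag` appears in `flags` as a
 * complete whitespace-delimited token (so "avx" does not match "avx2"). */
static bool has_cpu_flag(const char *flags, const char *flag)
{
    assert(flags != NULL && flag != NULL);

    size_t len = strlen(flag);
    if (len == 0)
        return false;

    const char *p = flags;
    while ((p = strstr(p, flag)) != NULL) {
        /* The match must start at the beginning or after whitespace... */
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        /* ...and must end at the end of the string or before whitespace. */
        bool ends = p[len] == '\0' || p[len] == ' ' ||
                    p[len] == '\t' || p[len] == '\n';
        if (starts && ends)
            return true;
        p += 1; /* keep scanning after a partial match */
    }
    return false;
}
```

For the excerpt above, `has_cpu_flag(sample_flags, "avx")` is true while `has_cpu_flag(sample_flags, "avx2")` is false. Note that this only describes the CPU; the build failure in this issue comes from the assembler, which a CPU flag check cannot detect.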

jsquyres commented 3 years ago

I believe RHEL 6 itself does not support avx2. It is very old and is now out of support and into its extended lifetime support. Red Hat also has its own gcc compilers/linkers for this distro.

I think it would be fine to fully disable AVX (or even a single provider) on platforms where it is not supported. That should not disqualify building the rest of Libfabric.

I pointed you to the Open MPI AVX work as evidence of and an example of an AVX-enabled software package that can handle platforms like mine and still compile / install properly. My system ended up being fairly symptomatic of other downstream Open MPI packagers (i.e., we got bug reports from several of them about AVX compiler/linker issues; I don't know if the issue was exactly the same as my platform, but, FWIW, resolving their issues also resolved the issues on my platform).

To my knowledge, using a custom compiler on top of an old distro is unsupported by Red Hat. This invites issues where the compiler supports a feature the CPU does not, and the compiler tries to compile for a CPU/feature it does not support.

Two common mantras in the HPC world are:

  1. Use Linux distro inbox everything.
  2. Except for the compiler, network stack, and MPI implementation.

Meaning: I don't think it's that uncommon to have a Spack/Easybuild/whatever-installed recent version of gcc.

acgoldma commented 3 years ago

Okay, I think I have a working patch. I need to use AC_RUN_IFELSE to check that it compiles and runs; if not, it will disable psm3.

acgoldma commented 3 years ago

Okay, I think I have a working patch. I need to use AC_RUN_IFELSE to check that it compiles and runs; if not, it will disable psm3.

Hmm, it seems to generate a core file, not sure if I can disable that or work something else out.

Ideally, we just want PSM3 to only build on systems that support AVX or higher. I want to avoid wrapping the code in checks and just disable the provider.

jsquyres commented 3 years ago

You might want to use AC_LINK_IFELSE instead. AC_RUN_IFELSE is hostile to cross-compiling.

acgoldma commented 3 years ago

AC_LINK_IFELSE only checks that the linker/compiler can compile/link, which is what #6620 already has.

This only checks that the compiler can generate the code, which it should be able to do as it is a newer GCC, but I am not sure how to test whether the CPU supports it.

Hmm, it looks like your errors are from the assembler "as", which is part of binutils, not gcc. I wonder if there is a way to detect the version from that as well.

jsquyres commented 3 years ago

#6620 fails during the build/link phase of libfabric, not at run time.

acgoldma commented 3 years ago

#6620 fails during the build/link phase of libfabric, not at run time.

I have tested on my machine, which supports up to avx2, that I can compile for avx512 just fine (I searched the objects and found avx512 instructions). The result is expected to fail at run time if it hits an instruction from the newer set. I think there is some issue with your compiler/assembler combination.

Either way, that does not fix the issue of disabling psm3 when avx2 is not supported by the CPU (as opposed to the compiler), but your issue seems unrelated, as it occurs at compile time.

I even wrote up a simple program using avx512 instructions and proved that it can compile just fine. (will segfault on run, but that is expected).

Maybe you could provide a bit more info, like a snippet from the config.log where psm3 checks for avx/etc. as well as verbose make output: make V=1?

When you tested the patch in #6620, did you make sure to also patch the Makefile.include to remove the -mavx2 from the _psm3_cflags?

jsquyres commented 3 years ago

My point is that even if the compiler supports -mavx*, the assembler/linker/whatever may not support it, and AC_LINK_IFELSE will tell you that. E.g., if you find that the compiler supports -mavx2 but then fails the AC_LINK_IFELSE, you can still disable AVX2 support (which may or may not entail wholly disabling PSM3; that's your call).

As for the run-time detection, you can check that, too -- have a look at Open MPI's run-time probing code. I.e., Open MPI compiles for as high a level of AVX as it can based on what it finds via configure, and then at run time, it probes to see the highest level of AVX support that it can find a) in the compiled code, and b) on the running machine, and uses that. This was necessary because downstream packagers tend to compile Open MPI on modern, powerful machines with all features enabled, but users sometimes take those binary packages and run on less powerful / less feature-full machines. Put simply: it is not good to assume that the build machine is the same as the run machine.
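
The idea of using the lower of "what was compiled in" and "what the running CPU offers" can be sketched as follows. This is an illustration under stated assumptions, not Open MPI's or libfabric's actual code: the `BUILD_HAS_AVX` and `BUILD_HAS_AVX2` macros are hypothetical stand-ins for whatever a configure script would define, and the function names are invented. The CPU probe relies on the GCC/Clang built-ins `__builtin_cpu_init` and `__builtin_cpu_supports`, which are only available on x86 targets, so other platforms fall back to the generic level.

```c
#include <assert.h>

/* Hypothetical macros: a configure script would define these to 1 only
 * when the whole toolchain (compiler and assembler) passed its test. */
#ifndef BUILD_HAS_AVX
#define BUILD_HAS_AVX 0
#endif
#ifndef BUILD_HAS_AVX2
#define BUILD_HAS_AVX2 0
#endif

enum simd_level { SIMD_NONE = 0, SIMD_AVX = 1, SIMD_AVX2 = 2 };

/* Highest level for which code paths were compiled into this binary. */
static enum simd_level build_simd_level(void)
{
    if (BUILD_HAS_AVX2)
        return SIMD_AVX2;
    if (BUILD_HAS_AVX)
        return SIMD_AVX;
    return SIMD_NONE;
}

/* Highest level reported by the CPU this process is running on. */
static enum simd_level cpu_simd_level(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return SIMD_AVX2;
    if (__builtin_cpu_supports("avx"))
        return SIMD_AVX;
#endif
    return SIMD_NONE;
}

/* The level that is safe to dispatch to: the minimum of the two. */
static enum simd_level usable_simd_level(void)
{
    enum simd_level built = build_simd_level();
    enum simd_level cpu = cpu_simd_level();
    enum simd_level result = built < cpu ? built : cpu;

    assert(result <= built && result <= cpu);
    return result;
}
```

A caller would evaluate `usable_simd_level()` once at start-up and select a generic, AVX, or AVX2 code path accordingly. With both build macros left at 0, as here, the result is always `SIMD_NONE` regardless of the CPU.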

When you tested the patch in #6620, did you make sure to also patch the Makefile.include to remove the -mavx2 from the _psm3_cflags?

No, I tested #6620 as-is. Shouldn't #6620 have handled everything such that the user wouldn't need to edit Makefile.include?

shefty commented 3 years ago

It is very common for libfabric to run on a system that is not the build system. It's why we have to provide configure options to manually disable packages.

acgoldma commented 3 years ago

https://github.com/ofiwg/libfabric/pull/6620/commits/dc56a2f06d6a4ba55d48ff50968066fbb14e41a3

#6620 does use AC_LINK_IFELSE

No, I tested #6620 as-is. Shouldn't #6620 have handled everything such that the user wouldn't need to edit Makefile.include?

Sorry, I meant to ask whether you included both parts of the patch when you tested it. If you checked out the patch and built it as is, then it should have worked.

jsquyres commented 3 years ago

Sorry, I meant to ask whether you included both parts of the patch when you tested it. If you checked out the patch and built it as is, then it should have worked.

I used hub to check out your PR. I tested git hash dc56a2f06, which appears to still be the head commit on #6620.

Here's the head of my tree:

* dc56a2f06 (HEAD -> psm3-mavx2-check) prov/psm3: Add check for avx2 to configure
*   dadc382c9 (origin/master, origin/HEAD, master) Merge pull request #6617 from acgold
|\  
| * 7bb18d567 prov/psm3: Use AR variable and do not call ar directly
* |   de1b4e8b7 Merge pull request #6616 from acgoldma/psm3-pack_suffix
|\ \  
| |/  
|/|   
| * 8967633e7 prov/psm3: Add PACK_SUFFIX define to missing header
* |   0e4566634 Merge pull request #6613 from shefty/master
|\ \  
| |/  
|/|   
| * e97d29b94 v1.13.0a1
|/  
* b5c35d115 (tag: v1.12.0, origin/v1.12.x) v1.12.0 release

acgoldma commented 3 years ago

@jsquyres I do not have access to an older system to replicate your environment. Can you check my latest patch when you have time? I took your advice, added the strip optflags function, and used it the way Open MPI does.