hpc/ior: IOR and mdtest

Build issues for GPU Direct Storage (GDS) enabled IOR #486

Open casparvl opened 3 months ago

casparvl commented 3 months ago

I've been trying to build an IOR with GPU Direct Storage support and have been running into some issues. I believe there are some mistakes in how this has been implemented in the build system:

Expected behavior

From the configure --help:

...
  --with-gpfs             support configurable GPFS [default=check]
  --with-cuda             support configurable CUDA [default=check]
  --with-gpuDirect        support configurable GPUDirect [default=check]
...

This seems to suggest that configure will check whether the required binaries/libraries/headers are present to support gpuDirect/CUDA (note that this does indeed seem to be the behaviour for GPFS).

Thus, I would expect that:

./configure

successfully autodetects that I have CUDA installed, and enables gpuDirect.

Attempt 1

However, when doing

./configure
make -j 128 V=1

It errors with:

mpicc  -g -O2     -Lcheck/lib64 -Wl,--enable-new-dtags -Wl,-rpath=check/lib64 -Lcheck/lib64 -Wl,--enable-new-dtags -Wl,-rpath=check/lib64 -o md-workbench md_workbench-md-workbench-main.o md_workbench-aiori.o md_workbench-aiori-DUMMY.o     md_workbench-aiori-MPIIO.o  md_workbench-aiori-MMAP.o md_workbench-aiori-POSIX.o          libaiori.a                 -lcufile -lcudart -lgpfs -lm
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `update_write_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:100: undefined reference to `update_write_memory_pattern_gpu'
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `generate_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:140: undefined reference to `generate_memory_pattern_gpu'
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `verify_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:186: undefined reference to `verify_memory_pattern_gpu'
[the same three undefined references are repeated here, interleaved with themselves, as parallel make links additional binaries against libaiori.a]
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:1026: ior] Error 1

The undefined references are defined in utilities-gpu.cu, and indeed, that file doesn't seem to get compiled into an object file:

[casparl@tcn1 ior-4.0.0]$ find -name *.cu.o
[casparl@tcn1 ior-4.0.0]$

It is also clear that link paths like -Lcheck/lib64 are not intended to be there: configure is taking the literal default value of the --with-cuda argument (check) and passing it as a search directory to the linker. Note that those come from e.g. this line. I think you should only append to LDFLAGS and CPPFLAGS if the user has passed a non-standard location as the argument; when the argument is still the default, just try to locate the headers. In my case they are on the CPATH and the compiler will find them just fine, so there is no need to append anything.
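A minimal configure.ac sketch of the behaviour I'd expect, assuming the usual AC_ARG_WITH idiom (illustrative only; the structure of IOR's actual configure.ac may differ):

AC_ARG_WITH([cuda],
  [AS_HELP_STRING([--with-cuda], [support configurable CUDA @<:@default=check@:>@])],
  [], [with_cuda=check])
# Only extend the search paths when an explicit prefix was given; for
# check/yes/no, rely on the compiler's default search paths (CPATH etc.)
AS_IF([test "x$with_cuda" != xcheck && test "x$with_cuda" != xyes && test "x$with_cuda" != xno],
      [CPPFLAGS="$CPPFLAGS -I$with_cuda/include"
       LDFLAGS="$LDFLAGS -L$with_cuda/lib64"])

With something like that in place, a plain ./configure leaves CPPFLAGS/LDFLAGS alone, and the stray -Lcheck/lib64 and -Iyes/include entries disappear.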

On a side note: I see you are setting an rpath in your LDFLAGS; you might want to reconsider that. It is not really standard behaviour and could e.g. cause issues when the CUDA installation is in a different location on the build machine than on the machine the binary runs on (not unthinkable on an HPC system). In my humble opinion, it's the end user's responsibility to make sure that linked libraries are found at runtime.

Attempt 2

In a second attempt, I was more explicit:

./configure --with-gpuDirect --with-cuda=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1

Note that the standard output of this configure invocation looks the same as before, but the config.log does not. From the plain ./configure command, I get:

$ cat config.log.bak | grep ^LDFLAGS
LDFLAGS=' -Lcheck/lib64 -Wl,--enable-new-dtags -Wl,-rpath=check/lib64 -Lcheck/lib64 -Wl,--enable-new-dtags -Wl,-rpath=check/lib64'
$ cat config.log.bak | grep ^CPPFLAGS
CPPFLAGS=' -Icheck/include -Icheck/include'
$ cat config.log.bak | grep GPU_DIRECT
| #define HAVE_GPU_DIRECT /**/
HAVE_GPU_DIRECT_FALSE=''
HAVE_GPU_DIRECT_TRUE='#'
#define HAVE_GPU_DIRECT /**/

But from ./configure --with-gpuDirect --with-cuda=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1 I get:

$ cat config.log | grep ^LDFLAGS
LDFLAGS=' -L/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1/lib64 -Wl,--enable-new-dtags -Wl,-rpath=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1/lib64 -Lyes/lib64 -Wl,--enable-new-dtags -Wl,-rpath=yes/lib64'
$ cat config.log | grep ^CPPFLAGS
CPPFLAGS=' -I/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1/include -Iyes/include'
$ cat config.log | grep GPU_DIRECT
| #define HAVE_GPU_DIRECT /**/
HAVE_GPU_DIRECT_FALSE='#'
HAVE_GPU_DIRECT_TRUE=''
#define HAVE_GPU_DIRECT /**/
#define HAVE_GPU_DIRECT /**/

Now, my build does complete, but that config.log still looks messy:

  1. It seems to be using yes as a prefix somewhere, see e.g. the CPPFLAGS that include -Iyes/include (presumably because a bare --with-gpuDirect leaves the corresponding shell variable at the literal string yes, which then gets treated as an installation prefix).
  2. It now defines HAVE_GPU_DIRECT twice.

Anyway, these points don't actually seem to break anything, but it would be nice to clean them up nonetheless.

Attempt 3

In a third attempt, I tried configuring with optimization flags. I'm building software for HPC systems, and we optimize all software by default for the hardware architecture on which it is going to be run.

CFLAGS="-O3 -mavx2 -mfma -fno-math-errno" ./configure --with-gpuDirect --with-cuda=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1

Now, I get:

$ make  -j 128 V=1
Making all in src
make  all-recursive
Making all in .
nvcc -O3 -mavx2 -mfma -fno-math-errno -c -o utilities-gpu.o utilities-gpu.cu
nvcc fatal   : Unknown option '-mavx2'
make[3]: *** [Makefile:3721: utilities-gpu.o] Error 1

This makes sense: nvcc doesn't know about such optimization flags. When and where to pass CFLAGS/CXXFLAGS/etc. is always a bit of a pain. However, it is my understanding that the best practice for CUDA code is not to pass CFLAGS/CXXFLAGS/etc. to nvcc as-is, since they are likely to contain arguments unknown to the CUDA compiler. The best example might be NVIDIA's own CUDA samples, see e.g. here, where the problem is solved by using NVCCFLAGS for nvcc-specific flags and passing the CFLAGS through nvcc's -Xcompiler option ("Specify options directly to the compiler/preprocessor", i.e. these are forwarded to the host compiler). In that case, replacing this with $(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) -c -o $@ $< is already a good first step and will avoid most issues, though you may want $(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) $(NVCCFLAGS) -c -o $@ $< to also let the user specify nvcc-specific flags, as sketched below.
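A minimal sketch of what that could look like as a suffix rule in Makefile.am (illustrative; I haven't checked exactly how the current rule is wired up):

.SUFFIXES: .cu .o

# Compile CUDA sources with nvcc, forwarding host-compiler flags via
# -Xcompiler and keeping a separate NVCCFLAGS for nvcc-specific options.
.cu.o:
	$(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) $(NVCCFLAGS) -c -o $@ $<

One caveat: $(addprefix ...) is a GNU make function, so this ties the build to GNU make.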

I manually made the replacement with $(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) -c -o $@ $< in Makefile.in and Makefile.am, and with that, I get:

nvcc -Xcompiler -O3 -Xcompiler -O2 -Xcompiler -mavx2 -Xcompiler -mfma -Xcompiler -fno-math-errno -c -o utilities-gpu.o utilities-gpu.cu

during the build, and indeed this build completes successfully.

JulianKunkel commented 3 months ago

Thanks for the detailed bug report and for trying things out; I know this is a bit too messy. Can you try adding --with-nvcc to the configure call, to see if that at least picks up the nvcc invocation? As you realized, nvcc is actually needed to compile utilities-gpu.cu.

Your feedback on the NVCC invocation makes sense; if the build at least attempts to use nvcc when --with-nvcc is given, the NVCCFLAGS fix can be added on top.

casparvl commented 3 months ago

You mean if

./configure --with-nvcc=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1/bin/nvcc

would trigger a compilation of utilities-gpu.cu? I'm afraid it doesn't. Configuring like that and building, I still get my original error, and indeed no object file is found for utilities-gpu.cu:

[casparl@tcn1 ior-4.0.0]$ find -name *.cu.o
[casparl@tcn1 ior-4.0.0]$

Just to check, I also tried:

./configure --with-cuda=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1

That also results in the same issue (utilities-gpu.cu not being compiled). I also tried:

./configure --with-gpuDirect

That does compile the object file for utilities-gpu.cu (and the build then completes, provided I put my fix in place to prefix the CFLAGS with -Xcompiler for the nvcc-compiled part).
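So to summarize, the combination that currently builds for me (together with the manual -Xcompiler change to the Makefile described above) is:

CFLAGS="-O3 -mavx2 -mfma -fno-math-errno" ./configure --with-gpuDirect --with-cuda=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1
make -j 128 V=1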