Purpose

It would be beneficial to further GPU refactoring (not to mention code readability and maintainability) to be able to separate CUDA function definitions and their usages into distinct compilation units, as we commonly do in typical C++ projects. In much earlier versions of CUDA this was not possible, but it is now supported (starting with CUDA 5.0; see discussion here). However, this requires modifications to the `tps` build process when CUDA is available.
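As a sketch of what separate compilation buys us (file and function names here are hypothetical, not from `tps`): a kernel can be defined in one translation unit and launched from another, with only a declaration shared between them. That cross-unit device reference is exactly what requires relocatable device code.

```cuda
// kernels.cpp (compiled as CUDA): the kernel definition lives here
__global__ void scale(double *x, double a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// solver.cpp (a *different* translation unit): only a declaration is
// visible; the device-side symbol is resolved at device-link time
__global__ void scale(double *x, double a, int n);

void scaleOnDevice(double *d_x, double a, int n) {
  scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
}
```

Without relocatable device code, nvcc requires the definition and every launch site of `scale` to live in the same translation unit.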
Approach
Specifically, it is necessary to compile with `-dc` (`--device-c`; see documentation here). This causes two complications.
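For reference, the difference on the per-file compile line is small (paths and extra flags illustrative, not the actual `tps` compile line):

```shell
# whole-program compilation: each unit must contain any kernels it launches
nvcc -x cu -c solver.cpp -o solver.o

# separate compilation: emit relocatable device code, so device symbols
# may be defined in other translation units
nvcc -x cu -dc solver.cpp -o solver.o
```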
First, configure checks (in our case the check for mfem) can be sensitive to where the `-dc` flag appears. Adding `-dc` directly to `CUDA_CXXFLAGS` in `configure.ac` (which then gets added to `CXXFLAGS`) leads to failures in `AC_LINK_IFELSE`, because the `-dc` flag prevents generation of an executable. The fix is to treat `-dc` as analogous to `-c`, in the sense that the build system is responsible for using it appropriately; it is not added to `CXXFLAGS` at configure time. See 3aca83e for more details.
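The failure mode is easy to see from the shape of such a check. `AC_LINK_IFELSE` compiles *and links* a small test program into an executable; with `-dc` in the flags, nvcc stops at a relocatable object, so the check reports failure even when the library is fine. A sketch of this kind of check (illustrative, not the actual `tps` mfem check):

```m4
dnl AC_LINK_IFELSE builds conftest and links it into an executable;
dnl -dc in CXXFLAGS makes this link step impossible
AC_LANG_PUSH([C++])
AC_LINK_IFELSE(
  [AC_LANG_PROGRAM([[#include <mfem.hpp>]], [[mfem::Mesh mesh;]])],
  [AC_MSG_RESULT([yes])],
  [AC_MSG_ERROR([could not link against mfem])])
AC_LANG_POP([C++])
```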
Second, linking shared libraries, as we are now doing to support Python, requires a two-stage process. Specifically, after compiling each `.cpp` file with `-dc`, we need an intermediate "device-link" step prior to linking `libtps.so`. This step is implemented with a custom build rule that creates a new object called `tmp_cuda_object.o`. See 1c71f1b for the full approach.
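A minimal sketch of the two-stage link in Makefile syntax (the rule in `tps` is more involved; `tmp_cuda_object.o` matches the text, the rest is illustrative):

```makefile
# 1. compile each source with relocatable device code
%.o: %.cpp
	$(NVCC) $(CUDA_CXXFLAGS) -dc -o $@ $<

# 2. device-link step: resolve cross-file device symbols into one object
tmp_cuda_object.o: $(OBJS)
	$(NVCC) -dlink -o $@ $(OBJS)

# 3. final host link pulls in both the ordinary objects and the
#    device-linked object when building the shared library
libtps.so: $(OBJS) tmp_cuda_object.o
	$(CXX) -shared -o $@ $(OBJS) tmp_cuda_object.o -lcudart
```

Note that for a shared library all objects, including the device-linked one, generally need to be built as position-independent code.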
Potential Issues
It has been noted (e.g., here) that applications produced with `-dc` may be slower than analogous code built without it, because some optimizations (e.g., inlining of device functions across translation units) are no longer possible. With the current code I have seen only minimal effects (less than a 1% change in timings on lassen for the performance test from #131), but this is something to keep in mind going forward.
The implementation of the two-stage build, specifically the generation of `tmp_cuda_object.o`, is a bit brittle. We need to be aware of changes that could break it (e.g., adding more files to `utils/mfem_extras`).