johnmatt3 closed this issue 2 years ago.
Thank you very much for your comprehensive bug report.
This is likely a hardware compatibility issue, since you have a relatively new GPU. The first thing I would try is editing Makefile.am to use the correct gencode flags for your GPU. You have an RTX 3080 and you're using CUDA 11.7, so you should have something like:
AM_NVCC_FLAGS += -gencode arch=compute_80,code=sm_80
AM_NVCC_FLAGS += -gencode arch=compute_86,code=sm_86
AM_NVCC_FLAGS += -gencode arch=compute_87,code=sm_87
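Could you please remove any other gencode lines, then recompile and re-run the tests? (The RTX 3080 itself is compute capability 8.6, so the sm_86 line is the one that targets your card.)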
You can read more about gencodes (essentially GPU architecture flags) here: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
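As a rough sketch of how the flag lines above are formed: the number after `compute_` and `sm_` is the GPU's compute capability with the dot removed. The helper name `gencode_line` below is made up for illustration and is not part of SPRAL or CUDA.

```shell
#!/bin/sh
# Hypothetical helper (not part of SPRAL): turn a compute capability
# such as "8.6" into a Makefile.am line in the form shown above.
gencode_line() {
    # Drop the dot and any stray whitespace: "8.6" becomes "86".
    cap=$(printf '%s' "$1" | tr -d '. \r')
    printf 'AM_NVCC_FLAGS += -gencode arch=compute_%s,code=sm_%s\n' "$cap" "$cap"
}

# Example for an RTX 3080 (compute capability 8.6).
gencode_line 8.6
```

On recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` should report the compute capability to pass in. I am assuming that query field is available in your driver version; if it is not, the table at the link above lists the value for each card.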
The GPU version of SPRAL was developed and tested on an old Nvidia Tesla K40c GPU. While I wouldn't recommend using such old hardware, it did pass all the make check tests when I tested it last year, although that was a regular build that did not use IPOPT. Unfortunately, we have since lost access to this GPU (it was scrapped for being too old), so I am unable to verify whether an IPOPT build would also pass all the make check tests on it.
I am trying to follow the instructions at https://gist.github.com/tasseff/ee61ef6c15d3c54e0a6b3e488f2a65be to get IPOPT running with the GPU version of SPRAL. The CPU version works fine, but the GPU version segfaults on the test problem suggested in the instructions. Digging further, it appears that SPRAL's SSIDS test fails (i.e. make check fails). I am fairly confident that I have included all of the information needed to reproduce my situation below.
Is it possible that this is a hardware compatibility issue? I tested this on an RTX 3080 Founders Edition, not a professional card. Is there any set of known working hardware that I can just buy?
My system specs:
CPU: i5-10400F
GPU: Nvidia RTX 3080 Founders Edition
From nvidia-smi: Driver Version: 515.43.04, CUDA Version: 11.7
Installed Ubuntu 18.04 from this ISO: https://releases.ubuntu.com/18.04/ubuntu-18.04.6-desktop-amd64.iso
After installation I elected not to install any Ubuntu updates. (I believe that on previous attempts at this entire process I accepted all updates, without upgrading to Ubuntu 20, with the same result.) I also attempted these instructions on Ubuntu 20 and did not get it to work, although most of my efforts have been on Ubuntu 18.
setup:
compile METIS
get CUDA on ubuntu 18, from: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=deb_local
Get hwloc. The instructions suggest this:
but I was advised to try compiling from source and specifying the CUDA version, as in https://www.open-mpi.org/projects/hwloc/doc/v2.7.1/a00373.php#faq_cuda_build
I found the bug was the same in both cases.
Get SPRAL:
Build and install SPRAL (I changed the flags from the instructions to add debugging info):
export required variables
Now install Ipopt (I added the --enable-debug flag to configure for debugging):
I have also previously tried an alternative METIS/SPRAL installation, as in https://github.com/ralna/spral/issues/31, but this didn't seem to solve the problem either.
Test:
output looks as expected:
test gpu version:
results in segfault:
then debugging:
The backtrace shows it's in SSIDS?
I have also seen the segfault occur in a BuddyAllocator in other runs:
Output of nvidia-smi, run in another console at the same time as the GPU version of the problem (note that the machine name changed because I did a full reinstall, following the instructions listed above, to make sure I hadn't corrupted anything; I saw the same behavior):
Digging further, running SPRAL's built-in check:
The contents of test-suite.log: