ralna / spral

Sparse Parallel Robust Algorithms Library
https://ralna.github.io/spral/

make check fails when compiling for gpu #178

Open johnmatt3 opened 10 months ago

johnmatt3 commented 10 months ago

I am trying to install spral with GPU support (for eventual use in Ipopt). My hardware (and goal) is the same as in a previous issue. Feel free to close this issue if the near-term plan is to move entirely to meson in a way that will reliably build for use in Ipopt. I also tried the current meson build method: setting the gpu option resulted in an error because meson didn't know how to compile .cu files, and after I added cuda to the project languages it couldn't find nvcc to compile with (even though nvcc is available on the command line). I presume this approach is being fixed up based on this issue. Anyway, here are my results with the configure/make instructions I could find.

My software setup is a fresh install of Ubuntu 22.04 LTS:

sudo apt-get install -y git build-essential gfortran pkg-config
sudo apt-get install -y libopenblas-dev
sudo apt-get install -y libudev-dev
sudo apt-get install -y autoconf libtool

Get CUDA from: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=deb_local

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.1-545.23.08-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

export CUDA_HOME="/usr/local/cuda" # Change this to your system-specific path. (this should be good here)
export PATH="${PATH}:${CUDA_HOME}/bin"
export LIBRARY_PATH="${LIBRARY_PATH}:${CUDA_HOME}/lib64"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64"
export C_INCLUDE_PATH="${C_INCLUDE_PATH}:${CUDA_HOME}/include"
export CPLUS_INCLUDE_PATH="${CPLUS_INCLUDE_PATH}:${CUDA_HOME}/include"
export NVCC_INCLUDE_FLAGS="${NVCC_INCLUDE_FLAGS}:-I${CUDA_HOME}/include"

sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-drivers-545
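
As a quick sanity check at this point (a minimal sketch, not part of the original instructions), confirm that the toolkit and driver are both visible:

nvcc --version   # toolkit on PATH
nvidia-smi       # driver loaded, GPU visible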

Not sure if this hwloc will give me GPU support. The instructions suggest it must be installed from source, but then I believe spral wouldn't even compile (using whatever the git clone of hwloc gave me):

sudo apt-get install -y hwloc libhwloc-dev
sudo reboot now

Redo the CUDA exports above. I'm going to try coin-or's metis in the hope of getting compatibility with Ipopt down the road (from https://github.com/ralna/spral/blob/master/COMPILE.md):

rm -rf ${HOME}/Software
mkdir -p ${HOME}/Software
cd ${HOME}/Software
git clone https://github.com/coin-or-tools/ThirdParty-Metis.git
cd ThirdParty-Metis && ./get.Metis
mkdir build
cd build
../configure --prefix=${PWD}
make && make install
export METISDIR=${HOME}/Software/ThirdParty-Metis/build

Then compile spral:

cd ${HOME}/Software/
git clone https://github.com/ralna/spral.git
cd spral
git checkout 5e8b409
./autogen.sh # If compiling from fresh git checkout
mkdir build
cp nvcc_arch_sm.c build/ # If building for GPU
cd build
CFLAGS=-fPIC CPPFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fPIC \
   FCFLAGS=-fPIC NVCCFLAGS="-shared -Xcompiler -fPIC" \
   ../configure --prefix=${PWD}/build \
   --with-blas="-lopenblas" --with-lapack="-llapack" \
   --with-metis="-L${METISDIR}/lib -lcoinmetis" \
   --with-metis-inc-dir="${METISDIR}/include/coin-or/metis"
make && make install

export OMP_CANCELLATION=TRUE
export OMP_PROC_BIND=TRUE
export SPRALDIR=${PWD}

make check

It compiles fine; make check results in:

PASS: lsmr_test
PASS: rutherford_boeing_test
PASS: scaling_test
PASS: random_test
PASS: random_matrix_test
../test-driver: line 112: 16120 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: ssids_test
PASS: ssids_kernel_test
PASS: ssmfe_test
PASS: ssmfe_ciface_test
============================================================================
Testsuite summary for spral 2023.07.04
============================================================================
# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to hsl@stfc.ac.uk
============================================================================
make[2]: *** [Makefile:2613: test-suite.log] Error 1
make[2]: Leaving directory '/home/john/Software/spral/build'
make[1]: *** [Makefile:2721: check-TESTS] Error 2
make[1]: Leaving directory '/home/john/Software/spral/build'
make: *** [Makefile:2989: check-am] Error 2

Thanks!

jfowkes commented 10 months ago

Thank you for the report, could you post the contents of test-suite.log so we can dig a bit deeper? Yes, we plan to use meson to build for GPU as well, with a view to making it the default build system; @amontoison is currently looking into this.

johnmatt3 commented 10 months ago

Oh, it just can't find libcoinmetis.so.2:

========================================
   spral 2023.07.04: ./test-suite.log
========================================

# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: ssids_test
================

./ssids_test: error while loading shared libraries: libcoinmetis.so.2: cannot open shared object file: No such file or directory
FAIL ssids_test (exit status: 127)

Can confirm that lib is where I would have expected it to be:

john@spralTest:~/Software/spral/build$ ls -lart ~/Software/ThirdParty-Metis/build/lib 
total 312
-rwxr-xr-x 1 john john 302064 Dec  7 13:13 libcoinmetis.so.2.0.0
lrwxrwxrwx 1 john john     21 Dec  7 13:13 libcoinmetis.so.2 -> libcoinmetis.so.2.0.0
lrwxrwxrwx 1 john john     21 Dec  7 13:13 libcoinmetis.so -> libcoinmetis.so.2.0.0
-rwxr-xr-x 1 john john    969 Dec  7 13:13 libcoinmetis.la
drwxrwxr-x 2 john john   4096 Dec  7 13:13 pkgconfig
drwxrwxr-x 3 john john   4096 Dec  7 13:13 .
drwxrwxr-x 6 john john   4096 Dec  7 13:13 ..
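
For reference, a quick way to list exactly which shared libraries a binary fails to resolve (a diagnostic sketch using the standard ldd tool, not something run in the thread):

ldd ./ssids_test | grep "not found"   # prints each unresolved .so dependency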

If it helps, my formal training is in mechanical engineering rather than CS, so there might be something very simple I'm missing. I just reinstalled Ubuntu and followed the instructions listed above again (i.e. if there's something simple you are assuming competent people do with their Ubuntu installs that isn't listed above, I am not competent, and I did not do it). Also note that I have been running from spral's commit 5e8b409, since that was the last time COMPILE.md was modified, but I just tried again from spral's master branch with the same result and log.

Thanks!

amontoison commented 10 months ago

Can you try with this precompiled metis?

johnmatt3 commented 10 months ago

Following up on the error: it looked like LD_LIBRARY_PATH might help spral find the metis, so I switched to your precompiled metis.

Then, with METISDIR pointing to your precompiled metis:

john@spralTest:~/Software/spral/build$ echo ${METISDIR}
/home/john/Software/metis-5.1.2

I did

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${METISDIR}/lib
sudo ldconfig

so

john@spralTest:~/Software/spral/build$ echo ${LD_LIBRARY_PATH}
:/usr/local/cuda/lib64:/home/john/Software/metis-5.1.2/lib

I reconfigured a new checkout of spral with:

CFLAGS=-fPIC CPPFLAGS=-fPIC CXXFLAGS=-fPIC FFLAGS=-fPIC \
   FCFLAGS=-fPIC NVCCFLAGS="-shared -Xcompiler -fPIC" \
   ../configure --prefix=${PWD}/build \
   --with-blas="-lopenblas" --with-lapack="-llapack" \
   --with-metis="-L${METISDIR}/lib -lmetis" \
   --with-metis-inc-dir="${METISDIR}/include/"

make check gets further:

PASS: lsmr_test
PASS: rutherford_boeing_test
PASS: scaling_test
PASS: random_test
PASS: random_matrix_test
../test-driver: line 112: 12903 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: ssids_test
PASS: ssids_kernel_test
PASS: ssmfe_test
PASS: ssmfe_ciface_test
============================================================================
Testsuite summary for spral 2023.07.04
============================================================================
# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to hsl@stfc.ac.uk
============================================================================
make[2]: *** [Makefile:2613: test-suite.log] Error 1
make[2]: Leaving directory '/home/john/Software/spral/build'
make[1]: *** [Makefile:2721: check-TESTS] Error 2
make[1]: Leaving directory '/home/john/Software/spral/build'
make: *** [Makefile:2989: check-am] Error 2
john@spralTest:~/Software/spral/build$ more test-suite.log 
========================================
   spral 2023.07.04: ./test-suite.log
========================================

# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: ssids_test
================

FAIL ssids_test (exit status: 139)

I thought it might be that my installation of hwloc (from apt-get: hwloc, libhwloc-dev) is now holding me back because it wasn't compiled on my machine with GPU support? I then tried:

sudo apt remove hwloc libhwloc-dev
cd $HOME/Software
git clone https://github.com/open-mpi/hwloc.git
cd hwloc
./autogen.sh
./configure --with-cuda-version=12.3
make 
sudo make install
sudo ldconfig

But then, when recompiling spral, make fails:

...
In file included from ../src/hw_topology/guess_topology.cxx:20:
../src/hw_topology/hwloc_wrapper.hxx: In member function ‘std::vector<hwloc_obj*> spral::hw_topology::HwlocTopology::get_numa_nodes() const’:
../src/hw_topology/hwloc_wrapper.hxx:54:58: error: ‘HWLOC_OBJ_NODE’ was not declared in this scope; did you mean ‘HWLOC_OBJ_CORE’?
   54 |       int nregions = hwloc_get_nbobjs_by_type(topology_, HWLOC_OBJ_NODE);
      |                                                          ^~~~~~~~~~~~~~
      |                                                          HWLOC_OBJ_CORE
make[1]: *** [Makefile:2496: src/hw_topology/guess_topology.o] Error 1
make[1]: Leaving directory '/home/john/Software/spral/build'
make: *** [Makefile:1660: all] Error 2
...

Then I removed hwloc's repo and tried to reinstall hwloc and libhwloc-dev, but recompiling gives me the same error. Something is messed up about my system, so I'll just do a fresh install.

jfowkes commented 10 months ago

Yeah, I would use the precompiled dependencies that ship with Ubuntu. For our CI tests on Ubuntu we do:

sudo apt-get install libhwloc-dev libudev-dev libmetis-dev libopenblas-dev

and then set up as follows:

./autogen.sh
./configure CC=gcc \
            CXX=g++ \
            F77=gfortran \
            FC=gfortran \
            LIBS=-lstdc++ \
            CFLAGS="-g -O3 -Wall -fopenmp" \
            CXXFLAGS="-g -O3 -std=c++17 -Wall -fopenmp" \
            FCFLAGS="-g -O3 -Wall -fopenmp -pedantic" \
            --with-blas="-lopenblas" \
            --with-lapack="-lopenblas" \
            --with-metis="-lmetis"
make

which works for us. We're working on getting a VM so that we can test GPU on the CI as well.

jfowkes commented 10 months ago

But of course in your case you'll need to compile hwloc from source with CUDA support. I wouldn't use the development master branch for this, though; version 2.8.0 of hwloc has worked for us in the past: https://www.open-mpi.org/software/hwloc/v2.8/

johnmatt3 commented 9 months ago

Installing libmetis-dev, compiling hwloc from the v2.8 branch, calling ldconfig, and then configuring with the line above gets the make check tests to pass. I'm not sure if the tests exercise the GPU, but spamming nvidia-smi while make check is running doesn't seem to show any GPU usage.
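
For anyone following along, a less manual way to watch for GPU activity than repeatedly running nvidia-smi (a sketch; dmon ships with the NVIDIA driver utilities):

nvidia-smi dmon -s u   # prints per-second GPU and memory utilization rows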

One other issue is getting Ipopt to use spral, which I believe requires a .so shared library rather than libspral.a. Following the directions in the README:

gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lblas -llapack -lhwloc -lmetis -lstdc++ -o libspral.so

gives me

john@spralTest:~/Software/spral$ gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lblas -llapack -lhwloc -lmetis -lstdc++ -o libspral.so
/usr/bin/ld: libspral.a(solve.o): warning: relocation against `_ZN40_GLOBAL__N__fb1c79fb_8_solve_cu_7bcf73a316reducing_d_solveILi256ELb1EEEvPNS_23reducing_d_solve_lookupEPdPKd' in read-only section `.text'
/usr/bin/ld: libspral.a(assemble.o): relocation R_X86_64_PC32 against symbol `_ZN44_GLOBAL__N__2339ee2f_11_assemble_cu_df8d376113cu_load_nodesEPKNS_15load_nodes_typeEPKlPKd' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status

which suggests some CUDA code hasn't been compiled with -fPIC? I tried reconfiguring spral, adding -fPIC and the nvcc flags from the earlier instructions, and also -lcudadevrt et al. to the LIBS argument:

./configure CC=gcc \
            CXX=g++ \
            F77=gfortran \
            FC=gfortran \
            LIBS="-lstdc++ -lcudadevrt -lcudart -lcuda -lcublas" \
            CFLAGS="-g -O3 -Wall -fPIC" \
            CXXFLAGS="-g -O3 -std=c++17 -Wall -fopenmp -fPIC"  \
            FCFLAGS="-g -O3 -Wall -pedantic -fopenmp -fPIC" \
            NVCCFLAGS="-shared -Xcompiler -fPIC" \
            --with-blas="-lopenblas" \
            --with-lapack="-lopenblas" \
            --with-metis="-lmetis"

which configures fine, although during the build I see some warnings, including these, which seem worrisome considering that sm_86 should be the right code for my 3080 GPU?

nvlink warning : Skipping incompatible '/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_86)
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_86)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_86)
nvlink warning : Skipping incompatible '/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)

Regardless, the code builds fine, but then make check fails with:

PASS: lsmr_test
PASS: rutherford_boeing_test
PASS: scaling_test
PASS: random_test
PASS: random_matrix_test
./test-driver: line 112: 109959 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: ssids_test
PASS: ssids_kernel_test
PASS: ssmfe_test
PASS: ssmfe_ciface_test
============================================================================
Testsuite summary for spral 2023.11.15
============================================================================
# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to hsl@stfc.ac.uk
============================================================================
make[2]: *** [Makefile:2618: test-suite.log] Error 1
make[2]: Leaving directory '/home/john/Software/spral'
make[1]: *** [Makefile:2726: check-TESTS] Error 2
make[1]: Leaving directory '/home/john/Software/spral'
make: *** [Makefile:2994: check-am] Error 2

with a test-suite.log of

john@spralTest:~/Software/spral$ more test-suite.log 
========================================
   spral 2023.11.15: ./test-suite.log
========================================

# TOTAL: 9
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: ssids_test
================

FAIL ssids_test (exit status: 139)

But I am able to create a .so with: gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lblas -llapack -lhwloc -lmetis -lstdc++ -lcudadevrt -lcudart -lcuda -lcublas -o libspral.so

From there I can also compile Ipopt and get it to run with spral successfully. However, I can't seem to get it to ever use the GPU, regardless of the cpu/gpu weighting and min_work settings.
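
For context, the Ipopt side is configured with the usual COIN-OR per-package flags; a sketch, assuming libspral.so ended up in ${HOME}/Software/spral (see the SPRAL section of https://coin-or.github.io/Ipopt/INSTALL.html for the authoritative options):

./configure --with-spral-cflags="-I${HOME}/Software/spral/include" \
            --with-spral-lflags="-L${HOME}/Software/spral -lspral -lgomp -lopenblas -lhwloc -lmetis -lstdc++"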

One interesting thing happened when I tried adding the CUDA libs but not -fPIC:

./configure CC=gcc \
            CXX=g++ \
            F77=gfortran \
            FC=gfortran \
            LIBS="-lstdc++ -lcudadevrt -lcudart -lcuda -lcublas" \
            CFLAGS="-g -O3 -Wall -fopenmp" \
            CXXFLAGS="-g -O3 -std=c++17 -Wall -fopenmp" \
            FCFLAGS="-g -O3 -Wall -fopenmp -pedantic" \
            --with-blas="-lopenblas" \
            --with-lapack="-lopenblas" \
            --with-metis="-lmetis"

Then make worked, with fewer warnings about CUDA codes (sm_86 no longer gets skipped?):

nvlink warning : Skipping incompatible '/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_80)
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_80)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_80)
nvlink warning : Skipping incompatible '/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_80)
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_80)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_80)
nvlink warning : Skipping incompatible '/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)
nvlink warning : Skipping incompatible '/usr/lib/x86_64-linux-gnu/libpthread.a' when searching for -lpthread (target: sm_87)

make check succeeds, but creating the .so fails again, with a different error:

gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lblas -llapack -lhwloc -lmetis -lstdc++ -lcudadevrt -lcudart -lcuda -lcublas -o libspral.so
/usr/bin/ld: libspral.a(solve.o): warning: relocation against `_ZN40_GLOBAL__N__fb1c79fb_8_solve_cu_7bcf73a316reducing_d_solveILi256ELb1EEEvPNS_23reducing_d_solve_lookupEPdPKd' in read-only section `.text'
/usr/bin/ld: libspral.a(NumericSubtree.o): relocation R_X86_64_PC32 against symbol `__libc_single_threaded@@GLIBC_2.32' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
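
For what it's worth, ld names the offending objects directly, but the whole archive can be scanned for objects still carrying the rejected relocation type (a diagnostic sketch; paths assumed):

mkdir -p /tmp/spral-objs && cd /tmp/spral-objs
ar x ~/Software/spral/libspral.a   # unpack the archive's object files
for obj in *.o; do
  # R_X86_64_PC32 relocations against global symbols are what ld rejects
  # when building a shared object from non-PIC code
  readelf -r "$obj" | grep -q R_X86_64_PC32 && echo "$obj"
done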

Is there a way to use libspral.a in Ipopt? Is there a way, with just spral itself, to test whether it is using the GPU? Should the GPU be used in make check?

amontoison commented 9 months ago

If you compile SPRAL with OpenBLAS, you need different link flags for generating the shared library:

gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lopenblas -lhwloc -lmetis -lstdc++ -o libspral.so

If you use a GPU, you probably need to add additional flags:

gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lopenblas -lhwloc -lmetis -lstdc++ -lcudadevrt -lcudart -lcuda -lcublas -o libspral.so

If you generate a shared library, you just need the link flag -lspral. Otherwise, if you use libspral.a, you need to provide the link flags of all its dependencies: https://coin-or.github.io/Ipopt/INSTALL.html (section SPRAL).

For the question about make check and testing GPU features, @jfowkes will probably help. Next year we will support the GPU version of SPRAL with Meson; it will simplify a lot of things...

amontoison commented 8 months ago

@johnmatt3 Can you try to compile SPRAL with Meson? We added GPU support for this build system in the new release.

johnmatt3 commented 7 months ago

Thanks for the update! I enlisted a real software buddy of mine (henceforth referred to as the linux sherpa) to help guide me through this attempt, so I got a bit further than usual. Note that I ran this on the code from a couple of weeks ago.

TLDR: spral appears to be finding, but not using, my GPU (to my untrained eye it looks like an issue with the guess_topology logic?). If I run as root and force the hwloc wrapper code to use my GPU, then ssidst loads something onto the GPU, but the tests error out (with what looks like a combination of "fail residual" errors and a potential memory allocation issue).

My configuration for this attempt:

CPU build is fine

meson setup builddir -Dexamples=true -Dtests=true -Dlibblas=openblas -Dliblapack=openblas --reconfigure
meson compile -C builddir
cd builddir
./ssidst

The ssidst tests all seem to pass OK; the total number of errors is 0.

Basic GPU build fails

In the shell:

export CUDA_HOME="/usr/local/cuda"
export LIBRARY_PATH="${LIBRARY_PATH}:${CUDA_HOME}/lib64"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64"
export C_INCLUDE_PATH="${C_INCLUDE_PATH}:${CUDA_HOME}/include"
export CPLUS_INCLUDE_PATH="${CPLUS_INCLUDE_PATH}:${CUDA_HOME}/include"
export NVCC_INCLUDE_FLAGS="${NVCC_INCLUDE_FLAGS}:-I${CUDA_HOME}/include"
export OMP_CANCELLATION=TRUE
export OMP_PROC_BIND=TRUE

# "gpu build and test instructions"
rm -rf builddir
meson setup builddir -Dexamples=true -Dtests=true -Dlibblas=openblas -Dliblapack=openblas -Dgpu=true --reconfigure
meson compile -C builddir
cd builddir
./ssidst

The build shows gpu: true and finds the various CUDA libraries successfully:

...
Cuda compiler for the host machine: nvcc (nvcc 12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0)
Cuda linker for the host machine: nvcc nvlink 12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0
Library openblas found: YES
Library openblas found: YES
Library metis found: YES
Library hwloc found: YES
Run-time dependency CUDA (modules: cudart_static, rt, pthread, dl, cublas) found: YES 12.3 (/usr/local/cuda)
Library m found: YES
Has header "cblas.h" : YES
Has header "hwloc.h" : YES
Build targets in project: 45

SPRAL 2024.01.18

User defined options
examples : true
gpu : true
libblas : openblas
liblapack: openblas
tests : true

Found ninja-1.10.1 at /usr/bin/ninja
INFO: autodetecting backend as ninja
INFO: calculating backend command to run: /usr/bin/ninja -C /home/john/Software/spral2/builddir
ninja: Entering directory `/home/john/Software/spral2/builddir'
[195/195] Linking target kernelst_cpp
...

Then the tests all pass, but nvidia-smi shows no new process on the GPU and no apparent additional GPU usage beyond baseline.

Looking in the code, we added some printfs to see if the GPU is getting used, whether hwloc is finding it, etc. (This is where I would have turned back and tried to get off the mountain alive without my linux sherpa.)

john@spralTest:~/Software/spral2$ git diff src/hw_topology/hwloc_wrapper.hxx
diff --git a/src/hw_topology/hwloc_wrapper.hxx b/src/hw_topology/hwloc_wrapper.hxx
index abd01b3..5504588 100644
--- a/src/hw_topology/hwloc_wrapper.hxx
+++ b/src/hw_topology/hwloc_wrapper.hxx
@@ -37,12 +37,15 @@ public:
/** \brief Constructor */
HwlocTopology() {
hwloc_topology_init(&topology_);
+ printf("\n**********\nmaking a hwlocTopology:\n");
#if HWLOC_API_VERSION >= 0x20000
+ printf("\nHWLOC_API_VERSION >= 0x20000\n");
hwloc_topology_set_type_filter(topology_, HWLOC_OBJ_OS_DEVICE,
HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
hwloc_topology_set_type_filter(topology_, HWLOC_OBJ_PCI_DEVICE,
HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
#else /* HWLOC_API_VERSION */
+ printf("\nHWLOC_API_VERSION !!!!!!>= 0x20000\n");
hwloc_topology_set_flags(topology_, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
#endif /* HWLOC_API_VERSION */
hwloc_topology_load(topology_);
@@ -82,10 +85,12 @@ public:
std::vector<int> get_gpus(hwloc_obj_t const& obj) const {
std::vector<int> gpus;
#ifdef HAVE_NVCC
+ printf("\nHAVE_NVCC");
int ngpu;
cudaError_t cuda_error = cudaGetDeviceCount(&ngpu);
+ printf("\ngpus from cudaGetDeviceCount = %d\n", ngpu);
if(cuda_error != cudaSuccess) {
- //printf("Error using CUDA. Assuming no GPUs.\n");
+ printf("\nError using CUDA. Assuming no GPUs.");
return gpus; // empty
}
/* Now for each device search up its topology tree and see if we
@@ -95,6 +100,7 @@ public:
// hwloc_obj_t p = hwloc_cudart_get_device_osdev_by_index(topology_, i);
hwloc_obj_t p = hwloc_cudart_get_device_pcidev(topology_, i);
for(; p; p=p->parent) {
+ printf("looking to see if parent is obj p = %p, obj = %p\n", p, obj);
if(p==obj) {
gpus.push_back(i);
break;

running "gpu build and test instructions", the configure and compilation appear successful. The tests yelds a lot of

**********
making a hwlocTopology:

HWLOC_API_VERSION >= 0x20000

prints, but no HAVE_NVCC prints or any of the other printfs. This suggested hwloc was not finding my GPU and that HAVE_NVCC was false.

Forcing HAVE_NVCC to true, hwloc finds my GPU, but fails to select it for use?

We surmised that HAVE_NVCC should be defined when -Dgpu=true, so we added this line at line 92 of meson.build:

add_global_arguments('-DHAVE_NVCC', language : 'cpp')
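
(To confirm the define actually reaches the C++ compile lines, the verbose build output can be grepped; a small sketch:)

meson compile -C builddir --verbose 2>&1 | grep -m1 -- -DHAVE_NVCC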

Then, rerunning the "gpu build and test instructions", we get a lot of prints; the script ends like this:

...
==================
Testing big matrix
==================
* n = 2000 nza = 10000...
**********
making a hwlocTopology:

HWLOC_API_VERSION >= 0x20000

HAVE_NVCC
gpus from cudaGetDeviceCount = 1
looking to see if parent is obj p = 0x55aeb2e9f200, obj = 0x55aeb2df6780
looking to see if parent is obj p = 0x55aeb473fea0, obj = 0x55aeb2df6780
num_flops: 492.4 ok...
==========================
Total number of errors = 0

Still no new process shows up in nvidia-smi, and no appreciable extra GPU usage is reported. From our understanding, these printfs suggest that it does find my GPU (gpus from cudaGetDeviceCount = 1), but as it goes up the hwloc_obj_t hierarchy looking for the NUMA node that was passed in from guess_topology.cxx (?), it fails to find that node: none of the parent pointers (p = ...) match the NUMA node pointer (obj = ...). Does my GPU belong to some inaccessible NUMA node? As a test, I ran OpenAI's whisper code on my GPU, and it creates a process in nvidia-smi, so the GPU is usable in some sense.
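
One way to see where hwloc actually places the GPU relative to the NUMA nodes (a sketch; the lstopo tool ships with hwloc):

lstopo --of console | grep -i -B2 -A2 -E "cuda|nvidia"   # show the GPU's neighbourhood in the topology tree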

Digging deeper (and this is where I definitely would have died on the mountain without a linux sherpa): running strace ./ssidst resulted in lots of lines that look like this:

openat(-1, "/sys/bus/node/devices/node0/access1/initiators", O_RDONLY|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(-1, "/sys/bus/node/devices/node0/access0/initiators", O_RDONLY|O_DIRECTORY) = -1 ENOENT (No such file or directory)
faccessat2(-1, "/sys/bus/node/devices/node0/access1/initiators", X_OK, 0) = -1 ENOENT (No such file or directory)
openat(-1, "/sys/bus/node/devices/node0/access0/initiators/read_bandwidth", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(-1, "/sys/bus/node/devices/node0/access0/initiators/write_bandwidth", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(-1, "/sys/bus/node/devices/node0/access0/initiators/read_latency", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(-1, "/sys/bus/node/devices/node0/access0/initiators/write_latency", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(-1, "/sys/devices/virtual/dmi/id", O_RDONLY|O_DIRECTORY) = 17
newfstatat(17, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
fcntl(17, F_GETFL) = 0x18000 (flags O_RDONLY|O_LARGEFILE|O_DIRECTORY)
fcntl(17, F_SETFD, FD_CLOEXEC) = 0
close(17) = 0
openat(-1, "/sys/devices/virtual/dmi/id/product_name", O_RDONLY) = 17
read(17, "System Product Name\n", 63) = 20
close(17) = 0
openat(-1, "/sys/devices/virtual/dmi/id/product_version", O_RDONLY) = 17
read(17, "System Version\n", 63) = 15
close(17) = 0
openat(-1, "/sys/devices/virtual/dmi/id/product_serial", O_RDONLY) = -1 EACCES (Permission denied)
openat(-1, "/sys/devices/virtual/dmi/id/product_uuid", O_RDONLY) = -1 EACCES (Permission denied)
openat(-1, "/sys/devices/virtual/dmi/id/board_vendor", O_RDONLY) = 17

Running ssidst as root loads it onto the GPU (!), but the random matrix test SIGABRTs and core dumps

The permission denied errors suggested running as root (although we still configure and compile as a non-root user):

sudo bash

export CUDA_HOME="/usr/local/cuda"
export LIBRARY_PATH="${LIBRARY_PATH}:${CUDA_HOME}/lib64"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64"
export C_INCLUDE_PATH="${C_INCLUDE_PATH}:${CUDA_HOME}/include"
export CPLUS_INCLUDE_PATH="${CPLUS_INCLUDE_PATH}:${CUDA_HOME}/include"
export NVCC_INCLUDE_FLAGS="${NVCC_INCLUDE_FLAGS}:-I${CUDA_HOME}/include"
export OMP_CANCELLATION=TRUE
export OMP_PROC_BIND=TRUE

./ssidst

which finally popped a new process up on nvidia-smi!

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        On  | 00000000:01:00.0  On |                  N/A |
| 30%   60C    P2             98W / 320W  |  1260MiB / 10240MiB  |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|    0   N/A  N/A      1643      G   /usr/lib/xorg/Xorg                          568MiB |
|    0   N/A  N/A      1807      G   /usr/bin/gnome-shell                         74MiB |
|    0   N/A  N/A      5510    C+G   ...seed-version=20240110-180219.406000      117MiB |
|    0   N/A  N/A    275750      G   ...irefox/3626/usr/lib/firefox/firefox        0MiB |
|    0   N/A  N/A   2019855      G   ...sion,SpareRendererForSitePerProcess      127MiB |
|    0   N/A  N/A   4099363      C   ./ssidst                                    214MiB | <---- this guy
+---------------------------------------------------------------------------------------+

Turning off those prints (which behave as expected, given the GPU is now returned regardless of NUMA node parentage), all tests are "ok" until:

=======================
Testing random matrices
=======================
- no. 1 n = 1 nza = 1... num_flops: 0.0 ok...
- no. 2 n = 2 nza = 2... num_flops: 0.0 ok...
- no. 3 n = 3 nza = 4... num_flops: 0.0 f+s fail residual 1d = 9.0189E-02
- no. 4 n = 4 nza = 6... num_flops: 0.0 f+s fail residual 1d = 8.4052E-01
- no. 5 n = 5 nza = 12... num_flops: 0.0 ok...
+ no. 6 n = 6 nza = 9... num_flops: 0.0 f+s fail residual 1d = 1.3192E-01
- no. 7 n = 7 nza = 17...tcache_thread_shutdown(): unaligned tcache chunk detected

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
num_flops: 0.0 ok...
+ no. 8 n = 8 nza = 17... num_flops: 0.0 f+s fail residual 1d = 2.2301E-02
- no. 9 n = 9 nza = 32...#0 0x7f5eec823960 in ???
#1 0x7f5eec822ac5 in ???
#2 0x7f5eec44251f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x7f5eec4969fc in __pthread_kill_implementation
at ./nptl/pthread_kill.c:44
#4 0x7f5eec4969fc in __pthread_kill_internal
at ./nptl/pthread_kill.c:78
#5 0x7f5eec4969fc in __GI___pthread_kill
at ./nptl/pthread_kill.c:89
#6 0x7f5eec442475 in __GI_raise
at ../sysdeps/posix/raise.c:26
#7 0x7f5eec4287f2 in __GI_abort
at ./stdlib/abort.c:79
#8 0x7f5eec489675 in __libc_message
at ../sysdeps/posix/libc_fatal.c:155
#9 0x7f5eec4a0cfb in malloc_printerr
at ./malloc/malloc.c:5664
#10 0x7f5eec4a56c3 in tcache_thread_shutdown
at ./malloc/malloc.c:3224
#11 0x7f5eec4a56c3 in __malloc_arena_thread_freeres
at ./malloc/arena.c:1003
#12 0x7f5eec49494e in start_thread
at ./nptl/pthread_create.c:456
#13 0x7f5eec52684f in ???
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
#14 0xffffffffffffffff in ???
Aborted (core dumped)
root@spralTest:/home/john/Software/spral2/builddir#

or another run:

=======================
Testing random matrices
=======================
- no. 1 n = 1 nza = 1... num_flops: 0.0 ok...
- no. 2 n = 2 nza = 2... num_flops: 0.0 ok...
- no. 3 n = 3 nza = 4... num_flops: 0.0 f+s fail residual 1d = 9.0189E-02
- no. 4 n = 4 nza = 6... num_flops: 0.0 f+s fail residual 1d = 8.4052E-01
- no. 5 n = 5 nza = 12... num_flops: 0.0 ok...
+ no. 6 n = 6 nza = 9... num_flops: 0.0 f+s fail residual 1d = 1.3192E-01
- no. 7 n = 7 nza = 17... num_flops: 0.0 ok...
+ no. 8 n = 8 nza = 17... num_flops: 0.0 f+s fail residual 1d = 2.2301E-02
- no. 9 n = 9 nza = 32... num_flops: 0.0 ok...
+ no. 10 n = 10 nza = 26... num_flops: 0.0 f+s fail residual 1d = 1.9827E-01
- no. 11 n = 11 nza = 55...free(): invalid next size (fast)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0 0x7f4a42423960 in ???
#1 0x7f4a42422ac5 in ???
#2 0x7f4a4204251f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x7f4a420969fc in __pthread_kill_implementation
at ./nptl/pthread_kill.c:44
#4 0x7f4a420969fc in __pthread_kill_internal
at ./nptl/pthread_kill.c:78
#5 0x7f4a420969fc in __GI___pthread_kill
at ./nptl/pthread_kill.c:89
#6 0x7f4a42042475 in __GI_raise
at ../sysdeps/posix/raise.c:26
#7 0x7f4a420287f2 in __GI_abort
at ./stdlib/abort.c:79
#8 0x7f4a42089675 in __libc_message
at ../sysdeps/posix/libc_fatal.c:155
#9 0x7f4a420a0cfb in malloc_printerr
at ./malloc/malloc.c:5664
#10 0x7f4a420a2a9c in _int_free
at ./malloc/malloc.c:4522
#11 0x7f4a420a5452 in __GI___libc_free
at ./malloc/malloc.c:3391
#12 0x7f4a42858c57 in ???
#13 0x7f4a4285e359 in ???
#14 0x7f4a42866cdb in ???
#15 0x7f4a4286ba07 in ???
#16 0x7f4a4286ee6c in ???
#17 0x556a827175e3 in ???
#18 0x556a82725f56 in ???
#19 0x556a8271029e in ???
#20 0x7f4a42029d8f in __libc_start_call_main
at ../sysdeps/nptl/libc_start_call_main.h:58
#21 0x7f4a42029e3f in __libc_start_main_impl
at ../csu/libc-start.c:392
#22 0x556a827102d4 in ???
#23 0xffffffffffffffff in ???
Aborted (core dumped)
root@spralTest:/home/john/Software/spral2/builddir#

Not sure at this point if it's worth going much further, as presumably the hwloc NUMA node logic wants to be fixed first?

However, in case you want more info here: running the build instructions with meson setup builddir -Dexamples=true -Dtests=true -Dlibblas=openblas -Dliblapack=openblas -Dgpu=true --reconfigure --buildtype=debug to get debug info resulted in ssidst showing up on nvidia-smi as before, but now the tests get further. However, all the CPU cores are lighting up like crazy, and GPU usage still seems insubstantial. So I'm not sure if the debug build is useful for testing GPU code?

...
=======================
Testing random matrices
=======================
- no. 1 n = 1 nza = 1... num_flops: 0.0 ok...
- no. 2 n = 2 nza = 2... num_flops: 0.0 ok...
- no. 3 n = 3 nza = 4... num_flops: 0.0 ok...
+ no. 4 n = 4 nza = 8... num_flops: 0.0 ok...
- no. 5 n = 5 nza = 6... num_flops: 0.0 ok...
+ no. 6 n = 6 nza = 11... num_flops: 0.0 ok...
- no. 7 n = 7 nza = 21... num_flops: 0.0 ok...
+ no. 8 n = 8 nza = 29... num_flops: 0.0 ok...
- no. 9 n = 9 nza = 23... num_flops: 0.0 ok...
+ no. 10 n = 10 nza = 14... num_flops: 0.0 ok...
- no. 11 n = 11 nza = 48... num_flops: 0.0 ok...
+ no. 12 n = 12 nza = 66... num_flops: 0.0 ok...
- no. 13 n = 13 nza = 48... num_flops: 0.0 ok...
+ no. 14 n = 14 nza = 25... num_flops: 0.0 ok...
+ no. 15 n = 15 nza = 106... num_flops: 0.0 ok...
+ no. 16 n = 16 nza = 103... num_flops: 0.0 ok...
+ no. 17 n = 17 nza = 128... num_flops: 0.0 ok...
- no. 18 n = 18 nza = 101... num_flops: 0.0 ok...
- no. 19 n = 19 nza = 57... num_flops: 0.0 ok...
- no. 20 n = 20 nza = 127... num_flops: 0.0 ok...
- no. 21 n = 515 nza = 14222... num_flops: 30.8 ok...
- no. 22 n = 567 nza = 155788... num_flops: 60.8 ok...
+ no. 23 n = 983 nza = 351834... num_flops: 313.8 f+s fail residual 1d = 4.1563E-06
- no. 24 n = 79 nza = 2763... num_flops: 0.2 ok...
+ no. 25 n = 627 nza = 37510... num_flops: 69.6 ok...
+ no. 26 n = 641 nza = 80909... num_flops: 82.4 ok...
+ no. 27 n = 358 nza = 52651... num_flops: 15.1 ok...
- no. 28 n = 203 nza = 10571... num_flops: 2.5 ok...
+ no. 29 n = 87 nza = 2321... num_flops: 0.2 ok...
- no. 30 n = 233 nza = 16296... num_flops: 4.0 ok...
- no. 31 n = 504 nza = 36484... num_flops: 37.7 ok...
+ no. 32 n = 351 nza = 60906... num_flops: 14.5 ok...
+ no. 33 n = 264 nza = 28539... num_flops: 6.0 ok...
+ no. 34 n = 275 nza = 8401... num_flops: 5.2 ok...
- no. 35 n = 788 nza = 13262... num_flops: 88.5 ok...
+ no. 36 n = 604 nza = 69751... num_flops: 68.8 ok...
+ no. 37 n = 456 nza = 74661... num_flops: 31.0 ok...
- no. 38 n = 42 nza = 513... num_flops: 0.0 ok...
- no. 39 n = 794 nza = 8831... num_flops: 69.2 ok...
- no. 40 n = 412 nza = 37445... num_flops: 21.6 ok...
+ no. 41 n = 548 nza = 54047... num_flops: 50.5 ok...
+ no. 42 n = 302 nza = 40691... num_flops: 9.1 ok...
+ no. 43 n = 650 nza = 105746... num_flops: 87.9 ok...
- no. 44 n = 865 nza = 247486... num_flops: 212.8 ok...
+ no. 45 n = 558 nza = 101182... num_flops: 56.6 f+s fail residual 1d = 1.0518E-05
+ no. 46 n = 397 nza = 41803... num_flops: 19.8 ok...
+ no. 47 n = 716 nza = 199676... num_flops: 121.3 ok...
+ no. 48 n = 264 nza = 29965... num_flops: 6.1 ok...
+ no. 49 n = 403 nza = 57856... num_flops: 21.3 ok...
+ no. 50 n = 294 nza = 4105... num_flops: 4.1 ok...
+ no. 51 n = 850 nza = 277323... num_flops: 203.2 ok...
- no. 52 n = 947 nza = 287325... num_flops: 278.2 ok...
+ no. 53 n = 817 nza = 182317... num_flops: 177.4 ok...
+ no. 54 n = 164 nza = 6058... num_flops: 1.2 ok...
- no. 55 n = 883 nza = 381327... num_flops: 229.7 ok...
- no. 56 n = 786 nza = 25896... num_flops: 116.2 f+s fail residual 1d = 1.1859E-05
- no. 57 n = 503 nza = 116472... num_flops: 42.3 ok...
+ no. 58 n = 338 nza = 4409... num_flops: 5.7 ok...
- no. 59 n = 117 nza = 5616... num_flops: 0.5 ok...
- no. 60 n = 588 nza = 44159... num_flops: 59.2 ok...
- no. 61 n = 413 nza = 29136... num_flops: 20.9 ok...
+ no. 62 n = 170 nza = 11659... num_flops: 1.6 ok...
+ no. 63 n = 834 nza = 52334... num_flops: 162.2 ok...
- no. 64 n = 141 nza = 9529... num_flops: 0.9 ok...
- no. 65 n = 434 nza = 86614... num_flops: 27.2 ok...
- no. 66 n = 507 nza = 116449... num_flops: 43.3 ok...
- no. 67 n = 930 nza = 242955... num_flops: 262.1 ok...
- no. 68 n = 799 nza = 50475... num_flops: 142.6 ok...
+ no. 69 n = 997 nza = 158205... num_flops: 311.5 f+s fail residual 1d = 8.6189E-06
- no. 70 n = 247 nza = 9624... num_flops: 4.1 ok...
- no. 71 n = 488 nza = 3036... num_flops: 7.3 ok...
- no. 72 n = 69 nza = 2279... num_flops: 0.1 ok...
+ no. 73 n = 14 nza = 82... num_flops: 0.0 ok...
+ no. 74 n = 129 nza = 4316... num_flops: 0.6 ok...
- no. 75 n = 40 nza = 74... num_flops: 0.0 ok...
+ no. 76 n = 153 nza = 5185... num_flops: 1.0 ok...
+ no. 77 n = 879 nza = 264837... num_flops: 223.1 ok...
- no. 78 n = 449 nza = 13542... num_flops: 21.4 ok...
+ no. 79 n = 520 nza = 20941... num_flops: 35.9 ok...
- no. 80 n = 923 nza = 26956... num_flops: 183.1 ok...
- no. 81 n = 413 nza = 32311... num_flops: 21.3 ok...
+ no. 82 n = 759 nza = 188728... num_flops: 143.7 ok...
+ no. 83 n = 155 nza = 9814... num_flops: 1.2 ok...
- no. 84 n = 700 nza = 156429... num_flops: 112.2 ok...
- no. 85 n = 709 nza = 114557... num_flops: 113.8 ok...
+ no. 86 n = 128 nza = 5555... num_flops: 0.6 ok...
- no. 87 n = 801 nza = 82565... num_flops: 155.7 ok...
+ no. 88 n = 201 nza = 12906... num_flops: 2.5 ok...
+ no. 89 n = 24 nza = 245... num_flops: 0.0 ok...
- no. 90 n = 136 nza = 6855... num_flops: 0.8 ok...
+ no. 91 n = 905 nza = 254164... num_flops: 243.2 ok...
+ no. 92 n = 932 nza = 311521... num_flops: 267.1 f+s fail residual 1d = 1.0784E-06
- no. 93 n = 572 nza = 64007... num_flops: 58.1 ok...
- no. 94 n = 474 nza = 56416... num_flops: 33.8 ok...
- no. 95 n = 817 nza = 299142... num_flops: 181.4 ok...
- no. 96 n = 728 nza = 98680... num_flops: 121.0 f+s fail residual 1d = 3.5315E-06
- no. 97 n = 102 nza = 1086... num_flops: 0.2 ok...
- no. 98 n = 368 nza = 15273... num_flops: 13.2 ok...
- no. 99 n = 528 nza = 102200... num_flops: 48.4 ok...
+ no. 100 n = 580 nza = 86174... num_flops: 62.6 f+s fail residual 1d = 5.7651E-06

================================
Testing random matrices (scaled)
================================
* no. 1 n = 1 nza = 1 scal = 3... num_flops: 0.0 ok...ok...
* no. 2 n = 2 nza = 2 scal = 3... num_flops: 0.0 ok...ok...
* no. 3 n = 3 nza = 4 scal = 3... num_flops: 0.0 ok...ok...
* no. 4 n = 4 nza = 6 scal = 0... num_flops: 0.0 ok...ok...
* no. 5 n = 5 nza = 12 scal = 1... num_flops: 0.0 ok...ok...
* no. 6 n = 6 nza = 18 scal = 0... num_flops: 0.0 ok...ok...
* no. 7 n = 7 nza = 14 scal = 2... num_flops: 0.0 ok...ok...
* no. 8 n = 8 nza = 25 scal = 3... num_flops: 0.0 ok...ok...
* no. 9 n = 9 nza = 15 scal = 2... num_flops: 0.0 ok...ok...
* no. 10 n = 10 nza = 28 scal = 0... num_flops: 0.0 ok...ok...
* no. 11 n = 11 nza = 54 scal = 4... num_flops: 0.0 ok...ok...
* no. 12 n = 12 nza = 30 scal = 3... num_flops: 0.0 ok...ok...
* no. 13 n = 13 nza = 27 scal = 2... num_flops: 0.0 ok...ok...
* no. 14 n = 14 nza = 91 scal = 3... num_flops: 0.0 ok...ok...
* no. 15 n = 15 nza = 89 scal = 2... num_flops: 0.0 ok...ok...
* no. 16 n = 16 nza = 78 scal = 2... num_flops: 0.0 ok...ok...
* no. 17 n = 17 nza = 50 scal = 3... num_flops: 0.0 ok...ok...
* no. 18 n = 18 nza = 72 scal = 1... num_flops: 0.0 ok...ok...
* no. 19 n = 19 nza = 61 scal = 3... num_flops: 0.0 ok...ok...
* no. 20 n = 20 nza = 192 scal = 3... num_flops: 0.0 ok...ok...
* no. 21 n = 124 nza = 2411 scal = 3... num_flops: 0.6 ok...ok...
* no. 22 n = 474 nza = 43026 scal = 3... num_flops: 34.1 ok...ok...
* no. 23 n = 117 nza = 879 scal = 3... num_flops: 0.3 ok...ok...
* no. 24 n = 316 nza = 16096 scal = 3... num_flops: 10.5 ok...ok...
* no. 25 n = 337 nza = 6829 scal = 1... num_flops: 8.8 ok...ok...
* no. 26 n = 188 nza = 12614 scal = 1... num_flops: 2.2 ok...ok...
* no. 27 n = 19 nza = 88 scal = 3... num_flops: 0.0 ok...ok...
* no. 28 n = 3 nza = 4 scal = 1... num_flops: 0.0 ok...ok...
* no. 29 n = 407 nza = 81243 scal = 4... num_flops: 22.6 ok...ok...
* no. 30 n = 488 nza = 17448 scal = 0... num_flops: 35.6 ok...ok...
* no. 31 n = 111 nza = 1096 scal = 2... num_flops: 0.4 ok...ok...
* no. 32 n = 354 nza = 61816 scal = 2... num_flops: 14.9 ok...ok...
* no. 33 n = 198 nza = 3972 scal = 2... num_flops: 2.4 ok...ok...
* no. 34 n = 65 nza = 1008 scal = 2... num_flops: 0.1 ok...ok...
* no. 35 n = 94 nza = 4163 scal = 4... num_flops: 0.3 ok...ok...
* no. 36 n = 474 nza = 95490 scal = 3... num_flops: 35.6 ok...ok...
* no. 37 n = 325 nza = 50545 scal = 0... num_flops: 11.5 ok...ok...
* no. 38 n = 162 nza = 12378 scal = 3... num_flops: 1.4 ok...ok...
* no. 39 n = 484 nza = 33682 scal = 0... num_flops: 34.2 ok...ok...
* no. 40 n = 162 nza = 9830 scal = 2... num_flops: 1.4 ok...ok...
* no. 41 n = 362 nza = 63426 scal = 1... num_flops: 15.9 ok...ok...
* no. 42 n = 251 nza = 24079 scal = 1... num_flops: 5.3 ok...ok...
* no. 43 n = 264 nza = 14405 scal = 3... num_flops: 6.0 ok...ok...
* no. 44 n = 434 nza = 65176 scal = 0... num_flops: 27.1 ok...ok...
* no. 45 n = 136 nza = 5431 scal = 0... num_flops: 0.8 ok...ok...
* no. 46 n = 131 nza = 7183 scal = 4... num_flops: 0.7 ok...ok...
* no. 47 n = 399 nza = 65230 scal = 0... num_flops: 21.2 ok...ok...
* no. 48 n = 424 nza = 18273 scal = 2... num_flops: 24.0 ok...ok...
* no. 49 n = 427 nza = 2525 scal = 2... num_flops: 5.5 ok...ok...
* no. 50 n = 345 nza = 58137 scal = 3... num_flops: 13.7 ok...ok...
* no. 51 n = 497 nza = 34866 scal = 4... num_flops: 39.5 ok...ok...
* no. 52 n = 230 nza = 21796 scal = 3... num_flops: 4.1 ok...ok...
* no. 53 n = 261 nza = 14531 scal = 1... num_flops: 5.7 ok...ok...
* no. 54 n = 407 nza = 53876 scal = 2... num_flops: 22.5 ok...ok...
* no. 55 n = 293 nza = 5280 scal = 3... num_flops: 6.4 ok...ok...
* no. 56 n = 10 nza = 44 scal = 2... num_flops: 0.0 ok...ok...
* no. 57 n = 337 nza = 48834 scal = 0... num_flops: 12.8 ok...ok...
* no. 58 n = 316 nza = 4741 scal = 2... num_flops: 8.8 ok...ok...
* no. 59 n = 463 nza = 97888 scal = 0... num_flops: 33.1 ok...ok...
* no. 60 n = 217 nza = 21784 scal = 3... num_flops: 3.4 ok...ok...
* no. 61 n = 438 nza = 94063 scal = 3... num_flops: 28.1 ok...ok...
* no. 62 n = 19 nza = 67 scal = 3... num_flops: 0.0 ok...ok...
* no. 63 n = 37 nza = 386 scal = 1... num_flops: 0.0 ok...ok...
* no. 64 n = 113 nza = 3488 scal = 0... num_flops: 0.5 ok...ok...
* no. 65 n = 67 nza = 1515 scal = 2... num_flops: 0.1 ok...ok...
* no. 66 n = 481 nza = 2447 scal = 3... num_flops: 10.4 ok...ok...
* no. 67 n = 278 nza = 33399 scal = 4... num_flops: 7.2 ok...ok...
* no. 68 n = 473 nza = 54681 scal = 3... num_flops: 35.4 ok...ok...
* no. 69 n = 381 nza = 55624 scal = 3... num_flops: 18.4 ok...ok...
* no. 70 n = 322 nza = 14769 scal = 2... num_flops: 10.8 ok...ok...
* no. 71 n = 317 nza = 30213 scal = 3... num_flops: 10.7 ok...ok...
* no. 72 n = 199 nza = 11082 scal = 0... num_flops: 2.6 ok...ok...
* no. 73 n = 115 nza = 4790 scal = 1... num_flops: 0.5 ok...ok...
* no. 74 n = 394 nza = 56672 scal = 3... num_flops: 20.4 ok...ok...
* no. 75 n = 15 nza = 29 scal = 3... num_flops: 0.0 ok...ok...
* no. 76 n = 337 nza = 38308 scal = 3... num_flops: 12.8 ok...ok...
* no. 77 n = 285 nza = 8146 scal = 3... num_flops: 7.7 ok...ok...
* no. 78 n = 164 nza = 3306 scal = 0... num_flops: 1.4 ok...ok...
* no. 79 n = 454 nza = 67910 scal = 4... num_flops: 31.1 ok...ok...
* no. 80 n = 224 nza = 7980 scal = 1... num_flops: 3.7 ok...ok...
* no. 81 n = 207 nza = 20485 scal = 1... num_flops: 3.0 ok...ok...
* no. 82 n = 294 nza = 36517 scal = 3... num_flops: 8.5 ok...ok...
* no. 83 n = 196 nza = 2624 scal = 4... num_flops: 2.2 ok...ok...
* no. 84 n = 142 nza = 7420 scal = 1... num_flops: 1.0 ok...ok...
* no. 85 n = 114 nza = 3658 scal = 4... num_flops: 0.4 ok...ok...
* no. 86 n = 290 nza = 17713 scal = 4... num_flops: 7.4 ok...ok...
* no. 87 n = 190 nza = 3800 scal = 4... num_flops: 2.2 ok...ok...
* no. 88 n = 420 nza = 34281 scal = 3... num_flops: 24.5 ok...ok...
* no. 89 n = 386 nza = 58993 scal = 0... num_flops: 19.2 ok...ok...
* no. 90 n = 311 nza = 36244 scal = 3... num_flops: 10.1 ok...ok...
* no. 91 n = 98 nza = 2421 scal = 2... num_flops: 0.3 ok...ok...
* no. 92 n = 101 nza = 1081 scal = 1... num_flops: 0.3 ok...ok...
* no. 93 n = 236 nza = 18692 scal = 0... num_flops: 4.4 ok...ok...
* no. 94 n = 317 nza = 48559 scal = 3... num_flops: 10.7 ok...ok...
* no. 95 n = 200 nza = 13183 scal = 0... num_flops: 2.6 ok...ok...
* no. 96 n = 157 nza = 7607 scal = 2... num_flops: 1.3 ok...ok...
* no. 97 n = 37 nza = 141 scal = 1... num_flops: 0.0 ok...ok...
* no. 98 n = 490 nza = 92916 scal = 1... num_flops: 39.2 ok...ok...
* no. 99 n = 51 nza = 121 scal = 2... num_flops: 0.0 ok...ok...
* no. 100 n = 378 nza = 66946 scal = 0... num_flops: 18.1 ok...ok...

==================
Testing big matrix
==================
* n = 2000 nza = 10000... num_flops: 492.4 ok...
==========================
Total number of errors = 7
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
STOP 1

jfowkes commented 7 months ago

Many thanks for the very detailed bug report! It does indeed look like SPRAL is broken on GPU…

@AndrewLister-STFC @haldaas @tyronerees what do you guys think?