pghysels / STRUMPACK

Structured Matrix Package (LBNL)
http://portal.nersc.gov/project/sparse/strumpack/

Test Failure on MacOS -- SPARSE_HSS_mpi_21 #123

Open TobiasDuswald opened 2 months ago

TobiasDuswald commented 2 months ago

Dear Developers,

I tried installing STRUMPACK on macOS 14.5 with Xcode 15.4 on an M3 Pro (arm64). I wanted to inform you that, on my system, test 119, SPARSE_HSS_mpi_21, fails while all other tests pass. I have tried releases v7.1.0, v7.1.4, and the latest master.

I installed the software and ran the tests with

mkdir build && mkdir install
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -Dmetis_PREFIX=../../metis-5.1.0 ..
make -j && make install
make examples -j
ctest --output-on-failure
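
For reference, the failing test can also be rerun on its own with something like:

ctest -R SPARSE_HSS_mpi_21 --output-on-failure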

Below is the output of the failing test. If you need more information, please let me know.

119/133 Test #119: SPARSE_HSS_mpi_21 ................***Failed    0.85 sec
# Running with:
# OMP_NUM_THREADS=1 mpirun -n 2 /Users/tobiasduswald/Software/STRUMPACK/build/test/test_sparse_mpi utm300/utm300.mtx --sp_compression HSS --hss_leaf_size 4 --hss_rel_tol 1e-1 --hss_abs_tol 1e-10 --hss_d0 16 --hss_dd 8 --sp_reordering_method metis --sp_compression_min_sep_size 25 
# opening file 'utm300/utm300.mtx'
# %%MatrixMarket matrix coordinate real general
# reading 300 by 300 matrix with 3,155 nnz's from utm300/utm300.mtx
# Initializing STRUMPACK
# using 1 OpenMP thread(s)
# using 2 MPI processes
# matching job: maximum matching with row and column scaling
# matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
# Matrix padded with zeros to get symmetric pattern.
# Number of nonzeros increased from 3,155 to 4,754.
# initial matrix:
#   - number of unknowns = 300
#   - number of nonzeros = 4,754
# nested dissection reordering:
#   - Metis reordering
#      - used METIS_NodeND (iso METIS_NodeNDP)
#      - supernodal tree was built from etree
#   - strategy parameter = 8
#   - number of separators = 153
#   - number of levels = 14
#   - nd time = 0.00114608
#   - matching time = 0.000522137
#   - symmetrization time = 8.4877e-05
# symbolic factorization:
#   - nr of dense Frontal matrices = 152
#   - nr of HSS Frontal matrices = 1
#   - symb-factor time = 0.000423908
#   - sep-reorder time = 0.000329971
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 0.095376 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.05367e-08
#   - replacing of small pivots is not enabled
#   - factor time = 0.000829935
#   - factor nonzeros = 21,142
#   - factor memory = 0.169136 MB
#   - compression = hss
#   - factor memory/nonzeros = 177.336 % of multifrontal
#   - maximum HSS rank = 5
#   - HSS relative compression tolerance = 0.1
#   - HSS absolute compression tolerance = 1e-10
#   - normal(0,1) distribution with minstd_rand engine
GMRES it. 0 res =      95.4366  rel.res =            1   restart!
GMRES it. 1 res =      3.85148  rel.res =    0.0403564
GMRES it. 2 res =      1.54211  rel.res =    0.0161585
GMRES it. 3 res =     0.060384  rel.res =  0.000632713
GMRES it. 4 res =    0.0215773  rel.res =   0.00022609
GMRES it. 5 res =  0.000985069  rel.res =  1.03217e-05
GMRES it. 6 res =  3.38379e-05  rel.res =  3.54559e-07
# DIRECT/GMRES solve:
#   - abs_tol = 1e-10, rel_tol = 1e-06, restart = 30, maxit = 5000
#   - number of Krylov iterations = 6
#   - solve time = 0.00135398
# COMPONENTWISE SCALED RESIDUAL = 0.00201743
# RELATIVE ERROR = 0.000996645
residual too large
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[8847,1],0]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
prterun has exited due to process rank 0 with PID 0 on node w203-1b-v4 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
pghysels commented 2 months ago

Thank you for reporting. It looks like this is not really a bug, but rather an issue with the tolerances for this specific test problem. You might get different results when running again, because the METIS ordering can change.
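
If you want to experiment, you could rerun the test driver by hand with the same options shown in the test output but a tighter HSS relative tolerance, for example (adjusting the path to your build tree):

OMP_NUM_THREADS=1 mpirun -n 2 ./test/test_sparse_mpi utm300/utm300.mtx --sp_compression HSS --hss_leaf_size 4 --hss_rel_tol 1e-2 --hss_abs_tol 1e-10 --hss_d0 16 --hss_dd 8 --sp_reordering_method metis --sp_compression_min_sep_size 25

With a tighter compression tolerance the factorization preconditioner is more accurate, so the scaled residual should come down; the 1e-1 used by this test is quite loose for this matrix.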

TobiasDuswald commented 2 months ago

Alright, thanks for the info!