Open LiuZhexuan opened 1 month ago
I just ran spral_ssids.exe on these two matrices. The console shows times of about 0.5 s for ND/nd3k and 1.1 s for PARSEC/Si10H16. The console output is below:
D:\Desktop\111\bin> ./spral_ssids nd3k.rb --scale=auction --nrhs 2
Set scaling to Auction
solving for 2 right-hand sides
Reading 'nd3k.rb'... ok
Forcing topology to 32
Using 0 GPUs
Used order 1
ok
Analyse took 0.405999988
Predict nfact = 1.49E+07
Predict nflop = 2.99E+10
nparts 1
cpu_fl 2.99E+10
gpu_fl 0.00E+00
Factorize... ok
Factor took 0.485000014
Solve... ok
Solve took 3.09999995E-02
number bad cmp = 0
fwd error || ||_inf = 5.9573790345268662E-011
bwd error scaled = 6.3086194706343708E-016 6.3086194706343708E-016
cmp: SMFCT anal: 0.41 fact: 0.49
afact: 1.49E+07 aflop: 2.99E+10 nfact: 1.49E+07 nflop: 2.99E+10
delay: 0 inerti 0 0 9000 2x2piv 454 maxfro 3083 maxsup 2231 not_fi 0 not_se 0
D:\Desktop\111\bin> ./spral_ssids Si10H16.rb --scale=auction --nrhs 1
Set scaling to Auction
solving for 1 right-hand sides
Reading 'Si10H16.rb'... ok
Forcing topology to 32
Using 0 GPUs
Used order 1
ok
Analyse took 0.296999991
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize... ok
Factor took 1.01600003
Solve... ok
Solve took 3.09999995E-02
number bad cmp = 0
fwd error || ||_inf = 5.3225091001252167E-011
bwd error scaled = 9.7885138181843040E-013
cmp: SMFCT anal: 0.30 fact: 1.02
afact: 3.18E+07 aflop: 8.49E+10 nfact: 3.18E+07 nflop: 8.49E+10
delay: 0 inerti 41 0 17036 2x2piv 904 maxfro 4448 maxsup 3418 not_fi 0 not_se 0
The precompiled SSIDS binaries are not optimised and are not intended to be; that would be impossible. If you would like optimised performance, please compile your own version of SPRAL; you can then use an optimised BLAS/LAPACK, etc.
Thanks for the reply! I will try building with Meson later. But why is the performance of the spral_ssids.exe executable so much better? I think that exe uses the same DLL as my MSVC build.
Indeed, that is very strange. Is it possible that they are using different default SSIDS options/settings?
I used the default setting both for MSVC and spral_ssids.exe. Attached is my code: main.zip
Right, but are you sure the default options are the same for both? In the spral_ssids.exe example above you're passing in --scale=auction for auction scaling. But the default is to use no scaling when spral_ssids_default_options is called: https://ralna.github.io/spral/_build/html/C/ssids.html#derived-types
I tried again without passing --scale=auction, and the time is almost the same. Below are two test runs on an R5-5600X. Factorization with MSVC is still much slower than the exe on this computer (~4.5 s).
PS E:\test> .\spral_ssids.exe Si10H16.rb
Reading 'Si10H16.rb'... ok
Forcing topology to 12
Using 0 GPUs
Used order 1
ok
Analyse took 0.375000000
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize... ok
Factor took 1.71899998
Solve... ok
Solve took 1.60000008E-02
number bad cmp = 0
fwd error || ||_inf = 6.3399951955034339E-011
bwd error scaled = 1.8274520805995796E-012
cmp: SMFCT anal: 0.38 fact: 1.72
afact: 3.18E+07 aflop: 8.49E+10 nfact: 3.18E+07 nflop: 8.49E+10
delay: 0 inerti 41 0 17036 2x2piv 904 maxfro 4448 maxsup 3418 not_fi 0 not_se 0

PS E:\test> .\spral_ssids.exe Si10H16.rb --scale=auction
Set scaling to Auction
Reading 'Si10H16.rb'... ok
Forcing topology to 12
Using 0 GPUs
Used order 1
ok
Analyse took 0.360000014
Predict nfact = 3.18E+07
Predict nflop = 8.49E+10
nparts 1
cpu_fl 8.49E+10
gpu_fl 0.00E+00
Factorize... ok
Factor took 1.71800005
Solve... ok
Solve took 1.60000008E-02
number bad cmp = 0
fwd error || ||_inf = 3.3993030612577968E-011
bwd error scaled = 1.7394675123364451E-012
cmp: SMFCT anal: 0.36 fact: 1.72
afact: 3.18E+07 aflop: 8.49E+10 nfact: 3.18E+07 nflop: 8.49E+10
delay: 0 inerti 41 0 17036 2x2piv 904 maxfro 4448 maxsup 3418 not_fi 0 not_se 0
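To put the runs in this thread on a common scale, one option is to convert each log into an achieved factorization rate: predicted flops divided by factor time. The sketch below is only illustrative. `factor_gflops` and the labels in `runs` are hypothetical names, and each log string is abbreviated to the two fields the function reads, with values copied from the console output in this thread. It assumes the predicted flop count matches the actual one, which appears to hold here because the summary line reports the same value for aflop and nflop.

```python
import re

# Number format used in the console output, e.g. 2.99E+10 or 0.485000014
_NUM = r"([0-9.]+(?:E[+-]?[0-9]+)?)"


def factor_gflops(log: str) -> float:
    """Achieved factorization rate in GFLOP/s, from spral_ssids console text.

    Reads the 'Predict nflop = ...' and 'Factor took ...' fields.
    """
    nflop = float(re.search(r"Predict nflop\s*=\s*" + _NUM, log).group(1))
    seconds = float(re.search(r"Factor took\s+" + _NUM, log).group(1))
    return nflop / seconds / 1e9


# Abbreviated logs: only the two fields used above, copied from the thread.
runs = {
    "nd3k, topology 32": "Predict nflop = 2.99E+10 Factor took 0.485000014",
    "Si10H16, topology 32": "Predict nflop = 8.49E+10 Factor took 1.01600003",
    "Si10H16, topology 12": "Predict nflop = 8.49E+10 Factor took 1.71899998",
}

rates = {label: factor_gflops(text) for label, text in runs.items()}

for label, rate in rates.items():
    print(f"{label}: {rate:.1f} GFLOP/s")
```

On these numbers the nd3k run comes out near 62 GFLOP/s, and the two Si10H16 runs near 84 and 49 GFLOP/s. Applying the same function to the output of the MSVC-built program would give a like-for-like figure, provided it prints the same two fields.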
Interesting, the spral_ssids.exe is compiled using MinGW's gfortran compiler and not MSVC, so it may be that.
Note that the library libspral.dll provided by the artifact is also compiled with GCC / GFortran / G++ compilers and MinGW.
Precompiled artifacts should have optimized performance for a given architecture (x64, AArch64, etc.), but not for a specific microarchitecture (specific CPU model).
Except for BLAS / LAPACK, I don't think we see differences in practice.
The BLAS provided with the precompiled library is OpenBLAS, which is also used by spral_ssids.exe.
I am using the latest precompiled binary of SSIDS. I tested the matrices ND/nd3k and PARSEC/Si10H16, and the factorize + solve times are around 1.5 s and 4.5 s respectively. But in the SSIDS preprint "A sparse symmetric indefinite direct solver for GPU architectures", these two matrices are solved in less than 0.5 seconds on two E5-2687W CPUs. I am using an i9-14900K, and all cores are at full load while running. Considering that this CPU is about 10 years newer and has 24 cores, I think the time should be lower. Can someone help by testing the same matrices on a similar platform? Thanks!