Performance on Windows is worse than preprint

LiuZhexuan commented 1 month ago

I am using the latest precompiled binary of SSIDS. I tested the matrices ND/nd3k and PARSEC/Si10H16, and the fact+solve times are around 1.5s/4.5s respectively. But in the SSIDS preprint "A sparse symmetric indefinite direct solver for GPU architectures", these two matrices are solved in less than 0.5 seconds on two E5-2687W CPUs. I am using an i9-14900K and all cores are at full load while running. Considering this is a 24-core CPU released about 10 years later, I think the time should be lower. Can someone help test the same matrices on a similar platform? Thanks!

LiuZhexuan commented 1 month ago

I tried spral_ssids.exe on these two matrices just now. The console shows that the times for ND/nd3k and PARSEC/Si10H16 are ~0.5s/1.1s respectively. Below is the console output:

    D:\Desktop\111\bin> ./spral_ssids nd3k.rb --scale=auction --nrhs 2
    Set scaling to Auction
    solving for 2 right-hand sides
    Reading 'nd3k.rb'... ok
    Forcing topology to 32
    Using 0 GPUs
    Used order 1 ok
    Analyse took 0.405999988
    Predict nfact = 1.49E+07
    Predict nflop = 2.99E+10
    nparts 1
    cpu_fl 2.99E+10
    gpu_fl 0.00E+00
    Factorize... ok
    Factor took 0.485000014
    Solve... ok
    Solve took 3.09999995E-02
    number bad cmp = 0
    fwd error || ||_inf = 5.9573790345268662E-011
    bwd error scaled = 6.3086194706343708E-016 6.3086194706343708E-016
    cmp: SMFCT
    anal: 0.41
    fact: 0.49
    afact: 1.49E+07
    aflop: 2.99E+10
    nfact: 1.49E+07
    nflop: 2.99E+10
    delay: 0
    inerti 0 0 9000
    2x2piv 454
    maxfro 3083
    maxsup 2231
    not_fi 0
    not_se 0

    D:\Desktop\111\bin> ./spral_ssids Si10H16.rb --scale=auction --nrhs 1
    Set scaling to Auction
    solving for 1 right-hand sides
    Reading 'Si10H16.rb'... ok
    Forcing topology to 32
    Using 0 GPUs
    Used order 1 ok
    Analyse took 0.296999991
    Predict nfact = 3.18E+07
    Predict nflop = 8.49E+10
    nparts 1
    cpu_fl 8.49E+10
    gpu_fl 0.00E+00
    Factorize... ok
    Factor took 1.01600003
    Solve... ok
    Solve took 3.09999995E-02
    number bad cmp = 0
    fwd error || ||_inf = 5.3225091001252167E-011
    bwd error scaled = 9.7885138181843040E-013
    cmp: SMFCT
    anal: 0.30
    fact: 1.02
    afact: 3.18E+07
    aflop: 8.49E+10
    nfact: 3.18E+07
    nflop: 8.49E+10
    delay: 0
    inerti 41 0 17036
    2x2piv 904
    maxfro 4448
    maxsup 3418
    not_fi 0
    not_se 0
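To put these timings in perspective, it can help to convert them into an effective factorization rate by dividing the predicted flop count (`Predict nflop`) by the `Factor took` time. The sketch below does this for the two runs above. The function name `effective_gflops` is illustrative only, and the result is a rough figure rather than a measured hardware rate.

```python
def effective_gflops(nflop: float, factor_seconds: float) -> float:
    """Return the effective factorization rate in GFLOP/s."""
    if factor_seconds <= 0:
        raise ValueError("factor time must be positive")
    return nflop / factor_seconds / 1e9


# Values copied from the spral_ssids.exe console output above:
# (predicted nflop, "Factor took" time in seconds).
runs = {
    "ND/nd3k": (2.99e10, 0.485000014),
    "PARSEC/Si10H16": (8.49e10, 1.01600003),
}

rates = {name: effective_gflops(nflop, t) for name, (nflop, t) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} GFLOP/s")
```

Comparing this rate between two builds on the same machine gives a quick, matrix-independent way to see how far apart they are.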

jfowkes commented 1 month ago

The precompiled SSIDS binaries are not optimised and are not intended to be; that would be impossible. Please compile your own version of SPRAL if you would like optimised performance, as you can then use optimised BLAS/LAPACK etc.

LiuZhexuan commented 1 month ago

Thanks for the reply! I will try building with Meson later. But why is the performance of the spral_ssids.exe executable so much better? I think that exe uses the same DLL as my MSVC build.

jfowkes commented 1 month ago

Indeed, that is very strange. Is it possible that they are using different default SSIDS options/settings?

LiuZhexuan commented 1 month ago

I used the default settings for both my MSVC build and spral_ssids.exe. Attached is my code: main.zip

jfowkes commented 1 month ago

Right, but are you sure the default options are the same for both? In the spral_ssids.exe example above you're passing in --scale=auction for auction scaling, but the default is to use no scaling when spral_ssids_default_options is called: https://ralna.github.io/spral/_build/html/C/ssids.html#derived-types

LiuZhexuan commented 1 month ago

I tried again without passing --scale=auction, and the time is almost the same. Below are two test runs on an R5-5600X. Factorization through my MSVC build is still much slower (~4.5s) than the exe on this computer.

    PS E:\test> .\spral_ssids.exe Si10H16.rb
    Reading 'Si10H16.rb'... ok
    Forcing topology to 12
    Using 0 GPUs
    Used order 1 ok
    Analyse took 0.375000000
    Predict nfact = 3.18E+07
    Predict nflop = 8.49E+10
    nparts 1
    cpu_fl 8.49E+10
    gpu_fl 0.00E+00
    Factorize... ok
    Factor took 1.71899998
    Solve... ok
    Solve took 1.60000008E-02
    number bad cmp = 0
    fwd error || ||_inf = 6.3399951955034339E-011
    bwd error scaled = 1.8274520805995796E-012
    cmp: SMFCT
    anal: 0.38
    fact: 1.72
    afact: 3.18E+07
    aflop: 8.49E+10
    nfact: 3.18E+07
    nflop: 8.49E+10
    delay: 0
    inerti 41 0 17036
    2x2piv 904
    maxfro 4448
    maxsup 3418
    not_fi 0
    not_se 0

    PS E:\test> .\spral_ssids.exe Si10H16.rb --scale=auction
    Set scaling to Auction
    Reading 'Si10H16.rb'... ok
    Forcing topology to 12
    Using 0 GPUs
    Used order 1 ok
    Analyse took 0.360000014
    Predict nfact = 3.18E+07
    Predict nflop = 8.49E+10
    nparts 1
    cpu_fl 8.49E+10
    gpu_fl 0.00E+00
    Factorize... ok
    Factor took 1.71800005
    Solve... ok
    Solve took 1.60000008E-02
    number bad cmp = 0
    fwd error || ||_inf = 3.3993030612577968E-011
    bwd error scaled = 1.7394675123364451E-012
    cmp: SMFCT
    anal: 0.36
    fact: 1.72
    afact: 3.18E+07
    aflop: 8.49E+10
    nfact: 3.18E+07
    nflop: 8.49E+10
    delay: 0
    inerti 41 0 17036
    2x2piv 904
    maxfro 4448
    maxsup 3418
    not_fi 0
    not_se 0
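When comparing several runs like these, the phase timings can be pulled out of the console output mechanically. The following is a minimal sketch that assumes the `Analyse took`, `Factor took`, and `Solve took` entries appear exactly as in the output above. The helper name `parse_timings` is hypothetical and not part of SPRAL.

```python
import re

# Matches e.g. "Factor took 1.71899998" or "Solve took 1.60000008E-02".
_TIMING = re.compile(r"(Analyse|Factor|Solve) took\s+([0-9.]+(?:[eE][+-]?\d+)?)")


def parse_timings(log: str) -> dict[str, float]:
    """Extract phase timings in seconds, keyed by lower-case phase name."""
    return {phase.lower(): float(value) for phase, value in _TIMING.findall(log)}


# Excerpt of the first R5-5600X run shown above.
log = (
    "Analyse took 0.375000000 Predict nfact = 3.18E+07 "
    "Factorize... ok Factor took 1.71899998 "
    "Solve... ok Solve took 1.60000008E-02"
)
timings = parse_timings(log)
print(timings)
```

If a log contains more than one run, each phase key keeps only the last value seen, so it is simplest to parse one run at a time.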

jfowkes commented 1 month ago

Interesting. spral_ssids.exe is compiled with MinGW's gfortran compiler, not MSVC, so that may be the cause.

amontoison commented 1 month ago

Note that the library libspral.dll provided by the artifact is also compiled with GCC / GFortran / G++ compilers and MinGW. Precompiled artifacts should have optimized performance for a given architecture (x64, AArch64, etc.), but not for a specific microarchitecture (specific CPU model). Except for BLAS / LAPACK, I don't think we see differences in practice.

LiuZhexuan commented 1 month ago

The BLAS provided with the precompiled library is OpenBLAS, which is also used by spral_ssids.exe.

ralna / spral

Performance on Windows is worse than preprint #202