Closed hufngvuowng closed 1 year ago
It is hard to say without more information. Can you tell us what compiler, operating system and boost version you are using?
Sandeep.
On Wed, May 24, 2023 at 7:17 PM Hung Vuong @.***> wrote:
Hello,
I followed the instructions in the README to compile both Boost and Dice (with mpicxx), but running SHCI/runTests.sh yielded this segfault on the restart test. I don't get this error with mpirun -np 1, but any larger number of processes triggers it. For reference, I set both USE_INTEL and HAS_AVX2 to False and disabled the MKL include and lib in the Makefile.
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: (nil)
[ 0] /lib64/libpthread.so.0(+0x12c20)[0x155553a0dc20]
[ 1] ../../../bin/Dice[0x4eb92a]
[ 2] ../../../bin/Dice[0x4f74eb]
[ 3] ../../../bin/Dice[0x41c548]
[ 4] /lib64/libc.so.6(__libc_start_main+0xf3)[0x155553659493]
[ 5] ../../../bin/Dice[0x41ff3e]
Do you know what can cause this error? Thanks!
Hi Sandeep,
I'm using g++ 10.2.0, Boost 1.80.0 on Rocky Linux 8.4.
Hung
Are you able to run other programs using MPI? Also, do you know which example in the test suite is failing? Can you run that input explicitly on the command line and see what the output is?
Sandeep.
Hi Sandeep,
I'm able to run other MPI programs. The segfault errors come from the restart tests (tests/SHCI/restart and tests/SHCI/restart_trev), specifically for input3.dat (input2.dat ran fine):
mpirun -np 2 ../../../bin/Dice input3.dat | tee output3.dat
This was printed to output3.dat when I ran the above command in the tests' directories:
SELECTING REFERENCE DETERMINANT(S)
**************************************************************
2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Given Ref. Energy: -108.9541250311
**************************************************************
VARIATIONAL STEP
**************************************************************
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: (nil)
0
build det to index
starts to precondition
[ 0] /lib64/libpthread.so.0(+0x12c20)[0x155553a0dc20]
[ 1] ../../../bin/Dice[0x4eb92a]
[ 2] ../../../bin/Dice[0x4f74eb]
[ 3] ../../../bin/Dice[0x41c548]
[ 4] /lib64/libc.so.6(__libc_start_main+0xf3)[0x155553659493]
[ 5] ../../../bin/Dice[0x41ff3e]
*** End of error message ***
It ran with -np 1 in mpirun, though. Additionally, the full restart test (tests/SHCI/full_restart/) also failed, despite not producing a segfault.
Hung
Interesting, so it is perhaps related to some conflict between boost_serialization and boost_mpi. I have not seen this error before. Besides the restart tests, do all the others work?
Sandeep.
Yes, all other tests passed. Do you think this has something to do with the Boost version? We also have some bus error issues running Dice on our local cluster (as in issue #9).
It is hard to say for sure without looking into it in more detail. On our cluster we have much older versions of gcc and Boost, so it is possible a bug has crept in with the newer compilers/Boost. Are you at Columbia? Maybe you can ask Ankit, who is in Dave's group, to have a look, if he also has access to the cluster you are working on. The bus error usually happens when you rewrite or remove the executable while the code is running, e.g. you are recompiling and trying a few things while tests are running.
Sandeep.
Hi Sandeep,
Thanks for letting me know. I'll contact Ankit to see if he knows what's going on.
For the bus error, I didn't recompile or alter the Dice executable or Boost. For example, when launching MPI tasks on one node, I get the bus error on one task while the other keeps running. I also checked with top, and there was only one process running. (We're using openmpi/4.1.1.) I can give you the input and Slurm submission script if that helps.
Hung
Hi @sanshar! I'm closing this issue because Ankit helped us resolve it the other day. Basically, the schedule struct was missing cdfci_on and cdfciTol (SHCI/input.h line 107), so these weren't passed properly during the MPI broadcast.
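To illustrate why a field left out of a serialize routine breaks only multi-process runs, here is a minimal sketch. It is not Dice's code and does not use Boost: the Schedule struct is a trimmed-down stand-in, epsilon and the includeCdfci switch are hypothetical, and OutArchive/InArchive are toy replacements for a real archive. Only the names cdfci_on and cdfciTol come from the thread. The defaults on the receiving side are chosen so the sketch stays well-defined; in a real program the unsent fields may simply hold whatever the non-root rank happened to have.

```cpp
#include <sstream>
#include <string>

// Hypothetical, trimmed-down stand-in for a schedule struct (not Dice's real one).
struct Schedule {
    double epsilon  = 0.0;    // hypothetical field, for illustration only
    bool   cdfci_on = false;  // field named in the thread
    double cdfciTol = 0.0;    // field named in the thread

    // Serialization-style member: the same function drives both packing and
    // unpacking. A field that is not listed here never leaves the root rank.
    // includeCdfci is a hypothetical switch: false mimics the bug, true the fix.
    template <class Archive>
    void serialize(Archive& ar, bool includeCdfci) {
        ar & epsilon;
        if (includeCdfci) {
            ar & cdfci_on;
            ar & cdfciTol;
        }
    }
};

// Toy text archives standing in for a real serialization library.
struct OutArchive {
    std::ostringstream os;
    template <class T>
    OutArchive& operator&(const T& value) { os << value << ' '; return *this; }
};

struct InArchive {
    std::istringstream is;
    explicit InArchive(const std::string& buffer) : is(buffer) {}
    template <class T>
    InArchive& operator&(T& value) { is >> value; return *this; }
};

// Simulates rank 0 packing the schedule and a second rank unpacking it.
Schedule simulateBroadcast(const Schedule& root, bool includeCdfci) {
    Schedule sender = root;  // serialize() is non-const, so pack a copy
    OutArchive out;
    sender.serialize(out, includeCdfci);

    Schedule receiver;       // a non-root rank starts from its defaults
    InArchive in(out.os.str());
    receiver.serialize(in, includeCdfci);
    return receiver;
}
```

With includeCdfci set to false, the receiver gets epsilon but keeps its default cdfci_on and cdfciTol, so the ranks disagree about the settings; with true, all three fields arrive. With a single process there is no receiver at all, which is consistent with the -np 1 runs passing.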
Thank you.
Sandeep.
Hello,
I followed the instructions in the README to compile both Boost and Dice (with mpicxx), but running SHCI/runTests.sh yielded this segfault on the restart test. I don't get this error with mpirun -np 1, but any larger number of processes triggers it. For reference, I set both USE_INTEL and HAS_AVX2 to False and disabled the MKL include and lib in the Makefile.
Do you know what can cause this error? Thanks!