Open jandrej opened 3 years ago
Assuming you configured with --enable-mixed-int, this is the correct way to run it. Which machine are you running it on? Did this problem work for you running it with 64-bit integers?
From: Julian Andrej @.> Sent: Monday, April 12, 2021 7:59 AM To: hypre-space/hypre @.> Cc: Subscribed @.***> Subject: [hypre-space/hypre] Test problem for HYPRE_MIXEDINT (#326)
In the process of getting mfem to work with the HYPRE_MIXEDINT option (see mfem/mfem#1583, https://github.com/mfem/mfem/pull/1583) we are running into issues.
I tried to run the current hypre version (recent git master) using the ij test executable with
$ srun -n1728 -ppbatch -A *** ./test/ij -P 12 12 12 -n 1400 1400 1400
to run a large enough test. This fails with a memory error:
ij: hypre_memory.c:34: hypre_OutOfMemory: Assertion `0' failed.
[***:mpi_rank_341][error_sighandler] Caught error: Aborted (signal 6)
Am I using the option wrong?
Yes, I configured with --enable-mixed-int. I ran the test on quartz. I did not try to run with the bigint option only, but I can do that if that helps.
It’s possible that you just ran out of memory, since this is a very large problem. If the 64-bit integer version works, then there really is an issue with the mixed-int version. Another thing you could try with the mixed-int version, which would use less memory, is to add -agg_nl 1 to your command line for an AMG version with lower complexity and memory requirements.
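To put "very large" into numbers, here is a rough sizing sketch for the command above. It assumes that -n gives the global grid and -P the process grid, that the default ij problem is a 7-point Laplacian, and that each nonzero costs an 8-byte value plus a 4-byte local column index. These byte sizes are illustrative assumptions, not measured hypre figures, and the estimate covers the fine-grid matrix only; the AMG hierarchy built during setup comes on top of it.

```python
# Rough sizing for: ./test/ij -P 12 12 12 -n 1400 1400 1400
# Assumption: -n is the global grid and -P is the process grid.
px, py, pz = 12, 12, 12
nx, ny, nz = 1400, 1400, 1400

ranks = px * py * pz          # should match srun -n1728
unknowns = nx * ny * nz       # one unknown per grid point (scalar problem)
per_rank = unknowns / ranks   # average share per MPI rank

# Assumed storage per nonzero: 8-byte value + 4-byte local column index,
# with about 7 nonzeros per row (7-point stencil). Fine-grid matrix only.
bytes_per_rank = per_rank * 7 * (8 + 4)

print(f"ranks: {ranks}")
print(f"global unknowns: {unknowns:,}")
print(f"unknowns per rank: {per_rank:,.0f}")
print(f"fine-grid matrix per rank: {bytes_per_rank / 2**20:.0f} MiB")
```

The fine-grid matrix alone is modest per rank under these assumptions, so any out-of-memory failure would have to come from the multigrid setup, which is what -agg_nl 1 is meant to reduce.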
Using your suggested options
$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 1400 1400 1400 -agg_nl 1
works fine.
Thanks!
I tried another option since in mfem a simple Laplace problem works fine with mixed-int. Elasticity with the systems option fails.
When I run
$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 1200 1200 1200 -sysL 2 -agg_nl 1 -intptype 10
the example segfaults (without further information about errors etc.)
Is the combination of -agg_nl 1 -intptype 10 supposed to work on -sysL 2? I expect this to produce a 7pt stencil with 2 equations.
This problem is now twice as big as the previous one, so it is possible you ran out of memory. You also solve this as a scalar problem but using a nodal interpolation (however that should not be the problem). However, we haven’t really tested this interpolation for mixed-int, so there could be an issue. Can you try to rerun setting -n 1200 1200 600?
The test also fails with a much smaller allocation
$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 700 700 700 -sysL 2 -agg_nl 1 -interptype 10
so I don't suspect running OOM here.
Probably not, but this problem is also much smaller and can be solved with 32-bit integers only, so I don’t expect this to be a mixed-int problem. You generally would not run it the way you have. Can you add -nf 2 -nodal 1 to this and see what happens? Thanks
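The "32-bit integers only" remark can be checked with a short calculation. The sketch below compares the global unknown count of each run in this thread against the largest signed 32-bit integer. It assumes that -n gives the global grid and that -sysL 2 puts two unknowns on every grid point; the helper name is made up for illustration.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer

# (label, grid points per dimension, equations per grid point)
cases = [
    ("-n 1400 1400 1400", 1400, 1),
    ("-n 1200 1200 1200 -sysL 2", 1200, 2),
    ("-n 700 700 700 -sysL 2", 700, 2),
]

def needs_64bit_global_index(n, num_functions):
    """True if the global unknown count exceeds a signed 32-bit integer."""
    return n**3 * num_functions > INT32_MAX

for label, n, nf in cases:
    total = n**3 * nf
    flag = needs_64bit_global_index(n, nf)
    print(f"{label}: {total:,} unknowns, needs 64-bit global indices: {flag}")
```

Under these assumptions only the 700 700 700 run fits in 32-bit global indices, which supports the point that its failure is unlikely to be specific to mixed-int.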
Running
$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 700 700 700 -sysL 2 -agg_nl 1 -interptype 10 -nf 2 -nodal 1
segfaults without further information
I just realized that it doesn’t make sense to combine -interptype 10 with aggressive coarsening. It also fails on a small problem with 2 processes. Obviously, this should not just segfault, so we have to do something about that. For now, remove -agg_nl 1.
I'm also having problems with mixedint tests, presumably the same or related to the problem reported here. Building 2.21.0 (patch set here)
After building the mixedint library and tests, TEST_ams, for instance, generates the output below. It corresponds to solvers.out.10 from TEST_ams/solvers.jobs; the segfault also occurs for out.8, out.9 and out.11. I have 8 processors on this system, if that's relevant.
$ cd /build/hypre-64m/src/test/TEST_ams
$ ln -s ../ams_driver
$ LD_LIBRARY_PATH=/build/hypre-64m/src/lib:$LD_LIBRARY_PATH mpirun -np 4 ./ams_driver -solver 5 -tol 1e-4 -h1
Problem size: 5080
=============================================
Setup phase times:
=============================================
AME Setup:
wall clock time = 0.010000 seconds
wall MFLOPS = 0.000000
cpu clock time = 0.008005 seconds
cpu MFLOPS = 0.000000
Solving generalized eigenvalue problem with preconditioning
block size 5
No constraints
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node sandy exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I'm also building the standard and bigint configurations. bigint does not generate an mpirun segfault.
This is with openmpi 4.1.0 and pmix 4.0.0.
More distressing than the MPI segfault itself, the error correlates with a complete Linux kernel meltdown. For some reason the hard drive bus seems to get detached after the MPI segfault is triggered, causing all filesystems to be dropped to read-only, hence a complete system failure. Since /var is also made read-only, I can't give an exact log of this behaviour.
My kernel crash is reproducible in the sense that it currently happens every time I run the mixedint tests, but does not occur with the bigint tests. However, the precise point at which the filesystem lockup occurs varies: sometimes during TEST_ams, sometimes TEST_ij or TEST_lobpcg. It happens most often in TEST_lobpcg.
With respect to the workaround suggested above, there is no -agg_nl for ams_driver. The MPI segfault reported here seems to affect several mixedint tests, not only ij but also ams_driver, sstruct, struct.