hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

Test problem for HYPRE_MIXEDINT #326

Open jandrej opened 3 years ago

jandrej commented 3 years ago

In the process of getting mfem to work with the HYPRE_MIXEDINT option (see https://github.com/mfem/mfem/pull/1583) we are running into issues.

I tried to run the current hypre version (recent git master) using the ij test executable with

$ srun -n1728 -ppbatch -A *** ./test/ij -P 12 12 12 -n 1400 1400 1400

to run a large enough test. This fails with an out-of-memory error:

ij: hypre_memory.c:34: hypre_OutOfMemory: Assertion `0' failed.
[***:mpi_rank_341][error_sighandler] Caught error: Aborted (signal 6)

Am I using the option wrong?

ulrikeyang commented 3 years ago

Assuming you configured with --enable-mixed-int, this is the correct way to run it. Which machine are you running it on? Did this problem work for you running it with 64-bit integers?

From: Julian Andrej @.> Sent: Monday, April 12, 2021 7:59 AM To: hypre-space/hypre @.> Cc: Subscribed @.***> Subject: [hypre-space/hypre] Test problem for HYPRE_MIXEDINT (#326)

jandrej commented 3 years ago

Yes I configured with --enable-mixed-int. I ran the test on quartz. I did not try to run with the bigint option only, but I can do that if that helps.

ulrikeyang commented 3 years ago

It’s possible that you simply ran out of memory, since this is a very large problem. If the 64-bit integer version works, then there really is an issue with the mixed-int version. Another thing you could try with the mixed-int version, which would use less memory, is to add -agg_nl 1 to your command line for an AMG version with lower complexity and memory requirements.

jandrej commented 3 years ago

Using your suggested options

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 1400 1400 1400 -agg_nl 1

works fine.

Thanks!

jandrej commented 3 years ago

I tried another option since in mfem a simple Laplace problem works fine with mixed-int. Elasticity with the systems option fails.

When I run

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 1200 1200 1200 -sysL 2 -agg_nl 1 -intptype 10

the example segfaults (without further information about errors etc.)

Is the combination of -agg_nl 1 -intptype 10 supposed to work on -sysL 2? I expect this to produce a 7pt stencil with 2 equations.

ulrikeyang commented 3 years ago

This problem is now twice as big as the previous one, so it is possible you ran out of memory. You are also solving this as a scalar problem while using a nodal interpolation (though that should not be the problem). However, we haven’t really tested this interpolation for mixed-int, so there could be an issue. Can you try to rerun with -n 1200 1200 600?

jandrej commented 3 years ago

The test also fails with a much smaller allocation

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 700 700 700 -sysL 2 -agg_nl 1 -interptype 10

so I don't suspect running OOM here.

ulrikeyang commented 3 years ago

Probably not, but this problem is also much smaller and can be solved with 32-bit integers only, so I don’t expect this to be a mixed-int problem. You generally would not run this as you have. Can you add -nf 2 -nodal 1 to this and see what happens? Thanks
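
For reference, a quick size check of the runs discussed in this thread. This is only a rough sketch: it assumes that -n gives the global grid dimensions and that -sysL k yields k unknowns per grid point, neither of which is spelled out here. The helper name `global_unknowns` is hypothetical, and only the global row count is compared against the 32-bit signed limit.

```python
# Rough check of whether a run needs 64-bit global indices.
# Assumptions (not confirmed in this thread): -n gives the global grid
# dimensions, and -sysL k yields k unknowns per grid point.
INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def global_unknowns(nx, ny, nz, num_functions=1):
    """Hypothetical helper: total number of global rows."""
    return nx * ny * nz * num_functions


runs = {
    "-n 1400 1400 1400": global_unknowns(1400, 1400, 1400),
    "-n 1200 1200 1200 -sysL 2": global_unknowns(1200, 1200, 1200, 2),
    "-n 700 700 700 -sysL 2": global_unknowns(700, 700, 700, 2),
}

for flags, rows in runs.items():
    verdict = "exceeds" if rows > INT32_MAX else "fits in"
    print(f"{flags}: {rows} rows, {verdict} the 32-bit range")
```

Under these assumptions the first two runs exceed the 32-bit range while the 700 x 700 x 700 run does not, which is consistent with the remark that the smaller problem can be solved with 32-bit integers.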

jandrej commented 3 years ago

Running

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 700 700 700 -sysL 2 -agg_nl 1 -interptype 10 -nf 2 -nodal 1

segfaults without further information

ulrikeyang commented 3 years ago

I just realized that it doesn’t make sense to combine interptype 10 with aggressive coarsening. It also fails on a small problem with 2 processes. Obviously, this should not just segfault, so we have to do something about that. For now, remove -agg_nl 1.
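
Until the library rejects this combination itself, a launch script could guard against it before submitting a job. The following is a minimal sketch, assuming the ij flags are available as a plain argument list; `has_unsupported_combo` is a hypothetical helper, not part of hypre, and it only knows about the one combination reported in this thread.

```python
def has_unsupported_combo(args):
    """Return True if args request aggressive coarsening (-agg_nl > 0)
    together with -interptype 10, the combination reported to segfault."""

    def value_of(flag):
        # Return the integer that follows `flag`, or None if it is absent.
        if flag in args:
            idx = args.index(flag)
            if idx + 1 < len(args):
                return int(args[idx + 1])
        return None

    agg_levels = value_of("-agg_nl")
    interp_type = value_of("-interptype")
    return bool(agg_levels) and interp_type == 10


# Example: the failing command line from this thread, split into arguments.
cmd = "./test/ij -P 12 12 12 -n 700 700 700 -sysL 2 -agg_nl 1 -interptype 10".split()
if has_unsupported_combo(cmd):
    print("Remove -agg_nl 1 when using -interptype 10.")
```

The check is deliberately narrow: it does not validate any other flags, and it treats -agg_nl 0 as harmless on the assumption that zero levels means aggressive coarsening is off.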

drew-parsons commented 3 years ago

I'm also having problems with the mixedint tests, presumably the same as or related to the problem reported here. I am building 2.21.0 (patch set here).

After building the mixedint library and tests, TEST_ams for instance generates this output (corresponding to solvers.out.10 from TEST_ams/solvers.jobs. Segfault also occurs for out.8, 9 and 11. I have 8 processors on this system, if that's relevant)

$ cd /build/hypre-64m/src/test/TEST_ams
$ ln  -s ../ams_driver
$ LD_LIBRARY_PATH=/build/hypre-64m/src/lib:$LD_LIBRARY_PATH mpirun -np 4 ./ams_driver -solver 5 -tol 1e-4 -h1
Problem size: 5080

=============================================
Setup phase times:
=============================================
AME Setup:
  wall clock time = 0.010000 seconds
  wall MFLOPS     = 0.000000
  cpu clock time  = 0.008005 seconds
  cpu MFLOPS      = 0.000000

Solving generalized eigenvalue problem with preconditioning

block size 5

No constraints

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node sandy exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I'm also building the standard and bigint configurations. bigint does not generate an mpirun segfault.

This is with openmpi 4.1.0 and pmix 4.0.0.

More distressing than the mpi segfault itself, the error correlates with a complete linux kernel meltdown. For some reason the hard drive bus seems to get detached after the mpi segfault is triggered, causing all filesystems to be dropped to read-only, hence a complete system failure. Since /var is also made read-only, I can't give an exact log of this behaviour.

My kernel crash is reproducible in the sense that it currently happens every time I run the mixedint tests, but does not occur with the bigint tests. However, the precise point at which the filesystem lockup occurs varies: sometimes during TEST_ams, sometimes TEST_ij or TEST_lobpcg, and most often TEST_lobpcg.

With respect to the workaround suggested above, there is no -agg_nl option for ams_driver. The MPI segfault reported here seems to affect several mixedint tests: not only ij but also ams_driver, sstruct, and struct.