hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

hang in hypre_MatvecCommPkgCreate() #626

Closed ptsuji closed 2 years ago

ptsuji commented 2 years ago

With Hypre 2.24.0 and MFEM 4.4, I'm trying to construct a rectangular matrix using the generic CSR interface in MFEM (https://github.com/mfem/mfem/blob/fef1928708bb455f46aa611ddf73a4fd3d1c1974/linalg/hypre.cpp#L1176), but the code hangs indefinitely upon calling hypre_MatvecCommPkgCreate() when running on more than one process (https://github.com/mfem/mfem/blob/fef1928708bb455f46aa611ddf73a4fd3d1c1974/linalg/hypre.cpp#L1312). I was able to print out the matrices right before the call, and have attached the files for a two-process decomposition here.

hypre_ParCSRMatrix.00000.txt hypre_ParCSRMatrix.00001.txt

Upon further inspection, I found that in hypre_DataExchangeList(), processor 0 is calling hypre_MPI_Testall repeatedly without success, so it's caught in an infinite loop there. I'm using the same communicator to construct the main stiffness matrix (which worked fine), but I don't think that's the issue, as I tried duplicating the communicator and it didn't change things. Is calling hypre_MatvecCommPkgCreate() on two separate matrices the problem?

As an aside - is integrated testing with MFEM done on a nightly or weekly basis? I didn't see this interface in any of the MFEM or Hypre example/test directories.

ptsuji commented 2 years ago

Tagging @tzanio @barker29 and @jamiebramwell so they know that I've posted this issue. For context, it occurs when trying to construct the constraint matrices for ALE3D implicit hydro, which I'm giving to the SchurConstrainedHypreSolver class. On a single process, the problem runs as expected, but on two or more processes, it hangs (as described above).

waynemitchell commented 2 years ago

Hi @ptsuji . Sorry for the delayed response. I'll take a look at this for you. Stay tuned.

ptsuji commented 2 years ago

Thanks @waynemitchell!

waynemitchell commented 2 years ago

@ptsuji , I made a simple code to read in your provided matrix and generate the comm pkg, and I do not observe a hang. The code seems to run as expected for me. Maybe I misunderstand the problem you are having? Or this bug arises in your code due to some combination of issues that is not present in my simple reproducer. How do you have hypre configured? Since you managed to print out the matrix just before generating the comm pkg, I guess that both processes are reaching this part of the code?

ptsuji commented 2 years ago

@waynemitchell I've attached the config.log file for Hypre here:

config.log

Are you building the rectangular matrix using the MFEM constructor (https://github.com/mfem/mfem/blob/fef1928708bb455f46aa611ddf73a4fd3d1c1974/linalg/hypre.cpp#L1176) or doing your own sequence of hypre_ParCSR functions? If the latter, which sequence of functions are you calling? Yes, I was able to get both processes to reach that part of the code.

In our code, before constructing the rectangular matrix we construct a square stiffness matrix, then use the same column partitioning of that matrix for the rectangular matrix. I don't think the construction of the first matrix has anything to do with it, but I can check if there is anything weird happening there. I just ran the address sanitizer on some serial problems and it came back clean.

waynemitchell commented 2 years ago

Hmm... using the partitioning from the square matrix may change the behavior, I guess. And this is something that is not present in my simple reproducer. I'll have to look into that more. For your reference, here is the simple code I'm running:

    /* Read IJ matrix */
    HYPRE_IJMatrix ij_A = NULL;
    HYPRE_IJMatrixRead( "hypre_ParCSRMatrix", hypre_MPI_COMM_WORLD, HYPRE_PARCSR, &ij_A );

    /* Get the ParCSR matrix */
    HYPRE_ParCSRMatrix parcsr_A = NULL;
    void *object;
    HYPRE_IJMatrixGetObject(ij_A, &object);
    parcsr_A = (HYPRE_ParCSRMatrix) object;

    /* Try creating comm pkg */
    hypre_MatvecCommPkgCreate(parcsr_A);

waynemitchell commented 2 years ago

How are you enforcing the column partitioning from the stiffness matrix? And what is that partitioning?

ptsuji commented 2 years ago

The column partitioning is in the two matrix files attached - first process owns columns 0 - 311, the second process owns columns 312 - 629. I'm enforcing it through that call to the MFEM constructor, which is calling hypre_ParCSRMatrixCreate with the column starts array under the hood:

https://github.com/mfem/mfem/blob/fef1928708bb455f46aa611ddf73a4fd3d1c1974/linalg/hypre.cpp#L1250

Do you see anything below these lines which might possibly cause issues?

waynemitchell commented 2 years ago

Ah, I misunderstood. I thought you were saying you used some other column partitioning besides the one given in the matrix files above. OK, then I don't think there should be any inherent issue in hypre_MatvecCommPkgCreate() for this matrix, since it is working fine for me. I'll take a closer look at the MFEM code and get back to you.

waynemitchell commented 2 years ago

@ptsuji I am looking at this again today. I've copied and pasted the MFEM constructor code into a standalone minimal code that reads in your matrices. I am able to reproduce the hang in this way. So my guess is that there is a bug somewhere in the constructor. I will keep digging and compare what the MFEM constructor is doing with what is done by hypre's IJ read functionality (which was able to read in your info and set up a ParCSR matrix with a comm pkg no problem).

waynemitchell commented 2 years ago

@ptsuji Well... I spoke a bit too soon. The hang I observed was due to an error I introduced in my minimal reproducer code. I'm now able to generate a comm pkg using the same copy/pasted code as the MFEM constructor.

The call that creates the comm pkg should rely on only a few things being set correctly in the ParCSR matrix. Can you print the following values to check their correctness?

hypre_ParCSRMatrixFirstColDiag(A)
hypre_CSRMatrixNumCols(hypre_ParCSRMatrixOffd(A))
hypre_ParCSRMatrixGlobalNumCols(A)

and all the values in hypre_ParCSRMatrixColMapOffd(A), which is an array of length num_cols_offd.

I obtain the following (whether reading in the matrix with hypre or trying to use the MFEM constructor):

Rank 0:
first_col_diag = 0
num_cols_offd = 108
global_num_cols = 630
colmapoffd = 312 313 314 315 316 317 318 319 320 327 328 329 330 331 332 339 340 341 348 349 350 357 358 359 366 367 368 375 376 377 384 385 386 393 394 395 402 403 404 411 412 413 432 433 434 441 442 443 450 451 452 459 460 461 468 469 470 477 478 479 486 487 488 495 496 497 498 499 500 513 514 515 516 517 518 531 532 533 534 535 536 549 550 551 558 559 560 567 568 569 576 577 578 579 580 581 582 583 584 585 586 587 612 613 614 615 616 617

Rank 1:
first_col_diag = 312
num_cols_offd = 21
global_num_cols = 630
colmapoffd = 192 193 194 198 199 200 204 205 206 210 211 212 216 217 218 222 223 224 240 241 242
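
These values can also be sanity-checked mechanically. What follows is a minimal sketch in plain C, not hypre code: `check_offd_map` is a hypothetical helper, the local column count of 318 for rank 1 is derived from the column range 312 - 629 stated earlier in the thread, and the requirement that the map be strictly increasing is an assumption based on the output shown here. The helper checks that the owned column range fits inside the global column count and that every off-diagonal column index is a valid global column that the rank does not own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of hypre or MFEM).
   Returns 0 if the values look consistent, otherwise a code for the
   first failed check:
     1: owned range [first_col_diag, first_col_diag + local_num_cols)
        does not fit inside [0, global_num_cols)
     2: an off-diagonal column index is outside [0, global_num_cols)
     3: an off-diagonal column index lies inside the owned range
     4: the map is not strictly increasing (assumed requirement) */
int check_offd_map(long long first_col_diag, long long local_num_cols,
                   long long global_num_cols,
                   const long long *col_map_offd, int num_cols_offd)
{
    assert(col_map_offd != NULL || num_cols_offd == 0);

    /* One past the last column owned by this rank. */
    long long end_col_diag = first_col_diag + local_num_cols;
    if (first_col_diag < 0 || end_col_diag > global_num_cols) return 1;

    for (int i = 0; i < num_cols_offd; i++) {
        long long c = col_map_offd[i];
        if (c < 0 || c >= global_num_cols) return 2;
        if (c >= first_col_diag && c < end_col_diag) return 3;
        if (i > 0 && c <= col_map_offd[i - 1]) return 4;
    }
    return 0;
}

/* Rank 1 off-diagonal column map as reported in this thread (21 entries). */
const long long rank1_col_map_offd[] = {
    192, 193, 194, 198, 199, 200, 204, 205, 206, 210, 211,
    212, 216, 217, 218, 222, 223, 224, 240, 241, 242
};
```

With the rank 1 values from this thread, `check_offd_map(312, 318, 630, rank1_col_map_offd, 21)` returns 0. Any global column count smaller than 630 makes it return 1, because rank 1's owned range no longer fits.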

ptsuji commented 2 years ago

@waynemitchell to set the global number of columns, I was using an MFEM function from the parallel finite element space class called GlobalVSize() (https://docs.mfem.org/html/pfespace_8hpp_source.html#l00283). However, for some reason it does not return the right number of columns when running on more than one process. I've fixed the error by getting the number of columns directly from the stiffness matrix. Sorry for the unnecessary wild goose chase.
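
A mismatch of this kind can be caught before the matrix is built by comparing the global column count against the column partition. The following is a minimal sketch in plain C, assuming the full partition is available as an array of nranks + 1 offsets. That layout is chosen here for illustration and is not necessarily what hypre or MFEM passes around, and `partition_matches_global` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a hypre or MFEM function).
   col_starts has nranks + 1 entries: rank r owns columns
   [col_starts[r], col_starts[r + 1]).
   Returns 1 if the partition starts at 0, never decreases, and ends
   exactly at global_num_cols; returns 0 otherwise. */
int partition_matches_global(const long long *col_starts, int nranks,
                             long long global_num_cols)
{
    assert(col_starts != NULL && nranks > 0);

    if (col_starts[0] != 0) return 0;
    for (int r = 0; r < nranks; r++) {
        if (col_starts[r + 1] < col_starts[r]) return 0;
    }
    return col_starts[nranks] == global_num_cols;
}

/* Partition reported in this thread:
   rank 0 owns columns 0 - 311, rank 1 owns columns 312 - 629. */
const long long thread_col_starts[] = {0, 312, 630};
```

For the partition in this thread, `partition_matches_global(thread_col_starts, 2, 630)` returns 1, while any other global column count returns 0. In an MPI code the offsets would first have to be gathered from all ranks, which this sketch leaves out.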

waynemitchell commented 2 years ago

@ptsuji no problem. Glad to know the issue is resolved! I'll go ahead and close this.