jorgensd / dolfinx-tutorial

A reimplementation of the Springer book: https://github.com/hplgit/fenics-tutorial/, covering new topics as well as transitioning from dolfin to dolfinx
https://jorgensd.github.io/dolfinx-tutorial/

PETSC ERROR when simulating flow past cylinder using mpirun #224

Closed · noirchen closed this issue 1 week ago

noirchen commented 3 weeks ago

Hello, I am running into a PETSc error when I run the ns-code2 example. I converted the notebook to a Python script and executed mpiexec -np 8 python ns-code2.py. All the test figures are generated correctly, but a PETSc error with signal number 11 (SEGV) occurs. I installed FEniCSx using conda on a Linux machine with 16 cores.

The output is

Info    : [ 30%] Difference - Performing Face-Face intersection
Info    : [ 70%] Difference - Performing intersection of shapes
Info    : [ 80%] Difference - Making faces
Info    : [ 90%] Difference - Adding holes
Info    : Meshing 1D...
Info    : [  0%] Meshing curve 5 (Ellipse)
Info    : [ 30%] Meshing curve 6 (Line)
Info    : [ 50%] Meshing curve 7 (Line)
Info    : [ 70%] Meshing curve 8 (Line)
Info    : [ 90%] Meshing curve 9 (Line)
Info    : Done meshing 1D (Wall 0.0156927s, CPU 0.015616s)
Info    : Meshing 2D...
Info    : Meshing surface 1 (Plane, Frontal-Delaunay for Quads)
Info    : Simple recombination completed (Wall 0.00590733s, CPU 0.005909s): 103 quads, 16 triangles, 0 invalid quads, 0 quads with Q < 0.1, avg Q = 0.82361, min Q = 0.425555
Info    : Simple recombination completed (Wall 0.00678814s, CPU 0.006789s): 460 quads, 0 triangles, 0 invalid quads, 0 quads with Q < 0.1, avg Q = 0.865979, min Q = 0.499255
Info    : Done meshing 2D (Wall 0.0214993s, CPU 0.0215s)
Info    : Refining mesh...
Info    : Meshing order 2 (curvilinear on)...
Info    : [  0%] Meshing curve 5 order 2
Info    : [ 20%] Meshing curve 6 order 2
Info    : [ 40%] Meshing curve 7 order 2
Info    : [ 60%] Meshing curve 8 order 2
Info    : [ 70%] Meshing curve 9 order 2
Info    : [ 90%] Meshing surface 1 order 2
Info    : Done meshing order 2 (Wall 0.00653468s, CPU 0.005628s)
Info    : Done refining mesh (Wall 0.00710949s, CPU 0.006142s)
Info    : 1952 nodes 2069 elements
Info    : Meshing order 2 (curvilinear on)...
Info    : [  0%] Meshing curve 5 order 2
Info    : [ 20%] Meshing curve 6 order 2
Info    : [ 40%] Meshing curve 7 order 2
Info    : [ 60%] Meshing curve 8 order 2
Info    : [ 70%] Meshing curve 9 order 2
Info    : [ 90%] Meshing surface 1 order 2
Info    : Done meshing order 2 (Wall 0.0258181s, CPU 0.024792s)
Info    : Optimizing mesh (Netgen)...
Info    : Done optimizing mesh (Wall 2.12e-06s, CPU 3e-06s)
Solving PDE: 100%|██████████| 12800/12800 [07:17<00:00, 28.90it/s]
[4]PETSC ERROR: ------------------------------------------------------------------------
[4]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[4]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[4]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[4]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[4]PETSC ERROR: to get more information on the crash.
[4]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
Abort(59) on node 4 (rank 4 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 4

jorgensd commented 1 week ago

I can't reproduce this with docker run -ti --network=host -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v $(pwd):/root/shared -w /root/shared --rm ghcr.io/fenics/dolfinx/dolfinx:nightly

Could you specify how you installed DOLFINx, and what version?
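
A quick way to report both pieces of information from every rank, assuming mpi4py is available, is a short script along these lines (check_version.py is just a placeholder name):

# Print the DOLFINx version from every MPI rank.
# Run with e.g.: mpiexec -np 2 python check_version.py
from mpi4py import MPI
import dolfinx
print(f"rank {MPI.COMM_WORLD.rank}: dolfinx {dolfinx.__version__}")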

noirchen commented 1 week ago

I installed dolfinx with conda on an Ubuntu system, and the version is 0.9.0.

jorgensd commented 1 week ago

Could you try to locate which line of the script causes the segfault? Since the temporal loop seems to have finished, could you comment the loop out and then remove lines one by one from the bottom to see where the issue lies?
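
An alternative to bisecting the script by hand, assuming a standard CPython interpreter, is the standard-library faulthandler module, which prints each rank's Python traceback when a fatal signal such as SIGSEGV is received:

# Sketch: enable faulthandler near the top of ns-code2.py so a segfault
# dumps the Python traceback of the crashing rank before the process aborts.
import faulthandler
faulthandler.enable()

The same effect can be obtained without editing the script by running mpiexec -np 8 python -X faulthandler ns-code2.py.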

noirchen commented 1 week ago

I suspect it is something related to mpirun, because if I run the script without mpirun, no error is produced. I also ran the script on a Mac with the same version of dolfinx installed through conda, and the result was the same: no error in serial, and a PETSc error with mpirun. In both cases, the dolfinx environment was installed with

conda create -n fenics ipython mpich pyvista
conda activate fenics
conda install -c conda-forge ipykernel fenics-dolfinx imageio gmsh python-gmsh tqdm
jorgensd commented 1 week ago

I used mpirun with 8 processes as well. That is why I asked if you could do some detective work to try to locate the offending line.

noirchen commented 1 week ago

I did what you suggested and, to my surprise, it is tqdm.autonotebook that causes the problem. That also explains why all the plots and calculations are done correctly.

jorgensd commented 1 week ago

Strange. Then just remove it, as it is not required to run the code (a simple for loop will do). @RemDelaporteMathurin, was this what you observed in festim?

noirchen commented 1 week ago

Yeah, I replaced it with a plain tqdm.tqdm and all went well.
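
For reference, a minimal sketch of that replacement inside the time-stepping loop; num_steps and the loop body are placeholders rather than the tutorial's exact code:

from tqdm import tqdm  # plain terminal progress bar instead of tqdm.autonotebook

num_steps = 12800  # placeholder value, matching the step count in the log above
progress = tqdm(desc="Solving PDE", total=num_steps)
for i in range(num_steps):
    # ... assemble and solve the time step here ...
    progress.update(1)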

RemDelaporteMathurin commented 1 week ago

@noirchen have you tried closing the tqdm bar with .close() at the end of the script?

noirchen commented 1 week ago

Yes, closing it with progress.close() eliminates the error.
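
So the tqdm.autonotebook bar can also be kept, as long as it is closed explicitly once the loop is done, for example:

# after the time-stepping loop
progress.close()  # release the progress bar before the script (and MPI) shuts down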