su2code / SU2

SU2: An Open-Source Suite for Multiphysics Simulation and Design
https://su2code.github.io

adap branch UCX error #1156

Closed timjim333 closed 3 years ago

timjim333 commented 3 years ago

Hi, I'm opening a new thread since it seems that this issue isn't directly related to the AMG mesh refinement itself, but feel free to close or move it to a more appropriate place, @pcarruscag.

I'm having an issue when running SU2_CFD from the feature_adap branch (which means it also fails when run via the mesh refinement script). It runs fine for the TestCases/euler/naca0012 case, but when I try it on my mesh I get a UCX ERROR.

On running mpirun -n 40 --use-hwthread-cpus /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg, I get variations on this message in my screen output:

|          49|   -2.095057|    0.015781|    0.001431|    0.000000|  9.1667e+04|
|          50|   -2.140503|    0.015781|    0.001431|    0.000000|  9.1667e+04|
+-----------------------------------------------------------------------+
|        File Writing Summary       |              Filename             |
+-----------------------------------------------------------------------+
|SU2 restart                        |restart_flow.dat                   |
|Paraview binary                    |flow.vtk                           |
|Paraview binary surface            |surface_flow.vtk                   |
[1609922278.175246] [super:1134625:0]           sock.c:344  UCX  ERROR recv(fd=56) failed: Bad address
[1609922278.175301] [super:1134625:0]           sock.c:344  UCX  ERROR recv(fd=54) failed: Connection reset by peer
[1609922278.175551] [super:1134625:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor

SU2_CFD: ../externals/parmetis/libparmetis/match.c:243: libparmetis__Match_Global: Assertion `k >= firstvtx && k < lastvtx' failed.
[super:1134138] *** Process received signal ***
[super:1134138] Signal: Aborted (6)
[super:1134138] Signal code:  (-6)
[super:1134138] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7fb93d021b20]
[super:1134138] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fb93c1507ff]
[super:1134138] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fb93c13ac35]
[super:1134138] [ 3] /lib64/libc.so.6(+0x21b09)[0x7fb93c13ab09]
[super:1134138] [ 4] /lib64/libc.so.6(+0x2fde6)[0x7fb93c148de6]
[super:1134138] [ 5] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a9be03]
[super:1134138] [ 6] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a94e76]
[super:1134138] [ 7] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x1a9590d]
[super:1134138] [ 8] /opt/su2/SU2v7_adap/bin/SU2_CFD[0xabb1bb]
[super:1134138] [ 9] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7ddf6b]
[super:1134138] [10] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7ded07]
[super:1134138] [11] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7df356]
[super:1134138] [12] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x7e445f]
[super:1134138] [13] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x45ba61]
[super:1134138] [14] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fb93c13c7b3]
[super:1134138] [15] /opt/su2/SU2v7_adap/bin/SU2_CFD[0x47216e]
[super:1134138] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 38 with PID 0 on node super exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Sometimes it hangs with the UCX ERROR lines straight after "Building the graph adjacency structure." in Geometry Preprocessing, and other times it seems to run fine for the first batch of iterations until it hits the first solution-file-writing iteration (as set by OUTPUT_WRT_FREQ), as shown in the output snippet above.

Do you have any hints on how to debug this or what might be causing it? Thanks.

To reproduce: I've attached the mesh and config file in this link.


pcarruscag commented 3 years ago

I think that can only mean the mesh is corrupted, which is causing memory errors within ParMETIS. Memory errors can take some time to manifest, especially in small cases. If the case is small, you can try running the serial version to see whether the problem only occurs in parallel. As for what might be the root cause of the bad mesh, I have no idea.

timjim333 commented 3 years ago

Hi @pcarruscag, thanks for the reply. So you think it might be a mesh issue? That may well be possible: I previously had a structured collar mesh around an unstructured core for supersonic evaluation, but I tried to diagonalize the collar mesh since it seemed that AMG refinement only works for triangles and tetrahedra... I could have made a mistake in that step! I'll take another look. Cheers.

vdweide commented 3 years ago

Can you run it with valgrind to check if there is a memory issue? Compile with -g. Also, does the problem persist if you reduce the number of MPI ranks?

timjim333 commented 3 years ago

Hi @vdweide, can I just double-check what I should try: compiling SU2 with -g, or running it under valgrind? Thanks!

vdweide commented 3 years ago

Compile with -g, or when using meson just add --buildtype=debug to the build arguments. Then run it as follows:

mpirun -np 40 valgrind SU2_CFD case.cfg

You will probably get quite a few false warnings from MPI, but you can filter those out. Try to reduce the number of ranks, if possible.
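
For example, a rough sketch using the meson.py/ninja wrappers that ship with SU2 (paths are illustrative; --log-file keeps each rank's valgrind output in a separate file):

./meson.py build_debug --buildtype=debug --prefix=/opt/su2/SU2v7_adap_debug
./ninja -C build_debug install
mpirun -np 2 valgrind --log-file=valgrind_out.%p.txt /opt/su2/SU2v7_adap_debug/bin/SU2_CFD case.cfg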

timjim333 commented 3 years ago

OK, I've recompiled using --buildtype=debug and I'm running valgrind now. I'll try running it with a reduced number of ranks and get back to you. Thanks.

timjim333 commented 3 years ago

@vdweide I've attached the SU2 output and the valgrind output from running on 2 processes, i.e. mpirun -n 2 --use-hwthread-cpus valgrind /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg. Attached: su2_out_2.txt, valgrind_out_2.txt

I also tried with 30 processes but valgrind gave up after stating that there were too many errors.

Sorry, I'm not so familiar with what to look out for. I'm guessing that something showing in the leak summary is a bad thing? Thanks

timjim333 commented 3 years ago

In case it helps, I also ran valgrind with --leak-check=full and --track-origins=yes. I've attached the outputs here: valgrind_out_2_leakcheck.txt, valgrind_out_2_origins.txt
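
For reference, the command was along these lines (reconstructed from the 2-process run above):

mpirun -n 2 --use-hwthread-cpus valgrind --leak-check=full --track-origins=yes /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg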

vdweide commented 3 years ago

No, it's the invalid reads and writes that are problematic. There you cross the boundaries of allocated memory, and anything can happen. What version/branch are you using? The line numbers valgrind gives do not correspond to the current develop version.

timjim333 commented 3 years ago

I'm using the 'feature_adap' branch. At least, I believe I am; I pulled the repo in this manner:

git clone https://github.com/su2code/SU2.git SU2_src
cd SU2_src
git checkout feature_adap

As far as I can tell, it's v7.0.3.
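
For what it's worth, the checkout can be confirmed with something like:

git branch --show-current   # should print feature_adap
git log -1 --oneline        # latest commit on the branch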

vdweide commented 3 years ago

That's indeed how you get the feature_adap branch. Is it possible to merge this branch with the latest version of develop first?
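
The standard workflow would be something along these lines (illustrative only; actually resolving the conflicts is the hard part):

cd SU2_src
git checkout feature_adap
git fetch origin
git merge origin/develop   # stops on any conflicts, which then need manual resolution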

timjim333 commented 3 years ago

I had a quick look at the merging process and it seems that quite a few files conflict. I'm not sure which changes I can merge from develop without accidentally breaking the feature_adap functionality. Can I more or less pull across most of these changes? I can give it a go if you can give me some pointers, but I'm not well-versed in C++! Thanks. The conflicting files are:

Common/include/CConfig.hpp
Common/include/adt/CADTElemClass.hpp
Common/include/geometry/dual_grid/CEdge.hpp
Common/include/geometry/dual_grid/CPoint.hpp
Common/include/geometry/dual_grid/CVertex.hpp
Common/include/option_structure.hpp
Common/src/adt/CADTElemClass.cpp
Common/src/geometry/CPhysicalGeometry.cpp
Common/src/geometry/dual_grid/CPoint.cpp
SU2_CFD/include/output/COutputLegacy.hpp
SU2_CFD/include/solvers/CEulerSolver.hpp
SU2_CFD/include/solvers/CSolver.hpp
SU2_CFD/src/iteration_structure.cpp
SU2_CFD/src/numerics/flow/flow_diffusion.cpp
SU2_CFD/src/output/CFlowCompOutput.cpp
SU2_CFD/src/output/output_structure_legacy.cpp
SU2_CFD/src/solvers/CEulerSolver.cpp
SU2_CFD/src/solvers/CNSSolver.cpp
SU2_CFD/src/solvers/CSolver.cpp
SU2_CFD/src/solvers/CTurbSASolver.cpp
SU2_CFD/src/solvers/CTurbSSTSolver.cpp
SU2_CFD/src/solvers/CTurbSolver.cpp
SU2_CFD/src/variables/CEulerVariable.cpp
SU2_DOT/src/meson.build
SU2_IDE/Xcode/SU2_CFD.xcodeproj/project.pbxproj
SU2_PY/pySU2/pySU2.i
SU2_PY/pySU2/pySU2ad.i
meson_scripts/init.py
preconfigure.py

vdweide commented 3 years ago

No, you cannot just do that. Somebody who worked on feature_adap should have a look at it. @bmunguia, it looks like you made the latest commit to this branch, but that is already quite some time ago (May 2020). What is the current status and do you plan to merge with the latest version of develop?

timjim333 commented 3 years ago

I see, I hope that @bmunguia will have a chance to take a look! I tried looking through the past commits but didn't manage to merge all the functions successfully. From what I can tell, these are the edited variables/functions:

CConfig class

CVertex (not sure if values should be initialised)

option_structure

CPhysicalGeometry - probably AMG stuff?

Common/src/geometry/dual_grid/CPoint.cpp

COutputLegacy.hpp

output_structure_legacy.cpp

CSolver

meson.build

init.py - add amgio stuff

preconfigure.py

timjim333 commented 3 years ago

I'm unsure whether the AMG version uses its own implementation of vertices etc., or whether these just happen to be the way they were implemented in older versions of SU2.

pcarruscag commented 3 years ago

Most likely a mixture of those two things, but it should not be too difficult to fix.

timjim333 commented 3 years ago

@pcarruscag Could you help me take a look through it, or give me some pointers on where to start? I've not programmed in C++ before (maybe a good time to start, with lockdown...), but if it's a case of figuring out how to merge already-working code, I might be able to hack something together. To be honest, though, it might be better/faster for someone who actually knows what they're doing to do it!

I wanted to use this functionality as part of another project, so I'm just wary of breaking something non-obvious in the background.

pcarruscag commented 3 years ago

I could, but I do not think updating that branch will fix your problem. We have not found any mesh-handling bugs recently, and creating/modifying meshes manually can get tricky (at least in my experience). Have you tried simpler problems? Start with a problem that is known to work (there is a long issue with success stories; search for mesh adaptation here on GitHub), then build up from it: take the same problem and use a finer grid, change the physics to what you need, use a grid for your problem (ideally change one thing at a time). Also keep in mind that if that branch were finished work, it would probably have been merged into develop by now...

timjim333 commented 3 years ago

@pcarruscag I see, sorry, I didn't realise that it still might be a mesh problem - I thought it was a memory issue from the error messages! OK, I'll give it another try from scratch. If I understand correctly, AMG only works with triangles and tetrahedra, not pyramids or quads, is that right? Thanks again.

timjim333 commented 3 years ago

Hi @pcarruscag, I just tried a simpler mesh and I still get the UCX crash when running with MPI. Attached: err_log_SU2v7.0.3.txt

To double-check, I also ran the master v7.0.8 SU2_CFD. When I run with MPI, I get the UCX error, but when I run in serial, the solution appears to converge fine. I suspect this means it's probably not the mesh causing the issues - what are your thoughts? Attached: su2_out_serial.txt

pcarruscag commented 3 years ago

I searched for "UCX error" and found, for example, https://github.com/openucx/ucx/issues/4742. I don't know for certain, but it looks like an MPI configuration problem...
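
If that is the case, one common workaround (illustrative only; whether it applies depends on how your Open MPI was built) is to steer Open MPI away from the UCX transport, e.g.:

mpirun -n 40 --mca pml ob1 --mca btl self,vader,tcp /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg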

timjim333 commented 3 years ago

Interesting - my MPI is straight from the CentOS repo, so I didn't expect it to be the issue, but I'll try compiling another version just to check.

timjim333 commented 3 years ago

After pulling in the latest OpenMPI v3 (3.1.6) and recompiling mpi4py and the SU2 branch, this error seems to have gone away! Thank you for your help @pcarruscag @vdweide
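
For anyone hitting the same thing, the rebuild was roughly along these lines (paths and versions are illustrative):

# build Open MPI 3.1.6 from source
tar xf openmpi-3.1.6.tar.gz && cd openmpi-3.1.6
./configure --prefix=/opt/openmpi-3.1.6
make -j 8 && make install
export PATH=/opt/openmpi-3.1.6/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-3.1.6/lib:$LD_LIBRARY_PATH

# rebuild mpi4py against the new MPI, then reconfigure and rebuild SU2
pip install --force-reinstall --no-binary=mpi4py mpi4py
./meson.py build --prefix=/opt/su2/SU2v7_adap
./ninja -C build install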