I think that can only mean the mesh is corrupted, which causes memory errors within ParMETIS. Memory errors can take some time to manifest, especially in small cases. If the case is small you can try running the serial version to see whether the problem only occurs in parallel; as for what might be the root cause of the bad mesh, I have no idea.
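(A minimal sketch of that serial-versus-parallel check, assuming the binary path and config name that appear later in this thread:)

```
# Serial run: no domain decomposition, so ParMETIS is never invoked.
/opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg

# Parallel run: ParMETIS partitions the mesh across the MPI ranks.
mpirun -n 4 /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg
```

If the serial run is clean and the parallel run crashes, the partitioning (or the mesh it is fed) is the first suspect.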
Hi @pcarruscag, thanks for the reply. So you think it might be a mesh issue? That may well be possible: I previously had a structured collar mesh around an unstructured core for supersonic evaluation, but I tried to diagonalize the collar mesh since it seemed that AMG refinement only works for triangles and tetrahedra. I could have made a mistake in this step! I'll take another look. Cheers.
Can you run it with valgrind to check if there is a memory issue? Compile with -g. Also, does the problem persist if you reduce the number of MPI ranks?
Hi @vdweide, can I just double-check what I should try? Compiling SU2 with `-g`, or running it under valgrind? Thanks!
Compile with `-g`, or when using meson just add `--buildtype=debug` to the arguments, to build the executable. Then run it as follows:

```
mpirun -np 40 valgrind SU2_CFD case.cfg
```

You will probably get quite a few false warnings from MPI, but you can filter those out. Try to reduce the number of ranks, if possible.
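(A hedged sketch of that debug build using SU2's meson wrapper; the build directory and install prefix here are illustrative:)

```
# Configure a debug build (adds -g, disables optimisation) and install it.
./meson.py build_debug --buildtype=debug --prefix=/opt/su2/SU2v7_adap_debug
./ninja -C build_debug install

# Run under valgrind with few ranks; each rank gets its own valgrind instance.
mpirun -np 2 valgrind /opt/su2/SU2v7_adap_debug/bin/SU2_CFD case.cfg
```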
Ok, I've recompiled using `--buildtype=debug` and I'm running valgrind now. I'll try to run it with a reduced number of ranks and get back to you. Thanks.
@vdweide I've attached the SU2 output and the valgrind output running on 2 processes, i.e.: `mpirun -n 2 --use-hwthread-cpus valgrind /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg`
su2_out_2.txt
valgrind_out_2.txt
I also tried with 30 processes but valgrind gave up after stating that there were too many errors.
Sorry, I'm not so familiar with what to look out for. I'm guessing that something showing in the leak summary is a bad thing? Thanks
In case it helps, I also ran valgrind using `--leak-check=full` and `--track-origins=yes`. I've attached the outputs here.
valgrind_out_2_leakcheck.txt
valgrind_out_2_origins.txt
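(Roughly the commands behind those two attachments, using the paths from earlier in the thread; the output redirects are illustrative:)

```
mpirun -n 2 --use-hwthread-cpus \
    valgrind --leak-check=full /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg \
    > valgrind_out_2_leakcheck.txt 2>&1

mpirun -n 2 --use-hwthread-cpus \
    valgrind --track-origins=yes /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg \
    > valgrind_out_2_origins.txt 2>&1
```

If valgrind bails out with "too many errors", its `--error-limit=no` option keeps it reporting.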
No, the invalid reads and writes are problematic. There you cross the boundaries of allocated memory and anything can happen. What version/branch are you using? The line numbers valgrind gives do not correspond to the current develop version.
I'm using the 'feature_adap' branch. At least, I believe I am; I pulled the repo in this manner:
```
git clone https://github.com/su2code/SU2.git SU2_src
cd SU2_src
git checkout feature_adap
```
As far as I can tell, it's v7.0.3.
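(A quick way to confirm which tag and commit a checkout corresponds to:)

```
# Nearest reachable tag, plus commit count and hash since that tag.
git describe --tags
# The exact commit currently checked out.
git log -1 --oneline
```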
That's indeed how you get the feature_adap branch. Is it possible to merge this branch with the latest version of develop first?
I had a quick look at the merging process and it seems like quite a few files conflict. I'm not sure which files I can merge from develop without accidentally breaking the feature_adap functionality. Can I more or less pull across most of these changes? I can give it a go if you can give me some pointers, but I'm not well-versed in C++! Thanks. The conflicting files are listed below (a rough merge sketch follows the list).
```
Common/include/CConfig.hpp
Common/include/adt/CADTElemClass.hpp
Common/include/geometry/dual_grid/CEdge.hpp
Common/include/geometry/dual_grid/CPoint.hpp
Common/include/geometry/dual_grid/CVertex.hpp
Common/include/option_structure.hpp
Common/src/adt/CADTElemClass.cpp
Common/src/geometry/CPhysicalGeometry.cpp
Common/src/geometry/dual_grid/CPoint.cpp
SU2_CFD/include/output/COutputLegacy.hpp
SU2_CFD/include/solvers/CEulerSolver.hpp
SU2_CFD/include/solvers/CSolver.hpp
SU2_CFD/src/iteration_structure.cpp
SU2_CFD/src/numerics/flow/flow_diffusion.cpp
SU2_CFD/src/output/CFlowCompOutput.cpp
SU2_CFD/src/output/output_structure_legacy.cpp
SU2_CFD/src/solvers/CEulerSolver.cpp
SU2_CFD/src/solvers/CNSSolver.cpp
SU2_CFD/src/solvers/CSolver.cpp
SU2_CFD/src/solvers/CTurbSASolver.cpp
SU2_CFD/src/solvers/CTurbSSTSolver.cpp
SU2_CFD/src/solvers/CTurbSolver.cpp
SU2_CFD/src/variables/CEulerVariable.cpp
SU2_DOT/src/meson.build
SU2_IDE/Xcode/SU2_CFD.xcodeproj/project.pbxproj
SU2_PY/pySU2/pySU2.i
SU2_PY/pySU2/pySU2ad.i
meson_scripts/init.py
preconfigure.py
```
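(A minimal sketch of what such a merge attempt looks like, assuming the clone from earlier; resolving the conflicts themselves is the hard part and really needs someone who knows the branch:)

```
# Fetch the latest develop and try to merge it into feature_adap;
# git will stop on the conflicting files listed above.
git fetch origin
git checkout feature_adap
git merge origin/develop

# List the files still in an unresolved (conflicted) state.
git diff --name-only --diff-filter=U
```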
No, you cannot just do that. Somebody who worked on feature_adap should have a look at it. @bmunguia, it looks like you made the latest commit to this branch, but that is already quite some time ago (May 2020). What is the current status and do you plan to merge with the latest version of develop?
I see, I hope that @bmunguia will have a chance to take a look! I tried to have a look through the past commits but I didn't manage to successfully merge all the functions. From what I can tell, these are the edited variables/functions:
CConfig class
CVertex (not sure if values should be initialised)
option_structure
CPhysicalGeometry - probably AMG stuff?
Common/src/geometry/dual_grid/CPoint.cpp
COutputLegacy.hpp
output_structure_legacy.cpp
CSolver
meson.build
init.py - adds the amgio stuff
preconfigure.py
I'm unsure if the AMG version uses its own implementation of vertices etc. or if these happen to be the way that they were implemented in older versions of SU2.
Most likely a mixture of those two things, but it should not be too difficult to fix.
@pcarruscag Could you help me take a look through, or give me some pointers on where to start? I've not programmed in C++ before (maybe a good time to start with lockdown...), but if it's a case of figuring out how to merge already-working code, I might be able to hack something together. To be honest, though, it might be better/faster for someone who actually knows what they're doing to do so!
I wanted to use this functionality as part of another project, so I'm just wary of breaking something not obvious in the background.
I could, but I do not think updating that branch will fix your problem; we have not found any mesh-handling bugs recently. Creating/modifying meshes manually can get tricky (at least in my experience). Have you tried simpler problems? Start with a problem that is known to work (there is a long issue with success stories; search for mesh adaptation here on GitHub), then build up from it: take the same problem and use a finer grid, change the physics to what you need, use a grid for your problem (ideally changing one thing at a time). Also keep in mind that if that branch were finished work, it would probably have been merged into develop by now...
@pcarruscag I see, sorry, I didn't realise that it still might be a mesh problem - I thought it was a memory issue from the error messages! Ok, I'll give it another try from scratch. If I understand correctly, AMG only works with triangles and tetrahedra, not pyramids or quads, is that right? Thanks again.
Hi @pcarruscag I just tried a simpler mesh and using MPI I get the UCX crash. err_log_SU2v7.0.3.txt
To double-check, I also used the master v7.0.8 SU2_CFD. When I run with MPI, I get the UCX error, but when I run in serial, the solution appears to converge fine. I suspect this means it's probably not the mesh that is causing the issues - what are your thoughts? su2_out_serial.txt
I searched for "UCX error" and found, e.g., https://github.com/openucx/ucx/issues/4742. I don't know for sure, but it looks like an MPI configuration problem...
Interesting - my MPI is straight from the CentOS repo, so I didn't expect it to be the issue but I'll try to compile another version just to check.
After pulling in the latest OpenMPI v3 (3.1.6) and recompiling mpi4py and the SU2 branch, this error seems to have gone away! Thank you for your help @pcarruscag @vdweide
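(For anyone hitting the same thing, roughly the shape of that fix; the OpenMPI version is from this thread, the prefix and other details are illustrative:)

```
# Build OpenMPI 3.1.6 from source into its own prefix.
wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.6.tar.gz
tar xf openmpi-3.1.6.tar.gz && cd openmpi-3.1.6
./configure --prefix=/opt/openmpi-3.1.6
make -j && make install

# Put the new MPI first on the path before rebuilding mpi4py and SU2.
export PATH=/opt/openmpi-3.1.6/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-3.1.6/lib:$LD_LIBRARY_PATH
pip install --no-cache-dir --force-reinstall mpi4py
```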
Hi, I'm opening a new thread since it seems that this issue isn't directly related to the AMG mesh refinement itself, but feel free to close or move this to a more appropriate place @pcarruscag
I'm having an issue when running `SU2_CFD` in the `feature_adap` branch (so this means that it also fails when trying to run the mesh refinement script). It seems to run fine for the `TestCase/euler/naca0012` case, but when I try it on my mesh I get a `UCX ERROR`.

On running:

```
mpirun -n 40 --use-hwthread-cpus /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg
```

I seem to get variations on this message in my screen output. Sometimes it hangs at the `UCX ERROR` lines straight after `Building the graph adjacency structure.` in Geometry Preprocessing, and other times it seems to run fine for the first batch of iterations until it hits the first solution-file-writing iteration (as set by `OUTPUT_WRT_FREQ`), shown in the above output snip.

Do you have any hints on how to debug this or what might be causing this? Thanks.
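(Not something tried in this thread, but a standard OpenMPI knob for isolating UCX is forcing the older ob1 transport; the flags below are plain OpenMPI options, the paths are from this post:)

```
# Bypass UCX by selecting the ob1 PML over shared-memory/TCP BTLs;
# if the crash disappears, the UCX layer rather than SU2 is suspect.
mpirun --mca pml ob1 --mca btl self,vader,tcp \
       -n 40 --use-hwthread-cpus /opt/su2/SU2v7_adap/bin/SU2_CFD test.cfg
```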
To Reproduce
I've attached the mesh and config file in this link.