trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.22k stars 570 forks source link

STK: Use of uninitialized variable #8206

Closed lxmota closed 4 years ago

lxmota commented 4 years ago

Bug Report

@trilinos/stk

Description

Throws error in 4-proc run but not in serial when reading a small exodus file in Albany/LCM. The geometry is a simple parallelepiped with 128 hexahedral elements, and it is attached here:

cuboid.zip

The error is:

Error adding (FACE_RANK,334) to mesh: topology defined as both TRIANGLE_3 and as TRIANGLE_4; a given mesh entity must have only one topology.

The function that throws the error is:

https://github.com/trilinos/Trilinos/blob/master/packages/stk/stk_mesh/stk_mesh/base/MetaData.cpp#L1087

This function defines an object stk::topology topology here:

https://github.com/trilinos/Trilinos/blob/master/packages/stk/stk_mesh/stk_mesh/base/MetaData.cpp#L1091

which is a struct defined here:

https://github.com/trilinos/Trilinos/blob/master/packages/stk/stk_topology/stk_topology/topology.hpp#L63

that struct has no default constructor, so the stk::topology topology object is left uninitialized. But then it is used here, apparently uninitialized:

https://github.com/trilinos/Trilinos/blob/master/packages/stk/stk_mesh/stk_mesh/base/MetaData.cpp#L1104

which is an if statement that has the error message that we see above.

ikalash commented 4 years ago

Taking myself off assignee list but tagging myself as an interested party. @ikalash

lxmota commented 4 years ago

@ikalash How do you do that? I couldn't figure out and that's why I just put you as assignee. Just by mentioning you?

ikalash commented 4 years ago

I just put @ikalash in the issue at the end, or anywhere in the comments. Then all the people tagged will get notifications.

alanw0 commented 4 years ago

Thanks for the bug report, we will investigate right away, and we'll keep you posted.

alanw0 commented 4 years ago

I took your cuboid mesh files and used a small test program to read them with stk-io. I tried using 4 procs and reading file-per-processor, I tried 4 procs and read the serial-mesh-file with auto-decomp, and I tried 1 proc reading the serial-mesh-file. I also ran with and without valgrind. There were no errors or issues for any of these runs. I notice that the mesh contains only hex-8 elements, which means the surfaces/sidesets will be quad-4 faces. There are no triangles in this mesh. Is it possible that the attached mesh is the wrong mesh? Or perhaps we need to reproduce this using an albany program? Can you give me a recipe for building/running an albany program that reproduces this error?

One last caveat is, I was using the version of stk in sierra, which is newer than the current version in Trilinos, however I don't know of any recent fixes that would have affected this. But I will repeat my stk-io runs with the current Trilinos version of stk to see whether something has changed since the last snapshot.

ikalash commented 4 years ago

@alanw0 : Thanks for looking at this! I suspect you will need to run Albany to reproduce the error, but @lxmota will have to confirm. I think the error only happens when using the mesh erosion capabilities that have recently been added to Albany. I was also puzzled as to why there is an error about triangles for an all-hex mesh when I saw the error message. @lxmota can you please confirm that the issue only happens when encountering the erosion capability in Albany? If that's the case, I am happy to provide instructions to @alanw0 for how to build Albany to reproduce the error.

lxmota commented 4 years ago

@alanw0 Yes, thanks for checking on this. I rebuilt the mesh from scratch to make sure that it is composed of hex elements. Also, the error sometimes complains about mixing quadrilaterals and triangles, with this very same mesh. As @ikalash says, we have only observed this error in one case that happens to do mesh adaptation by means of removing elements from the mesh the same way that Sierra/SM does "element death" (@ikalash I only observe this in the MiniErosion test and only in parallel). In fact, I ported some of Sierra/SM's algorithms to do this. But the error happens right at the beginning, when reading the mesh, before the simulations starts.

@ikalash is probably right that we would need to build Albany to trigger this error since it doesn't happen with STK only.

ikalash commented 4 years ago

@alanw0 : I can provide build instructions for Albany to reproduce the error on a machine of your choosing. Any preference for a machine? I have scripts for CEE and a generic rhel7 machine with seems modules. Would one of those work?

@lxmota can you please send me the relevant Albany files to reproduce the error so I can pass them along to Alan?

lxmota commented 4 years ago

@ikalash The files are those of the MiniErosion test:

https://github.com/SNLComputation/LCM/tree/master/tests/LCM/ACE/MiniErosion

And to reproduce:

mpirun -np 4 Albany coupled.yaml

within that subdirectory.

alanw0 commented 4 years ago

@ikalash CEE sounds like it should work, I have a cee blade. Thanks!

ikalash commented 4 years ago

@lxmota oh ok so it’s my example. Thanks.

@alanw0 I will send instructions later today by email. Please stay tuned.

tasmith4 commented 4 years ago

@lxmota we are trying to reproduce, and instead of seeing the error you reported, we get a segfault that appears to be in a different area of the code. Can you give us the full output of your failing run so we can see where the segfault appears in relation to your error?

ikalash commented 4 years ago

I believe Alejandro is out on Mondays, so I'm attaching my output. This is from a mixed build (debug Albany built on top of release Trilinos). Alejandro may be able to provide more information, as he ran the problem through a debugger using Eclipse. We can wait for him to comment.

out.txt

tasmith4 commented 4 years ago

Thanks @ikalash, we'll take a look.

Tagging @ddement

lxmota commented 4 years ago

I can confirm that I still get the error shown in the output log that @ikalash attached.

lxmota commented 4 years ago

More info. On CEE I actually see a segfault as well, as in the cee.txt attached file. @ikalash and I usually compile and run on local SNL/CA machines that have Fedora on them. On those machines and in release mode, the error is as described above, as seen in the fedora-release.txt file. But if I compile in debug mode, I see a different error, stk::CommBuffer::unpack<T>(...){ overflow by -12 bytes. }, as shown in the file fedora-debug.txt. cee.txt fedora-debug.txt fedora-release.txt

alanw0 commented 4 years ago

We are able to reproduce the segfault, which is the same error as the unpack/overflow in debug (the throw is active in debug, but in release it proceeds to the segfault). We will figure out what's going on with that error. We haven't seen the error about triangles. We suspect that it is related to the same error, meaning that overflowing the buffer/unpack may be reaching into bad memory and somehow causing the corruption that manifests as the triangle error. We'll keep you posted.

lxmota commented 4 years ago

@alanw0 Ok, thanks. One more bit of info: I just checked and on CEE in debug mode I also get the segfault, the same as on CEE in release mode. I don't get the throw, not sure why. My debug compile is both for Trilinos and Albany on CEE or locally.

alanw0 commented 4 years ago

@lxmota that's weird, I just confirmed the throw is inside #ifndef NDEBUG... oh well, in any case we'll figure out what's going on.

alanw0 commented 4 years ago

Hi Irina and Alejandro, we've been digging deep on this issue... We have something for you to try. On line 83 of Albany_IossSTKMeshStruct.cpp you are making a call to set sideset face creation behavior. If you don't do that, (i.e., comment out the line) then this test appears to run. This enum being set by that call is old, and we hope to deprecate it anyway. The default setting for that provides more correct face-creation behavior (at least in all of the cases we are aware of). We haven't determined whether this change is ok for all of your tests, could you take a look and try it out?

alanw0 commented 4 years ago

P.S. It looks like this problem does throw later in the run, where you appear to be asserting that a side should only have 1 element attached. But this case was a sideset between element-blocks, and so a side would be expected to have 2 attached elements.

ikalash commented 4 years ago

Thanks @alanw0 ! I am investigating this now. It looks like line 83 was added by @lxmota in May of this year from git blame. @lxmota did you add this because it was necessary for the erosion?

ikalash commented 4 years ago

@alanw0 : I just tried this and unfortunately with line 83 commented out, the test fails in serial (it does now run in parallel, as you observed). Here's the error:

: STKDisc: nodeset x- has size 45  on Proc 0.
1: STKDisc: nodeset y+ has size 45  on Proc 0.
1: STKDisc: nodeset y- has size 45  on Proc 0.
1: STKDisc: nodeset z+ has size 25  on Proc 0.
1: STKDisc: nodeset z- has size 25  on Proc 0.
1: STKDisc: sideset bluff_face-erodible has size 176  on Proc 0.
1: bulkData.num_elements(sidee) != 1 ALBANY_PANIC condition at /home/ikalash/ACE/Albany-LCM-master/src/disc/stk/Albany_STKDiscretization.cpp +2116
1: STKDisc: cannot figure out side set topology for side set bluff_face-erodible

@alanw0 , you must get it too if you run the test in serial (``./Albany coupled.yaml'').

It looks like that line is needed to work correctly with your erosion logic @lxmota . Any idea for why the erosion logic doesn't work correctly in serial when the line is commented out?

alanw0 commented 4 years ago

@ikalash that is the same error (num_elements != 1) that I saw in parallel. Let me double-check that this sideset is between 2 element-blocks....

ikalash commented 4 years ago

@alanw0 : Yes I just got the same error running on 4 procs but later on. Thanks for checking.

lxmota commented 4 years ago

@ikalash I don't recall adding or modifying that line. The git blame is probably just a clang-format thing.

alanw0 commented 4 years ago

@ikalash that is indeed a side between 2 elements, so it's not an erroneous condition. The Albany error message there says "cannot figure out side set topology..." but we can definitely provide the topology from stk-mesh. We can provide the topology of the side, and we can also provide the topology of the element block or blocks that are touching the sideset. Let me know what you need there and I'll help you figure out how to get it.

ikalash commented 4 years ago

@alanw0 : hmmm, interesting. Thanks for looking at it. I'm not exactly sure what we need here - Alejandro was the one who implemented the erosion capability in Albany. Perhaps you can work with @alanw0 , @lxmota , to fix the code to have the expected behavior?

lxmota commented 4 years ago

@ikalash @alanw0 I'm not sure what we need here either, or why this error would be triggered by the mesh adaptation that I implemented, since I didn't modify this part of the code for that. It looks like if we provide the correct topology, this error could go away, so perhaps that's the way to go.

ikalash commented 4 years ago

FWIW, a handful of Albany tests that do not do erosion fail with line 83 of Albany_IossSTKMeshStruct.cpp commented out. Here is the list:

The following tests FAILED:
          4 - ACE_StandAloneThermal2D_2Blocks (Failed)
          5 - ACE_StandAloneThermal2D_2Blocks_Explicit (Failed)
         15 - ACE_MiniErosion (Failed)
         92 - MechanicsPorePressureLocalized_Serial (Failed)
         93 - MechanicsPorePressureParallelFlow_Serial (Failed)
        101 - MechanicsWithHydrogenParallel_SERIAL (Failed)
        102 - MechanicsWithHydrogenOrthogonal_SERIAL (Failed)
        130 - SurfaceElement_SERIAL (Failed)
        183 - SteadyHeat2DInternalNeumann (Failed)

I spot checked one of them and here is how it fails:

2: STKDisc: nodeset bottom_right has size 0  on Proc 0.
2: STKDisc: nodeset left has size 0  on Proc 0.
2: STKDisc: nodeset right has size 8  on Proc 0.
2: STKDisc: nodeset top_left has size 2  on Proc 0.
2: STKDisc: nodeset top_right has size 8  on Proc 0.
2: STKDisc: sideset surface_1 has size 7  on Proc 0.
2: bulkData.num_elements(sidee) != 1 ALBANY_PANIC condition at /home/ikalash/ACE/Albany-LCM-master/src/disc/stk/Albany_STKDiscretization.cpp +2116
2: STKDisc: cannot figure out side set topology for side set surface_1

Somehow the sideset topology seems to be getting messed up with the line commented out.

alanw0 commented 4 years ago

Nope, that's the same error we were discussing above. What's going on is, when that line 83 is commented out, then the mesh is fully (correctly) connected. These meshes have sidesets in the middle of the mesh, i.e., side (face) entities are sandwiched between elements. albany doesn't know how to pick which side (which element-side pair) is the correct one for the sideset. I'm going to a) make sure stk is connecting the mesh correctly, and b) give you some code that enables you to pick the correct elem-side pair for a side that is internal to a contiguously-connected mesh. I'll get back to you soon.

alanw0 commented 4 years ago

@ikalash update: we're still looking at this. Some of those tests have actual double-sided sidesets, whereas others are possibly being incorrectly manipulated by stk. We'll keep you posted, when we have a good solution.

lxmota commented 4 years ago

@alanw0 Some of the double-sided sidesets are deliberate. Those are in there to mimic a moving pressure boundary condition. The way that is done is to add to the sideset all nodes that are on the potential path of the moving surface where the pressure boundary condition is to be applied. The boundary surface moves due to erosion of the material and the corresponding mesh adaptation that mimics the erosion. Within Albany, there is logic to determine if the nodes corresponding to the sideset are on the actual boundary, and if so, then the pressure boundary condition is applied to them. Hopefully this isn't too problematic.

alanw0 commented 4 years ago

Thanks @lxmota . So how many tests should I generally expect to pass or fail if I just do 'make Albany' and 'ctest'? When I back out some changes that I've been experimenting with, I'm still seeing some tests failing. If your tests are all expected to pass, I'll make sure I'm exactly in sync with trilinos/develop and then go from there.

ikalash commented 4 years ago

All tests should pass. They do in our nightlies.

alanw0 commented 4 years ago

Thanks @ikalash

alanw0 commented 4 years ago

ok, sorry for how long this took, but this issue should be fixed by #8323 .

ikalash commented 4 years ago

I can confirm that this issue has been resolved in Albany! Sorry for the delay commenting - I somehow missed the original comment 3 days ago about the fix. Thanks so much for getting this resolved for us @alanw0 ! It will be important for our Arctic coastal erosion simulations.

alanw0 commented 4 years ago

Thanks @ikalash