Open ikalash opened 2 years ago
@alanw0 : I was wondering if you might have any insight into this issue? Basically we're running a really simple cuboid geometry in Albany, given by the following .jou file: https://github.com/sandialabs/LCM/blob/main/tests/LCM/ACE/MiniErosion/grid/cuboid_denudation.jou . If we add the following 3 lines, which we need for the geometry to be defined correctly and for erosion of the geometry to happen:
the problem does not run in parallel and fails with the error described in this issue. It happens as soon as the problem starts up, before any computation has been done. Do you have any insight into this from the STK side?
@ikalash we’ll grab that journal file and take a look, I’ll let you know.
Thanks @alanw0 , sounds great!
@ikalash it's taking me a little longer to get to this, we've got a hectic week going on... I'm going to open a sierra ticket to increase the odds that stk-team members can look at this right away.
No problem - thanks @alanw0 !
@ikalash how many processors are you using when you see this crash? I took the journal file and made a mesh with cubit, and read the mesh into a simple stk program (it uses stk-io to read into stk-mesh), and called stk::mesh::create_adjacent_entities, and it runs without any trouble on 2, 3, and 4 processors. I have lines 32-34 in the journal file.
Thanks for checking this @alanw0 . The issue happens for > 1 proc. However, it looks like the problem has to do with the Erosion capabilities, not anything in STK.
@lxmota : if I run this problem thermal-only with Erosion activated, I get the error; if I turn off Erosion, it runs. Also, if I run a similar problem, thermal only through the Thermal 3D problem (which does not have erosion), it runs. Could you please have a look at the issue? I am attaching input files for the three cases I ran, 1 of which fails (thermal_denudation_only.yaml) and 2 of which run (thermal_denudation_only_no_erosion.yaml, input_forward_euler.yaml). input_forward_euler.yaml.txt thermal_denudation_only_no_erosion.yaml.txt thermal_denudation_only.yaml.txt
ok great, I'll cancel the stk ticket.
So I've dug into this a bit more using the method of bisection to try to understand when the problem started. What this revealed is that the following line in the .jou file that generates the input mesh file is the one that is causing this problem:
sideset 2 face with (x_coord <= {0 + eps}) and (y_coord > {-L/2 - eps}) and (y_coord < {L/2 + eps}) and (z_coord > {0 - eps}) and (z_coord < {2*L + eps})
sideset 2 name "bluff_face-erodible"
If I replace this with
sideset 2 surface with (x_coord <= {0 + eps})
sideset 2 name "bluff_face-erodible"
then the test case runs. I verified going back to Dec. 2020 that the same holds (with the first two lines, there's an error, with the second two lines, things work fine).
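The thread does not say how the bisection was carried out. A generic sketch of the idea, assuming an ordered list of candidates (commits, or cumulative versions of the journal file) where everything before the culprit passes and everything from the culprit onward fails, might look like this. The names `first_failing` and `revisions` and the fake pass/fail rule are hypothetical, used only for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_failing(candidates: Sequence[T], fails: Callable[[T], bool]) -> int:
    """Return the index of the first candidate for which fails() is True.

    Assumes the candidates are ordered so that all entries before the
    culprit pass and all entries from the culprit onward fail.
    """
    if not candidates:
        raise ValueError("no candidates to bisect")
    lo, hi = 0, len(candidates) - 1
    if not fails(candidates[hi]):
        raise ValueError("last candidate passes; nothing to bisect")
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(candidates[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid + 1  # culprit is strictly after mid
    return lo


# Hypothetical history: revisions rev0..rev7, failing from rev5 onward.
revisions = [f"rev{i}" for i in range(8)]
culprit = first_failing(revisions, lambda rev: int(rev[3:]) >= 5)
print(revisions[culprit])
```

In practice `fails` would build the mesh and run the test case for the given candidate, so each probe is expensive and the logarithmic number of probes is what makes the approach attractive.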
Here is the kicker: in the denudation test where the error happens, there is no BC being set on bluff_face-erodible. We should therefore be able to just remove that from the .jou file and the test should run, which it does. What is weird is that the final solution is not exactly the same as it was when bluff_face-erodible was a side set (SS) in the mesh, which does not make any sense to me. @lxmota , any thoughts?
I'm inclined to just remove bluff_face-erodible from the .jou file and rebaseline the test case, but it is very disturbing to me that the presence of that SS influences the solution, given that no BC is being set on it.
So there is something else that is disturbing. If I remove the definition of bluff_face-erodible in the .jou file and run on 4 procs, then again on 1 proc, the results are different. Attached is a screenshot: the right picture is on 4 procs, the left one is on 1 proc. This suggests there is a parallel inconsistency bug.
We should figure out the dfDT / theta stuff first that seems to be causing the solution to be mesh-dependent, but I wanted to document this parallel issue here for future reference.
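A comparison like the 4-proc versus 1-proc one described above can be made quantitative rather than visual. The following is a minimal sketch, assuming the two solutions have already been extracted into dictionaries keyed by global node id; the function names, the tolerance, and the numbers are all hypothetical and not taken from the actual runs.

```python
def max_abs_difference(serial, parallel):
    """Largest absolute nodal difference between two solutions.

    Both arguments map a global node id to a scalar value
    (for example temperature) at that node.
    """
    if serial.keys() != parallel.keys():
        raise ValueError("solutions are defined on different node sets")
    if not serial:
        return 0.0
    return max(abs(serial[node] - parallel[node]) for node in serial)


def solutions_agree(serial, parallel, tol=1e-10):
    """True if the two solutions match at every node to within tol."""
    return max_abs_difference(serial, parallel) <= tol


# Hypothetical nodal values, made up for illustration only.
serial_run = {1: 270.0, 2: 271.5, 3: 272.25}
parallel_run = {1: 270.0, 2: 271.5 + 1.0e-3, 3: 272.25}

print(max_abs_difference(serial_run, parallel_run))
print(solutions_agree(serial_run, parallel_run))
```

Keying by global node id matters here because a decomposed run will generally write nodes in a different order than a serial run.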
Following up on the discussion at this week's ACE meeting: I have verified that the ACE problems run without encountering this error when erosion is turned off, with the caveat that currently Tempus must be used, as Piro assumes adaptation is happening. I will open a separate issue to allow users to run cases with Piro and no erosion for ACI/NH.
@ikalash @alanw0 Sorry to come back to this issue after so long. I refer to Irina's findings about the journal file, quoted earlier in this thread: with the face-based definition of sideset 2 the test fails, with the surface-based definition it runs, and removing bluff_face-erodible altogether changes the final solution even though no BC is set on it.
The original journal file creates a mesh with a side set that includes all faces, both on the boundary and interior. The side set is created with:
sideset 2 face with (x_coord <= {0 + eps}) and (y_coord > {-L/2 - eps}) and (y_coord < {L/2 + eps}) and (z_coord > {0 - eps}) and (z_coord < {2*L + eps})
sideset 2 name "bluff_face-erodible"
I emphasize that, due to the geometry of the body, the limits in that Cubit statement should place every face in the mesh in the side set. This is the result:
This mesh is correct for our purposes because we need a side set with both boundary and interior faces in order to propagate boundary conditions once we start removing elements on the boundary. But this mesh produces the error in STK referenced above. The issue appears to be the call to stk::mesh::create_adjacent_entities in this mesh when it is decomposed.
If instead we create the side set with:
sideset 2 surface with (x_coord <= {0 + eps})
sideset 2 name "bluff_face-erodible"
the mesh has a side set that includes only the faces on the boundary. This seems strange because removing the limits on the Y and Z coordinates should still include all faces of the mesh in the side set due to the geometry of the body. But this is the result:
This mesh is not correct for our purposes because the lack of interior faces on the side set will prevent propagation of boundary conditions when boundary elements are removed. It runs without producing the STK error, but the results are not correct.
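To make the difference between the two side sets concrete, here is a toy sketch on a structured hex grid that is not Cubit and not the actual mesh. It assumes a body lying entirely in x <= 0, so that a coordinate test on mesh faces picks up every face, interior ones included, whereas a selection restricted to the boundary picks up only exterior faces. One plausible reading, not confirmed in this thread, is that the `surface` keyword selects geometric surfaces (which exist only on the boundary of the volume) while `face` selects mesh faces. All function names below are hypothetical.

```python
from itertools import product


def enumerate_faces(n, h=1.0):
    """List (centroid, on_boundary) for every face of a structured hex grid.

    The grid has n = (nx, ny, nz) cells of size h and occupies
    [-nx*h, 0] x [0, ny*h] x [0, nz*h], i.e. it lies entirely in x <= 0.
    """
    origin = (-n[0] * h, 0.0, 0.0)
    faces = []
    for axis in range(3):
        # Faces normal to `axis` sit on grid planes along that axis
        # and at cell centers along the other two axes.
        ranges = [range(n[d] + 1) if d == axis else range(n[d]) for d in range(3)]
        for idx in product(*ranges):
            centroid = tuple(
                origin[d] + (idx[d] if d == axis else idx[d] + 0.5) * h
                for d in range(3)
            )
            on_boundary = idx[axis] in (0, n[axis])
            faces.append((centroid, on_boundary))
    return faces


def select_by_coordinates(faces, eps=1e-6):
    """Mimic a coordinate test on mesh faces: keep faces with x <= 0 + eps."""
    return [face for face in faces if face[0][0] <= 0.0 + eps]


def select_boundary_only(faces, eps=1e-6):
    """Same coordinate test, restricted to faces on the exterior of the body."""
    return [face for face in select_by_coordinates(faces, eps) if face[1]]


faces = enumerate_faces((2, 2, 2))
with_interior = select_by_coordinates(faces)
boundary_only = select_boundary_only(faces)
print(len(faces), len(with_interior), len(boundary_only))
```

On this 2x2x2 grid the coordinate test keeps all 36 faces, while the boundary-only selection keeps the 24 exterior ones, so the 12 interior faces are exactly what the second side set lacks.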
So there are two issues:
stk::mesh::create_adjacent_entities on a decomposed mesh.
A couple of comments about this.
A newer alternative to create_adjacent_entities is create_all_sides in SkinBoundary.hpp. This might work better; it's worth a try.
Another point is that sidesets which include sides that are internal to an element block (i.e., not on the boundary between two element blocks) are technically not supported in stk-mesh. Functions like is_positive_sideset_face_polarity in SidesetUtil.hpp can't work for purely internal sides.
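As a way to see which sides of a side set fall into the unsupported category, the sketch below classifies each face by the elements that share it. It is a plain-Python illustration under simple assumptions (each element lists its face ids and belongs to exactly one block); it does not use or mirror any stk-mesh API, and the names `classify_sides`, `element_faces`, and `element_block` are hypothetical.

```python
from collections import defaultdict


def classify_sides(element_faces, element_block):
    """Classify each face as 'exterior', 'block_interface', or 'internal'.

    element_faces maps an element id to the ids of its faces.
    element_block maps an element id to the name of its element block.
    A face is 'internal' when it is shared by elements of the same block.
    """
    owners = defaultdict(list)
    for elem, face_ids in element_faces.items():
        for face in face_ids:
            owners[face].append(elem)

    kinds = {}
    for face, elems in owners.items():
        blocks = {element_block[elem] for elem in elems}
        if len(elems) == 1:
            kinds[face] = "exterior"
        elif len(blocks) > 1:
            kinds[face] = "block_interface"
        else:
            kinds[face] = "internal"
    return kinds


# Hypothetical row of three elements; only the faces along the row are listed.
element_faces = {"e1": ["f0", "f1"], "e2": ["f1", "f2"], "e3": ["f2", "f3"]}
element_block = {"e1": "block_1", "e2": "block_1", "e3": "block_2"}

kinds = classify_sides(element_faces, element_block)
print(kinds)
```

Here f1 is the only face flagged as internal, since both of its elements are in block_1; f2 sits between two blocks and f0 and f3 are on the exterior.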
Ah ok. Thanks for the info. It seems that we need to think of a different way to deal with our BCs other than using internal side sets.
The ACE_MiniErosion_Denudation_Parallel test is failing with the following cryptic STK-related error:
The error happens immediately after the problem starts, and is therefore not an issue related to erosion. The problem does not appear to be related to recent changes in Albany, as the same error is encountered w/o the changes if the .jou file is modified to merge together the mesh, something that was not being done before (leading to incorrect behavior that fell through the cracks).