That's ... not an obvious question. Looking into it.
Ah, I see. Try running in debug or devel modes and this code dies with:
Assertion `flags_were_consistent' failed.
Note what you're doing with those refinement flags: you're setting REFINE only when the element's pid is 0 and the element is local to the current rank. So those elements get marked for refinement on rank 0, but their ghosted copies on the other ranks don't, and the mesh's refinement flags go out of sync.
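For concreteness, here is a minimal sketch of the two flagging patterns being discussed, assuming libMesh's MeshBase/Elem API and an AMR-enabled build; the function names are illustrative, not taken from the original code:

```cpp
#include "libmesh/mesh_base.h"
#include "libmesh/elem.h"

using namespace libMesh;

// Inconsistent pattern (roughly the loop described above): only the
// owning rank sees a pid-0 element in its *local* range, so the ghosted
// copies of that element on other ranks never get flagged.
void flag_pid0_local_only (MeshBase & mesh)
{
  for (auto & elem : mesh.active_local_element_ptr_range())
    if (elem->processor_id() == 0)
      elem->set_refinement_flag(Elem::REFINE);
}

// Consistent pattern: every rank that can see an element (owned or
// ghosted) makes the same flagging decision.
void flag_pid0_everywhere (MeshBase & mesh)
{
  for (auto & elem : mesh.active_element_ptr_range())
    if (elem->processor_id() == 0)
      elem->set_refinement_flag(Elem::REFINE);
}
```

The second loop also visits ghosted elements, so every rank that stores a copy of a pid-0 element reaches the same flagging decision.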
Hmm... but changing that loop to run over all active elements only fixes things for me on ReplicatedMesh, not DistributedMesh. Let me see if I can figure out what's going on in that case; the modified code isn't throwing an assertion.
Yes, I use DistributedMesh for the calculation. When I use mesh.active_element_ptr_range(), the result is the same. I don't know what happened; I guess there is an error in elem->unique_id().
Okay, I think I've found the bug. It's DistributedMesh-specific and triggered by some parallel AMR corner cases, so I'm guessing the underlying problem is that the test coverage in MOOSE doesn't do enough large distributed adaptivity problems with unique_id-using components, while IIRC the test coverage in libMesh is even worse: it doesn't hit unique_id at all except to assert parallel consistency of individual DofObjects.
The bugfix looks easy but expanding our test coverage enough to make sure the fix is really working (and that there's not a second bug somewhere...) might take a little while.
> When I use mesh.active_element_ptr_range(), the result is the same.
Well, it's usually the same, but there are corner cases; if you crank up K and/or n_processors enough, you'll see a bug again. Are you currently relying on DistributedMesh + unique_id? If so, then I'll push a likely bugfix right away so you can at least have a little more safety while I investigate deeper.
Yes, I do rely on DistributedMesh + unique_id. Please fix this bug. Thanks!
You'll probably want to use #2486; it should cherry-pick cleanly onto whatever version you're using currently.
It fixes all the tests I've thrown at it manually, but this bug is so embarrassing that I don't want to close this ticket until I've got more heavyweight tests running in CI to make sure there aren't any broken cases I'm still missing. Thanks so much for helping catch this; I'm quite happy that I didn't just see the bug in your code and stop looking at that point.
Yes, it seems to work. Thank you. If I have other questions about unique_id, I will continue to ask here.
I think #2491 should have fixed the last issues with unique_id in DistributedMesh.
When I compile and execute the following code with the command mpiexec -n 2 ./ex, I get the following results:
I find that, once the number of adaptations is greater than 2, the size of the uniqueSet on process 0 no longer matches the number of ids inserted into it. I think they should be the same, so why are they different?
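For reference, here is a minimal sketch of the kind of consistency check being described, assuming the libMesh API and a unique-id-enabled build; the function name check_unique_ids and the visited-element counter are my own reconstruction, not the original ex code:

```cpp
#include "libmesh/libmesh_common.h"
#include "libmesh/id_types.h"
#include "libmesh/mesh_base.h"
#include "libmesh/elem.h"

#include <set>

// Collect every active element's unique_id into a set and compare the
// set size against the number of elements visited; if unique_ids were
// duplicated after refinement, the set ends up smaller than the count.
void check_unique_ids (const libMesh::MeshBase & mesh)
{
  std::set<libMesh::unique_id_type> unique_set;
  std::size_t n_visited = 0;

  for (const auto & elem : mesh.active_element_ptr_range())
    {
      unique_set.insert(elem->unique_id());
      ++n_visited;
    }

  libMesh::out << "rank " << mesh.comm().rank()
               << ": unique_id set size = " << unique_set.size()
               << ", active elements visited = " << n_visited
               << std::endl;
}
```

If the two numbers differ after refinement, some active elements on that rank are sharing a unique_id.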