libMesh / libmesh

libMesh github repository
http://libmesh.github.io
GNU Lesser General Public License v2.1

I seem to have found some problems with unique_id #2484

Closed ermore closed 4 years ago

ermore commented 4 years ago
// I understand that the elems returned by mesh.active_element_ptr_range()
// include both local elems and ghost elems. On each process, every active Elem should have a distinct unique_id.

When I compile the following code and run it with mpiexec -n 3 ./ex:


#include "libmesh/libmesh.h"
#include "libmesh/mesh.h"
#include "libmesh/mesh_generation.h"
#include "libmesh/mesh_refinement.h"
#include "libmesh/elem.h"

#include <iostream>
#include <iterator>
#include <set>

using namespace libMesh;

int main(int argc, char **argv)
{
    LibMeshInit init(argc, argv);
    Mesh mesh(init.comm(), 3);

    MeshRefinement mesh_refinement(mesh);

    const int K = 3;
    MeshTools::Generation::build_cube(mesh,
                                      4 * K,
                                      2 * K,
                                      2 * K,
                                      0., 2.,
                                      0., 1.,
                                      0., 1.,
                                      HEX8);

    // I understand that the elems returned by mesh.active_element_ptr_range()
    // include both local elems and ghost elems. On each process, every
    // active Elem should have a distinct unique_id.

    for (int adapt = 1; adapt < 4; ++adapt)
    {
        libMesh::out << "adaption:" << adapt << std::endl;
        for (auto &elem : mesh.active_local_element_ptr_range())
        {
            if (elem->processor_id() == 0)
            {
                elem->set_refinement_flag(Elem::REFINE);
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);
        mesh_refinement.refine_and_coarsen_elements();

        const processor_id_type processor_rank = mesh.comm().rank();
        std::set<unique_id_type> uniqueSet;

        const unsigned int active_elem_size =
          std::distance(mesh.active_elements_begin(), mesh.active_elements_end());

        for (const auto &elem : mesh.active_element_ptr_range())
        {
            uniqueSet.insert(elem->unique_id());
        }

        std::cout << "rank: " << processor_rank
                  << " uniqueSet size: " << uniqueSet.size()
                  << " ; active size: " << active_elem_size << std::endl;
    }

    return 0;
}

I get the following results:

adaption:1
adaption:1
adaption:1
rank: 0 uniqueSet size: 684 ; active size: 684
adaption:2
rank: 1 uniqueSet size: 621 ; active size: 621
adaption:2
rank: 2 uniqueSet size: 706 ; active size: 706
adaption:2
rank: 0 uniqueSet size: 2126 ; active size: 2126
adaption:3
rank: 1 uniqueSet size: 2140 ; active size: 2140
adaption:3
rank: 2 uniqueSet size: 2350 ; active size: 2350
adaption:3
rank: 2 uniqueSet size: 6416 ; active size: 6416
rank: 1 uniqueSet size: 7295 ; active size: 7295
rank: 0 uniqueSet size: 6892 ; active size: 6897

I find that the uniqueSet size on rank 0 no longer matches its active element count (6892 vs. 6897) once the number of adaptions is greater than 2. I think they should be the same; why are they different?

roystgnr commented 4 years ago

That's ... not an obvious question. Looking into it.

roystgnr commented 4 years ago

Ah, I see. Try running in debug or devel modes and this code dies with

Assertion `flags_were_consistent' failed.

Note what you're doing with those refinement flags: you set REFINE only if the element's pid is 0 *and* the element is local. So the same elements are marked for refinement on rank 0 but not on the other ranks, and the mesh goes out of sync.
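A minimal sketch of a consistent version of that loop (assuming the rest of the example above): iterate over *all* active elements, local and ghost, so every rank sets the same flags:

```
// Every rank sees local *and* ghost active elements here, so the
// REFINE flags end up identical on all processors.
for (auto &elem : mesh.active_element_ptr_range())
  if (elem->processor_id() == 0)
    elem->set_refinement_flag(Elem::REFINE);
```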

roystgnr commented 4 years ago

Hmm... but changing that loop to be over all active elements is only fixing things for me on ReplicatedMesh, not DistributedMesh. Let me see if I can figure out what's going on in that case; the modified code isn't throwing an assertion.

ermore commented 4 years ago

Yes, I use DistributedMesh for the calculation. When I use mesh.active_element_ptr_range() instead, the result is the same. I don't know what happened; I guess there is an error in elem->unique_id().

roystgnr commented 4 years ago

Okay, I think I've found the bug. It's DistributedMesh-specific and triggered by some parallel AMR corner cases, so I'm guessing the underlying problem is that the test coverage in MOOSE doesn't do enough large distributed adaptivity problems with unique_id-using components, while IIRC the test coverage in libMesh is even worse: it doesn't hit unique_id at all except to assert parallel consistency of individual DofObjects.

The bugfix looks easy but expanding our test coverage enough to make sure the fix is really working (and that there's not a second bug somewhere...) might take a little while.

roystgnr commented 4 years ago

When I use mesh.active_element_ptr_range(), the result is the same.

Well, it's usually the same, but there are corner cases; if you crank up K and/or n_processors enough, you'll see a bug again. Are you currently relying on DistributedMesh+unique_id? If so then I'll push a likely-bugfix right away so you can at least have a little more safety while I investigate deeper.

ermore commented 4 years ago

Yes, I do rely on DistributedMesh+unique_id. Please fix this bug. Thanks!

roystgnr commented 4 years ago

You'll probably want to use #2486; it should cherry-pick cleanly onto whatever version you're using currently.

It fixes all the tests I've thrown at it manually, but this bug is so embarrassing that I don't want to close this ticket until I've got more heavyweight tests running in CI to make sure there aren't any broken cases I'm still missing. Thanks so much for helping catch this; I'm quite happy that I didn't just see the bug in your code and stop looking at that point.

ermore commented 4 years ago

Yes, it seems to work. Thank you. If I find other problems with unique_id, I will continue to report them here.

roystgnr commented 4 years ago

I think #2491 should have fixed the last issues with unique_id in DistributedMesh.