idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

BreakMeshByBlock robustness #12033

Closed jiangwen84 closed 2 years ago

jiangwen84 commented 6 years ago

Rationale

There are two main issues with the existing BreakMeshByBlock.

Description

Resolve those issues in BreakMeshByBlock.

Impact

Improvements and bug fixes.

lindsayad commented 4 years ago

It seems that BreakMeshByBlock(Generator) suffers from a fair number of robustness issues. Based on @arovinelli's comment here it does not work with recover (at least when using InterfaceKernels?) nor with DistributedMesh. Additionally, according to the test suite it doesn't work in parallel in debug mode either. It would be nice if we could make this object a bit more robust such that it can work in all the traditional MOOSE scenarios.

arovinelli commented 4 years ago

@lindsayad making it work with distributed mesh should be doable, even if a bit cumbersome. For the restart, I believe the major problem is that the mesh connectivity is not saved there, and it might be difficult to reconstruct it when the mesh is already split (if we have large displacements between the two blocks it is going to be impossible). Any idea on how to solve this?

permcody commented 4 years ago

Closing this specific issue based on the merged PRs listed.

lindsayad commented 4 years ago

Err, I don't think this should be closed; #14216 specifically references this issue because it is still an issue.

arovinelli commented 4 years ago

@lindsayad and @permcody I need to run some big jobs on a cluster, and therefore I need a fix, at the least to enable the distributed mesh option. I guess that the problem here is to sync the different MPI processes so they add nodes and update elements simultaneously. Can you point me to anything doing something similar, to give me an example to follow and simplify my work? Of course, if I can fix some issues, this can be merged. Thanks again.

permcody commented 4 years ago

@arovinelli - You can look around at the other generators and see which ones have been updated. It can be pretty complex.

Here's a better idea. Instead of trying to get it working with DistributedMesh, I recommend that you just use the "split mesh" capability in MOOSE. The short explanation is that you run a job with a few extra CLI options that will just run the mesh steps (reading it in, generating it, applying transformations, etc.) and will write it back out as separate files ready for a larger distributed run. The nice part is that when you do this, the memory used by your larger run and the startup time will be drastically reduced. Take a look here:

https://www.mooseframework.org/syntax/Mesh/splitting.html

arovinelli commented 4 years ago

@permcody thanks for the reply, this is good to know. One question: will the file written by the mesh split contain the face-face neighbor information, or will it be reconstructed later? If it is reconstructed later, then I need a way to reassign the correct neighbors.

permcody commented 4 years ago

face-face neighbor information? I assume you mean the new internal boundaries you are trying to add?

The splitter system is designed to run all mesh setup tasks including running RelationshipManagers. So if you have Generators that add boundaries that information will be saved. If you have objects that require a "wider stencil", those ghost elements should be written. Ideally everything you need should be preserved.

arovinelli commented 4 years ago

@permcody by face-face neighbor information I mean the elem->face->neighbor_elem information. BreakMeshByBlockGenerator adds nodes such that neighboring elements on two different blocks no longer share the same nodes. So the mesh is in fact broken, but the elem->face->neighbor_elem information is still there. If the elem->face->neighbor_elem info is not saved by the mesh split, I need to reconstruct it, because otherwise the interface will not work anymore.

arovinelli commented 4 years ago

@permcody and @lindsayad so I tried to use the split mesh tool, however when trying to use it things break; this is the output:

*** Error in `../../../../tensor_mechanics-opt': malloc(): memory corruption (fast): 0x0000000000f35f10 ***
*** Error in `../../../../tensor_mechanics-opt': malloc(): memory corruption (fast): 0x000000000228aff0 ***
*** Error in `../../../../tensor_mechanics-opt': malloc(): memory corruption (fast): 0x00000000024d4270 ***
permcody commented 4 years ago

This might mean that you don't have enough ghosting saved in the split mesh file. It's important that each object in the simulation advertises its individual needs as far as accessing nodes, elements, dof indices, etc. outside of its partition. Does this sim run with Distributed Mesh? That's worth checking before continuing down the Split Mesh path.

arovinelli commented 4 years ago

@permcody it will not, as shown here:

https://github.com/idaholab/moose/blob/5cb1f32e014b750c4815b85a41310b3d80852929/framework/src/meshgenerators/BreakMeshByBlockGenerator.C#L34-L39

permcody commented 4 years ago

OK to be clear, Split Mesh is always for Distributed Mesh runs. We could probably improve error messaging along those lines. This Generator will need to be updated to work for DistributedMesh AND include appropriate RelationshipManager additions before you can expect it to work with split mesh.

arovinelli commented 4 years ago

@permcody thanks for the reply. I believe it would be much easier to go the other way around: start from a disconnected mesh and join coincident faces (e.g. the ones having nodes sharing the same locations). I'll work on that.

permcody commented 4 years ago

I believe it would be much easier to go the other way around: start from a disconnected mesh and join coincident faces (e.g. the ones having nodes sharing the same locations).

It shouldn't be more difficult one way or the other. When working with a DistributedMesh in the MeshGenerator system, you'll have access to the full mesh on each processor (well in almost all cases). The way you deal with DistributedMesh is that you inspect the "processor_id" on each element to know whether the current processor owns it or not.
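
A minimal sketch of that check, assuming a libMesh::MeshBase reference named mesh (illustrative only, not code from the generator):

#include "libmesh/mesh_base.h"
#include "libmesh/elem.h"

void visit_owned_elements(libMesh::MeshBase & mesh)
{
  for (auto & elem : mesh.element_ptr_range())
    if (elem->processor_id() == mesh.processor_id())
    {
      // This rank owns elem; other ranks skip it here even though,
      // before remote elements are deleted, they can still see it.
    }
}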

arovinelli commented 4 years ago

OK, I didn't know that. I have a couple of questions:

  1. When adding a new node, which processor should be in charge of adding it? The processor owning the element to which the node will belong?
  2. Also, once a node has been added, how will I broadcast this information to all the processors?

permcody commented 4 years ago

During the generation phase you want the mesh to be consistent on all processors. All processors will have to add the same Node with the same IDs. If anything gets out of sync, you'll run into problems. I should say that it is possible to generate a DistributedMesh where each processor truly works with only their portion of the mesh, but that's very advanced and we aren't doing that in more than maybe one place in MOOSE.

If you need to perform parallel communication you may do so. However, hopefully since you are doing the same work on all processors you won't need to worry about this step.

arovinelli commented 4 years ago

If all processors have access to the whole mesh and all processors must do the same thing, why should I be worried about whether a node and/or element is remote or local? Are there any caveats for remote nodes and elements?

permcody commented 4 years ago

If all processors have access to the whole mesh and all processors must do the same thing, why should I be worried about whether a node and/or element is remote or local? Are there any caveats for remote nodes and elements?

I'll admit that I'm not sure we've fleshed out every single possible case when creating a tree of generators. We are still polishing up this system and trying to create documentation and guidance for everyone to use. My concern is that you might generate a sequence of generators A -> B -> C. It might turn out that you need to call the "prepare_for_use()" method on the Mesh between the stages of the generator because you need to call certain "exploration methods" like retrieving neighboring elements of elements, etc. I'm not an expert at what you can and can't do but many methods aren't available for use until you've done an intermediate prepare. However, with DistributedMesh once you do that, I believe (again need to test and verify) that libMesh will go ahead and throw away remote elements.

Now you get into one of the later stages and you no longer have all of the elements available on all processors. Now if you plan to add new elements/nodes everything gets significantly more complicated. You no longer need to do the same work on all ranks, but you likely can't just do the work on one rank either. An example would be where you were going to add new elements on a free surface right on a processor boundary. Multiple processors would need to work together to add a new element due to ghosting.

For now, we've kind of glossed over some of these really nasty (literal) edge cases and just considered the more normal case of having the full mesh available on all ranks during the generation phase. That'll satisfy nearly every case and, if you are willing to work with Split Mesh, perhaps 100% of cases. Yeah, we are figuring this all out as we go, too. It's all pretty new and you are once again working on the bleeding edge.

For now, I would recommend that you assume you have all information on all ranks and just do the same work everywhere. In the future we will continue to think about how to make this easier for our developers and give them the tools they need to deal with truly Distributed cases.

lindsayad commented 4 years ago

However, with DistributedMesh once you do that, I believe (again need to test and verify) that libMesh will go ahead and throw away remote elements.

By default it will, but there is a method to stop this from happening: MeshBase::allow_remote_element_removal
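
A minimal sketch of how that would be used, assuming a MeshBase reference named mesh (just an illustration of the call, not a decision about where MOOSE should make it):

#include "libmesh/mesh_base.h"

void prepare_without_dropping_ghosts(libMesh::MeshBase & mesh)
{
  // Keep remote elements around through an intermediate prepare so that
  // later generator stages can still see the whole mesh.
  mesh.allow_remote_element_removal(false);
  mesh.prepare_for_use();
}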

permcody commented 4 years ago

By default it will, but there is a method to stop this from happening: MeshBase::allow_remote_element_removal

True, but is this something we want to always turn on during MeshGeneration? Perhaps, but maybe not. Design decisions that just haven't been made.

arovinelli commented 4 years ago

@permcody @lindsayad so I'm trying to figure out what's going on in this generator with distributed mesh.

I compiled in debug, used valgrind, and ran with distributed mesh (after commenting out the mooseError at the top), and the following line is the cause of a memory leak:

https://github.com/idaholab/moose/blob/5cb1f32e014b750c4815b85a41310b3d80852929/framework/src/meshgenerators/BreakMeshByBlockGenerator.C#L100

any idea how to fix this?

lindsayad commented 4 years ago

True, but is this something we want to always turn on during MeshGeneration? Perhaps, but maybe not. Design decisions that just haven't been made.

It seems like you should only delete remote elements once the mesh has been fully generated. That's certainly the safest. If I'm doing contact, for example, with the current design there is no way for me to properly generate a mortar interface (with the proper ghosting) unless I've pre-built the lower dimensional blocks in my file mesh. This is because if I try to add lower dimensional blocks via LowerDBlockFromSidesetGenerator, I'll be unable to generate the proper ghosting between slave and master lower dimensional blocks, because more than likely the parents were deleted during mesh->read within the FileMeshGenerator. (Yes, UnstructuredMesh::read calls prepare_for_use)

lindsayad commented 4 years ago

any idea how to fix this?

Did you add this new node to the mesh? You've performed a dynamic memory allocation here. Somebody has to manage this new resource and make sure the dynamic memory that new_node points to gets deleted.

arovinelli commented 4 years ago

Yes, it is added to the mesh:

https://github.com/idaholab/moose/blob/5cb1f32e014b750c4815b85a41310b3d80852929/framework/src/meshgenerators/BreakMeshByBlockGenerator.C#L100-L102

actually, even line 102 generates a leak, most likely because of line 100

permcody commented 4 years ago

libMesh is still working on updating its interfaces to C++11 standards. Allocating a new node and adding it to the mesh is the way it's been done since libMesh was created. Now in this intermediate stage, it looks very weird since there's a mix of smart and non-smart pointer interfaces. The creation of a new Node through a smart pointer requires that we release the manager before adding it to the old (manually managed) interface. As far as I can tell, this is all correct. libMesh expects you to pass it a Node (or Element) through a pointer that it will then manage.
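
As a minimal sketch of that hand-off (illustrative names, not the generator's actual code):

#include <memory>
#include "libmesh/mesh_base.h"
#include "libmesh/node.h"

using namespace libMesh;

Node * duplicate_node(MeshBase & mesh, const Node & original, dof_id_type id)
{
  // Node::build() gives us a std::unique_ptr<Node>...
  std::unique_ptr<Node> built = Node::build(original, id);
  // ...but the older add_node() interface takes a raw pointer and assumes
  // ownership, so we release() before handing it over. The mesh will delete
  // the Node; nobody should delete it manually afterwards.
  Node * raw = built.release();
  return mesh.add_node(raw);
}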

lindsayad commented 4 years ago

Yea, I'm gonna be honest: I don't see anything that you're doing wrong here. Even if the new node's proc_id doesn't match the current rank, DistributedMesh should be properly deleting it during delete_remote_elements. Maybe @roystgnr has some ideas

YaqiWang commented 4 years ago

I actually do not quite like the fact of libMesh having DistributedMesh and ReplicatedMesh. A simple flag indicating if the mesh is distributed should work, IMO. But this is a design choice made long ago; I guess we have no way to change it. We could also separate the remote element removal out from prepare_for_use and call it distribute. That might be helpful. BTW, Rattlesnake has a similar modifier for breaking meshes. I am wondering if that one has a similar issue or not. We used that one for our multi-scheme transport calculations.

arovinelli commented 4 years ago

Some other insights

From the opt test with --distributed-mesh and `min_parallel = 3` I receive the following:

3d_auto_test_distributed: Writing Exodus 
3d_auto_test_distributed: Memory leak detected!
3d_auto_test_distributed: Compile in DEBUG mode with --enable-reference-counting
3d_auto_test_distributed: for more information

and the resulting exodus is actually missing nodes

3d_auto_test_distributed: ERROR: 
3d_auto_test_distributed:    *****************************************************************
3d_auto_test_distributed:               EXODIFF (Version: 2.90) Modified: 2018-02-15
3d_auto_test_distributed:               Authors:  Richard Drake, rrdrake@sandia.gov           
3d_auto_test_distributed:                         Greg Sjaardema, gdsjaar@sandia.gov          
3d_auto_test_distributed:               Run on    2019/11/20   09:52:07 CST
3d_auto_test_distributed:    *****************************************************************
3d_auto_test_distributed: 
3d_auto_test_distributed: Reading first file ... 
3d_auto_test_distributed: Reading second file ... 
3d_auto_test_distributed:   FILE 1: sh_by_block_generator/gold/break_mesh_3D_auto_in.e
3d_auto_test_distributed:    Title: break_mesh_3D_auto_in.e
3d_auto_test_distributed:           Dim = 3, Blocks = 3, Nodes = 42, Elements = 8, Nodesets = 6, 
3d_auto_test_distributed:           Vars: Global = 0, Nodal = 0, Element = 0, Nodeset = 0, Sideset = 
3d_auto_test_distributed: 
3d_auto_test_distributed:   FILE 2: sh_by_block_generator/break_mesh_3D_auto_in.e
3d_auto_test_distributed:    Title: break_mesh_3D_auto_in.e
3d_auto_test_distributed:           Dim = 3, Blocks = 3, Nodes = 40, Elements = 8, Nodesets = 6, 
3d_auto_test_distributed:           Vars: Global = 0, Nodal = 0, Element = 0, Nodeset = 0, Sideset = 
3d_auto_test_distributed: 
3d_auto_test_distributed: exodiff: ERROR: .. Number of nodes doesn't agree.
3d_auto_test_distributed: 
3d_auto_test_distributed: exodiff: Files are different
3d_auto_test_distributed: 
3d_auto_test_distributed:  
3d_auto_test_distributed: ####
3d_auto_test_distributed: Tester failed, reason: EXODIFF
3d_auto_test_distributed: 

From the opt test with --distributed-mesh and `min_parallel = 2`, MOOSE crashes.

permcody commented 4 years ago

I actually do not quite like the fact of libMesh having DistributedMesh and ReplicatedMesh.

It's true that you could use the algorithms in DistributedMesh in a replicated mode, but they are heavier and slower, require additional communication, etc. ReplicatedMesh serves most people very well and isn't likely to go anywhere anytime soon. Everything about the internals is simpler and more efficient for smaller problems. These two objects serve very different needs and simulation types.

roystgnr commented 4 years ago

It's true that you could use the algorithms in DistributedMesh in a replicated mode, but they are heavier and slower, require additional communication, etc. ReplicatedMesh serves most people very well and isn't likely to go anywhere anytime soon. Everything about the internals is simpler and more efficient for smaller problems. These two objects serve very different needs and simulation types.

I was about to type this, except with more conflict-of-interest. :-) We could go through DistributedMesh bit by bit and make sure it always uses the replicated algorithms when it's in a replicated state, but that would still have more overhead from the underlying containers. It would also make replicated meshes less useful as "training wheels", if changing a flag here or there could suddenly drop you into a full distributed state.

Case in point. When you do new_node = Node::build(*current_node, mesh->n_nodes()).release(), mesh->n_nodes() is very unlikely to be the id number you want. To account for simultaneous asynchronous additions of both partitioned and unpartitioned nodes, DistributedMesh typically assigns ids in a sparse fashion (renumbering in prepare_for_use() to compress out the gaps), and if you're picking out your own ids then you need to either use that (e.g. leaving the id() unset and adding unpartitioned nodes in sync then partitioned nodes only on their processor) or at least avoid conflict with that (which can be as easy as storing an up-to-date max_node_id() before your modifications and using that as an offset, but the details depend on exactly what you're doing).

lindsayad commented 4 years ago

Case in point. When you do new_node = Node::build(*current_node, mesh->n_nodes()).release(), mesh->n_nodes() is very unlikely to be the id number you want.

@roystgnr Are you suggesting that this could be the cause of the leak?

roystgnr commented 4 years ago

I'd guess so, but I still can't figure out how. Most of the obvious ways that that could cause a leak are covered with libmesh_assert() - you can't just try to add_node on top of a different existing node, for example. I assume someone's verified the leak in dbg or devel mode already?

arovinelli commented 4 years ago

All, I'm playing with what Roger suggested (at the moment I just added a shift of 50 and added a counter) and the tests pass in both standard and debug mode with the distributed option and different MPI ranks, with no leak detected. The question now is: do I really need to have the same id for the new nodes on each cpu or not?

lindsayad commented 4 years ago

Roger

You mean Roy?

The question now is do I really need to have the same id for the new nodes on each cpu or not??

If I understand correctly, your code should either be:

if (current_node->processor_id() == mesh->processor_id())
{
  // We assign the processor_id so we should only add for this rank
  new_node = Node::build(*current_node, DofObject::invalid_id).release(); 
  new_node->processor_id() = current_node->processor_id();
  mesh->add_node(new_node);
}

or

  // Uses the Point and ID constructor. Let the DistributedMesh figure
  // out the ID and the partition for you
  new_node = Node::build(*current_node, DofObject::invalid_id).release(); 
  mesh->add_node(new_node);

@roystgnr can hopefully correct me if I'm wrong...

roystgnr commented 4 years ago

do I really need to have the same id for the new nodes on each cpu or not?

The documentation says "Only [manually set id] in parallel if you are manually keeping ids consistent", and I'd love to stop at that, but of course it's not actually possible during distributed mesh refinement, so there are cheats possible. What I do for AMR is apply temporary ids first and later use MeshCommunication::make_node_ids_parallel_consistent(), which syncs the "canonical" ids on the owning processors to all the ghost nodes, identifying equivalent nodes by a same-processor-id element connected to them. But that method relies on sync_node_data_by_element_id(), which is expensive enough that you'd want to avoid it if you can.
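
For reference, a minimal sketch of that sync call (assuming the MeshBase-only overload, and that temporary ids and owner ranks have already been set on the new nodes):

#include "libmesh/mesh_base.h"
#include "libmesh/mesh_communication.h"

void sync_new_node_ids(libMesh::MeshBase & mesh)
{
  // Owning processors push their "canonical" node ids out to the ghosted
  // copies on other ranks; equivalent nodes are matched through elements.
  libMesh::MeshCommunication mc;
  mc.make_node_ids_parallel_consistent(mesh);
}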

Sadly I don't think either of the methods @lindsayad pointed out will work in general. In the first code, each processor will still be missing a ghosted copy of other processor's nodes. In the second, you have to add unpartitioned nodes in sync (the same way you'd have to on a ReplicatedMesh) or their ids won't end up in sync either.

The ideal solution is to manually set nodes using some "canonical" ordering that doesn't require communication. For instance, I recall writing mesh extrusion code to simply increment nodes' ids by max_node_id for each layer you extrude through, thereby making the new ids unique and independent of what processor sets them. (although I can't recall, what app was that code in?) In MOOSE the MeshCollectionGenerator works this way: by using max_node_id as an offset to copy_nodes_and_elements it guarantees canonical and unique ids there. I'll see if I can find other examples.

arovinelli commented 4 years ago

@roystgnr and @lindsayad thanks for the insight and all the replies. The only way I can see this working is to follow a synchronous process involving MPI communication, which would read as follows:

  1. Each rank should identify all the nodes that need to be replicated and construct a map
  2. One rank should grab all the information about the new nodes from all the other ranks, and create canonical ids for the new nodes
  3. Redistribute the entire information to all the ranks
  4. Each rank adds the nodes with the agreed ids.

or

  1. Use the same algorithm while keeping track of the added nodes and making sure they are unique
  2. Join the maps
  3. Renumber the ids

roystgnr commented 4 years ago

Hmm... you know max_node_id. And n_subdomains-1 should be the most copies you need of any node, so if max_node_id*n_subdomains fits in a dof_id_type then you shouldn't need any communication at this stage. Can you assign

new_node_id = (connected_subdomain_id+1) * max_node_id + old_node_id

And then wait for the libMesh prepare_for_use to give you a dense numbering?
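
A minimal sketch of what that could look like (hypothetical helper names; max_node_id would be mesh.max_node_id() saved before any nodes are added):

#include "libmesh/mesh_base.h"
#include "libmesh/node.h"

using namespace libMesh;

Node * add_split_node(MeshBase & mesh,
                      const Node & old_node,
                      subdomain_id_type connected_subdomain_id,
                      dof_id_type max_node_id)
{
  // Same arithmetic on every rank gives the same id everywhere, with no
  // communication; prepare_for_use() later compresses the sparse numbering.
  const dof_id_type new_id =
    (static_cast<dof_id_type>(connected_subdomain_id) + 1) * max_node_id + old_node.id();

  Node * new_node = Node::build(old_node, new_id).release();
  new_node->processor_id() = old_node.processor_id();
  return mesh.add_node(new_node);
}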

arovinelli commented 4 years ago

@roystgnr thanks for the idea, I can try that, and yes, I can wait for prepare_for_use to do its job. A couple of questions:

  1. Where should I put prepare_for_use()?
  2. By max_node_id you mean the total number of nodes in the entire mesh, I guess?
  3. What is the maximum for dof_id_type?

roystgnr commented 4 years ago

  1. It's been too long since I edited any MOOSE mesh generator code, but IIRC there's already a prepare_for_use() down the line and you won't have to add it again.
  2. Nope. It's a separate MeshBase method. If you have sparse numbering for any reason then it'll exceed the number of nodes in the mesh.
  3. By default 4 bytes unsigned, so 4 billion ish. MOOSE works with 8 byte dof_id_type too, but I'm not sure how many of your users are using 4 vs 8.

arovinelli commented 4 years ago

@roystgnr unfortunately I'm not able to find what is wrong with my changes. You can use my branch for detailed error info. Two things happen:

  1. All tests in debug fail.
  2. When using --distributed-mesh, results depend on the number of CPUs; there are random crashes depending on the number of CPUs used (the test break_mesh_by_blocks_generator_3d_auto_test_distributed works with 3 CPUs and crashes with two).

Typical error messages are:

Assertion `!n->valid_id() || n->id() == _nodes.size()' failed.
2d_polycrystal_test: 
2d_polycrystal_test: [0] /home/arovinelli/projects/NEW_stuff/moose/scripts/../libmesh/src/mesh/replicated_mesh.C, line 476, compiled Oct 22 2019 at 13:10:38
2d_polycrystal_test: Assertion `!n->valid_id() || n->id() == _nodes.size()' failed.

and

2d_splittrue_test: Assertion `this->comm().semiverify(obj ? &uniqueid : nullptr)' failed.
2d_splittrue_test: 
2d_splittrue_test: Assertion `this->comm().semiverify(obj ? &uniqueid : nullptr)' failed.
2d_splittrue_test: Assertion `this->comm().semiverify(obj ? &uniqueid : nullptr)' failed.
2d_splittrue_test: 
2d_splittrue_test: 
2d_splittrue_test: [2] buted_mesh.C, line 956, compiled Oct 22 2019 at 13:10:36
2d_splittrue_test: [0] buted_mesh.C, line 956, compiled Oct 22 2019 at 13:10:36
2d_splittrue_test: [1] buted_mesh.C, line 956, compiled Oct 22 2019 at 13:10:36
roystgnr commented 4 years ago

That 2d_polycrystal_test failure looks like you're trying to do this id cleverness on ReplicatedMesh as well, which unfortunately won't work - on ReplicatedMesh the id isn't just a mapping key, it's actually an index into a vector of pointers. But there the fix is easy: just leave the id as invalid_id and as long as you're adding objects in sync you're fine.
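
A minimal sketch of that ReplicatedMesh branch (hypothetical helper, not the actual fix):

#include "libmesh/dof_object.h"
#include "libmesh/mesh_base.h"
#include "libmesh/node.h"

using namespace libMesh;

Node * add_split_node_replicated(MeshBase & mesh, const Node & old_node)
{
  // Leave the id invalid: ReplicatedMesh will use the next slot in its
  // vector of node pointers. This stays consistent across ranks as long as
  // every rank adds the same nodes in the same order.
  Node * new_node = Node::build(old_node, DofObject::invalid_id).release();
  new_node->processor_id() = old_node.processor_id();
  return mesh.add_node(new_node);
}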

The 2d_splittrue_test - I'm not sure without more context, but it looks like the new nodes' unique_id() values aren't in sync and our dbg test is complaining. We might have to actually manually sync those before the prepare_for_use(), now that I think about it. Let me see if I can replicate.

roystgnr commented 4 years ago

Yeah, I can replicate the problem, and I think it can be avoided. Let me try a possible fix.

roystgnr commented 4 years ago

Well, I got a fix for that particular test at least. Try hammering my fork of your branch and see if it works more generally?

arovinelli commented 4 years ago

@roystgnr Thanks for spending time on this.

unfortunately the solution is not general:

  1. opt and replicated: everything is OK

  2. debug and replicated: I have

    2d_splittrue_test [min_cpus...] OK
    3d_auto_test_distributed [skipped dependenc...] SKIP
    3d_split_test .... FAILED (ERRMSG)
    2d_polycrystal_test [min_cpus=3,OV...] FAILED (ERRMSG)
    3d_polycrystal_test [min_cpus=3,OV...] FAILED (ERRMSG)
    3d_auto_test ..... FAILED (ERRMSG)
    2d_auto_test ..... FAILED (ERRMSG)
  3. distributed -- opt: the large tests fail (3d_polycrystal_test and 2d_polycrystal_test), hitting the same assertion

    3d_polycrystal_test: [ 2] ***ASSERTION failed on line 243 of file /home/arovinelli/projects/petsc-3.10.5/arch-linux2-c-opt/externalpackages/git.parmetis/libparmetis/match.c:k >= firstvtx && k < lastvtx
    3d_polycrystal_test: [ 2] 759 1138 -220624857 0 1
  4. distributed -- dbg: the large tests fail in the same way as opt distributed.

roystgnr commented 4 years ago

Ah, I see where I screwed up the replicated code path. Easy fix, I just misunderstood one of our old APIs.

Not sure yet about the distributed code path, but I can at least reproduce the failures, let me look closer.

roystgnr commented 4 years ago

Okay, I at least see what the problem is; a side boundary id is getting generated differently on different processors.

roystgnr commented 4 years ago

Yup. The invalid boundary_ids correspond to the same name, but they got assigned different boundary_id_type values.

roystgnr commented 4 years ago

Yeah, here we are:

BreakMeshByBlockGeneratorBase::findBoundaryNameAndInd
...
  // TODO need to be updated if distributed mesh is implemented
  // comments are left to ease implementation

Hope this gets you pointed in the right direction? I'll push the correction to my branch that gets the id+unique_id problems solved, but I'm not going to have time to look at this boundary_id issue more today.

arovinelli commented 4 years ago

@roystgnr thank you so much for the help, I think I can handle it from here. By the way, I completely forgot about that comment that I wrote in the original mesh modifier. I'll take it from here. Thanks a lot again for the help and for the time spent on this!!
