libMesh / libmesh

libMesh github repository
http://libmesh.github.io
GNU Lesser General Public License v2.1

Regression in (or triggered by?) Gmsh changes #469

Closed: roystgnr closed this issue 9 years ago

roystgnr commented 9 years ago

Building with `METHODS=dbg --enable-complex`, I find that reduced_basis_ex7 succeeds when using cf8fe174f3dfccc2dda54007aa6ba6076366c507 but fails when using 5260d07dc8bd02935f5f4ad43c13994522bece42.

This is a "handle mixed-dimension meshes" commit from @jwpeterson, so there's a small chance this is relevant to #433. The actual failure is in MeshCommunication::broadcast(), so there's a large chance that the actual error is something in broadcast_packed_range() which was merely triggered by John's new code.

If anyone is familiar with Gmsh/.geo/.msh (is this your example, @dknez?), I'd love help scaling down horn.msh to ease debugging.

jwpeterson commented 9 years ago

Can you post the stack trace to this issue? Also, I forget, does one need a complex-enabled PETSc to run this example successfully in --enable-complex mode?

It's definitely possible there are bugs in the commit you mentioned, but I don't see how they would be related to Real/complex... the patch doesn't really do anything significant with Reals or Numbers.

jwpeterson commented 9 years ago

Also -- the referenced commit doesn't have anything to do with mixed-dimension libmesh meshes, only with handling BCs specified on lower-dimensional elements in gmsh files. The lower-dimensional elements themselves don't make it into the final Mesh.

roystgnr commented 9 years ago

The --enable-complex is required to run the whole example, but I can strip out the relevant ifdefs and get the same failure with little more than mesh.read("horn.msh");

The assertion failure is the verify(mesh.n_elem()) at mesh_communication.C:716, and the rest of the stack is just unstructured_mesh.C:769 and reduced_basis_ex7.C.

roystgnr commented 9 years ago

Running "gmsh -2 horn.geo" gives me a much smaller horn.msh file (albeit one with roughly the same number of elements?), but the new file exactly reproduces the original error, so I'm not too concerned about being unable to exactly reproduce the original mesh.

jwpeterson commented 9 years ago

> The assertion failure is the verify(mesh.n_elem())

Hmm... are you running in parallel? How many procs?

roystgnr commented 9 years ago

Originally 12, but I get the same failure with 2. 1 works, naturally. Checking things out in the debugger, the number of elements that proc 1 thinks it has is indeed only 90% or so of the number of elements proc 0 should have sent. It's as if some of the elements ended up with duplicate DofObject ids and so just overwrote each other on the receive end.

roystgnr commented 9 years ago

Oh, I'm running with --enable-parmesh, too. Haven't tried to duplicate with SerialMesh yet, although obviously I should...

jwpeterson commented 9 years ago

> Oh, I'm running with --enable-parmesh, too.

Haha, wow. OK, so... fails in parallel with --enable-complex --enable-parmesh. Sure you don't want to turn on infinite elements support too? ;-)

The gmsh reader definitely was not tested extensively in this configuration...

jwpeterson commented 9 years ago

I'm finishing building complex/SerialMesh now, will let you know what I find...

roystgnr commented 9 years ago

complex doesn't seem to matter (except that it hid the initial regression, since the other adjoints_ex3 complex regression was cropping up before we ever got to reduced_basis), but ParallelMesh does. Changing Mesh to SerialMesh fixes things for me, and you'll probably need to change Mesh to ParallelMesh to trigger the failure.

jwpeterson commented 9 years ago

Yes, I haven't reproduced the ParallelMesh error yet, but I can confirm that the code runs fine on 2 processors with SerialMesh in debug mode with --enable-complex.

So the reader is almost certainly doing something that doesn't work with ParallelMesh. My first thought was adding Points/Elems without IDs, but that doesn't seem to be the case. Is the reader being called only on processor 0 or on all processors?

roystgnr commented 9 years ago

For Gmsh we call the read on all processors, but under the hood that means that processor 0 does all the work and then broadcasts the complete mesh to the rest.

roystgnr commented 9 years ago

Removing both of the "Physical Line" declarations in horn.geo and regenerating seems to fix the error?

roystgnr commented 9 years ago

But with only Physical Line 1 and Physical Surface 1 (and their dependencies) I get the same error, this time with only 427 elements.

roystgnr commented 9 years ago

Now, how to tell Gmsh that this domain can be meshed with only three triangles and a line...

jwpeterson commented 9 years ago

IIRC, "Physical Line/Surface" are the things that add the lower-dimensional elements with boundary information... so if removing them makes the bug go away, the issue must be something with the BoundaryInfo object in ParallelMesh?
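For reference, Physical tags in a .geo file look like this (a hypothetical minimal geometry for illustration, not horn.geo):

```
// Minimal Gmsh .geo sketch: a unit square with one Physical Line
// (boundary id) and one Physical Surface (subdomain id).
Point(1) = {0, 0, 0, 1.0};
Point(2) = {1, 0, 0, 1.0};
Point(3) = {1, 1, 0, 1.0};
Point(4) = {0, 1, 0, 1.0};
Line(1) = {1, 2};
Line(2) = {2, 3};
Line(3) = {3, 4};
Line(4) = {4, 1};
Line Loop(5) = {1, 2, 3, 4};
Plane Surface(6) = {5};
Physical Line(1) = {1};      // writes 1D boundary elements tagged 1
Physical Surface(1) = {6};   // writes the 2D elements tagged 1
```

With the Physical Line present, the generated .msh contains both the 2D elements and the lower-dimensional 1D boundary elements, which is the mix being discussed here.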

roystgnr commented 9 years ago

Bumping up desired element size and getting rid of the mid-side node gets Gmsh down to "5 vertices 12 elements" which I assume means 4 triangles and 8 edges and a mid-box vertex.

I still get the failure here, and now we have processor 0 with 5 elements and processor 1 with 4 elements. It looks like the lower-dimensional edge might be getting added as more than just a boundary element?

jwpeterson commented 9 years ago

> It looks like the lower-dimensional edge might be getting added as more than just a boundary element?

They definitely do get added in the initial pass of the reader, then subsequently deleted. So perhaps the delete step is not working with ParallelMesh, and then only the high-dimensional elements get broadcast, resulting in the mismatch?

roystgnr commented 9 years ago

Hmm... I think I'm on the trail now. Thanks for the help; I'll try and get a PR ready after lunch.

roystgnr commented 9 years ago

Fixed by #471

dknez commented 9 years ago

Sorry, I was travelling and just saw this now. I did add this example, and it looks like you've already fixed the Gmsh issue here, thanks.

roystgnr commented 9 years ago

All fixed, yeah. And the bug turned out to be wholly my fault, so no worries about not getting sucked into the vortex of fixing it; thanks for adding enough test coverage that we noticed it.