idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

CUBIT command causes MPI abort at MooseArray.h #4852

Closed mangerij closed 9 years ago

mangerij commented 9 years ago

Make any multi-block mesh in CUBIT, mesh it with tets, and then use 'refine volume # depth #' once. Save the mesh and attempt to run it with more than 8 MPI processes, and you will get a segfault (signal 11) that points to this:

Assertion `i < _size' failed
Access out of bounds in MooseArray (i: 0 size: 0)
at /home/john/projects/moose/framework/include/utils/MooseArray.h, line 289
[12] /home/john/projects/moose/framework/include/utils/MooseArray.h, line 289, compiled Mar 24 2015 at 13:52:44

Interestingly enough, this mesh can be computed on with a single MPI process. I'm not entirely sure how to upload the mesh file in question, but I'm essentially using the following simple .jou commands:

set node constraint on
create brick width 100 height 100 length 100
create sphere radius 15
subtract volume 2 from volume 1
create sphere radius 15
merge surface all with surface all
compress all
volume 1 size 15
volume 1 scheme tetmesh
mesh volume 1
volume 2 size 15
volume 2 scheme Tetmesh
mesh volume 2
refine volume 2 depth 2
sideset 1 surface 1
sideset 2 surface 2
sideset 3 surface 3
sideset 4 surface 4
sideset 5 surface 5
sideset 6 surface 6
sideset 7 surface 7
block 1 volume 1
block 2 volume 2
block all element type tetra4
set large exodus file off
export Genesis "./sphere_fine_medium_coarse_exodus.e" dimension 3 block all overwrite

Note that removing the line 'refine volume 2 depth 2' allows this mesh file to be run with 16 MPI processes. 'heal analyze' also shows 100% quality for the mesh. No negative Jacobians, etc.

jwpeterson commented 9 years ago

What is your input file?

mangerij commented 9 years ago

This should work on our master branch here: https://bitbucket.org/mesoscience/ferret

The kernels on my working branch are commented out here, but here is the input file:

[Mesh]
  file = sphere_fine_medium_coarse_exodus.e
  #uniform_refine=1
[]
[Variables]
  [./polar_x]
    order = FIRST
    family = LAGRANGE
    block='2'
  [../]
  [./polar_y]
    order = FIRST
    family = LAGRANGE
    block='2'
  [../]
  [./polar_z]
    order = FIRST
    family = LAGRANGE
    block='2'
  [../]
  [./potential_int]
    order=FIRST
    family = LAGRANGE
  [../]
  [./potential_ext]
    order=FIRST
    family = LAGRANGE
  [../]
[]

[Kernels]
  [./polar_electric_E]
     type=PolarElectricEStrong
     variable=potential_int
     block='2'
    # permittivity = 8.85*e-12
     permittivity = 1
     polar_x = polar_x
     polar_y = polar_y
     polar_z = polar_z
     #implicit=false
  [../]
  [./diffusion_E]
     type=Electrostatics
     #permittivity = 3*8.85*e-12
     permittivity = 4
     variable=potential_int
     block='1 2'
  [../]
  [./diffusion_E_Ext]
     type=Electrostatics
     #type=Diffusion
  #   permittivity = 8.85*e-12
     permittivity = 1
     variable=potential_ext
     block='1 2'
  [../]
  [./polar_electric_px]
     type=PolarElectricPStrong
     variable=polar_x
     potential_ext = potential_ext
     potential_int = potential_int
     component=0
     #implicit=false
  [../]
  [./polar_electric_py]
     type=PolarElectricPStrong
     variable=polar_y
     potential_ext = potential_ext
     potential_int = potential_int
     component=1
     #implicit=false
  [../]
  [./polar_electric_pz]
     type=PolarElectricPStrong
     variable=polar_z
     potential_ext = potential_ext
     potential_int = potential_int
     component=2
     #implicit=false
  [../]
  [./polar_x_time]
     type=TimeDerivative
     variable=polar_x
  [../]
  [./polar_y_time]
     type=TimeDerivative
     variable=polar_y
  [../]
  [./polar_z_time]
     type=TimeDerivative
     variable=polar_z
  [../]
[]

[BCs]
  [./potential_ext_1]
    type = NeumannBC
    variable = potential_ext
    boundary = '1'
    value = 1.0
  [../]
  [./potential_ext_2]
    type = NeumannBC
    variable = potential_ext
    boundary = '2'
    value = -1.0
  [../]
  [./potential_ext_3]
    type = NeumannBC
    variable = potential_ext
    boundary = '3'
    value = 0.0
  [../]
  [./potential_ext_4]
    type = NeumannBC
    variable = potential_ext
    boundary = '4'
    value = 0.0
  [../]
   [./potential_ext_5]
    type = NeumannBC
    variable = potential_ext
    boundary = '5'
    value = 0.0
  [../]
  [./potential_ext_6]
    type = NeumannBC
    variable = potential_ext
    boundary = '6'
    value = 0.0
  [../]
  [./potential_int_1]
    type = NeumannBC
    variable = potential_int
    boundary = '1'
    value = 0
  [../]
  [./potential_int_2]
    type = NeumannBC
    variable = potential_int
    boundary = '2'
    value = 0
  [../]
  [./potential_int_3]
    type = NeumannBC
    variable = potential_int
    boundary = '3'
    value = 0
  [../]
  [./potential_int_4]
    type = NeumannBC
    variable = potential_int
    boundary = '4'
    value = 0
  [../]
  [./potential_int_5]
    type = NeumannBC
    variable = potential_int
    boundary = '5'
    value = 0
  [../]
  [./potential_int_6]
    type = NeumannBC
    variable = potential_int
    boundary = '6'
    value = 0
  [../]
[]

[ICs]
  active='polar_x_constic polar_y_constic polar_z_constic'
  [./polar_x_constic]
     type=ConstantIC
     variable=polar_x
     block = '2'
     value=0.3
  [../]
  [./polar_y_constic]
     type=ConstantIC
     variable=polar_y
     block = '2'
     value=0.3
  [../]
  [./polar_z_constic]
     type=ConstantIC
     variable=polar_z
     block = '2'
     value=0.3
  [../]
[]

[Preconditioning]
   [./smp]
     type=SMP
     full=true   #to use every off diagonal block
     pc_side=left
   [../]
[]

[Executioner]
  #type = Steady
  type=Transient
  solve_type=newton
  scheme=implicit-euler     #"implicit-euler, explicit-euler, crank-nicolson, bdf2, rk-2"
  dt=1e0
 # nl_max_its=30
 # l_max_its=10000
  num_steps=120
  #petsc_options="-snes_monitor -snes_converged_reason -ksp_monitor -ksp_converged_reason"
 # petsc_options='-snes_monitor -snes_converged_reason -ksp_monitor -ksp_converged_reason'
  petsc_options='-ksp_monitor_true_residual -snes_monitor -snes_view -snes_converged_reason -snes_linesearch_monitor -options_left'
  petsc_options_iname='-gmres_restart -ksp_type  -pc_type -snes_linesearch_type -pc_factor_zeropivot'
  petsc_options_value='1000               gmres     jacobi       basic                1e-50'
  #petsc_options_iname='-snes_rtol'
  #petsc_options_value='1e-16'
[]

[Outputs]
  file_base = outlin_die_sph_strong_implic_dt0_n80_er4_E0-1
  output_initial = true
  print_linear_residuals = true
  print_perf_log = true
  [./out]
    type = Exodus
    elemental_as_nodal = true
    #output_nodal = true
  [../]
[]

permcody commented 9 years ago

Holy Dooley!

We need to narrow this down: is the problem with MOOSE, or with your application? Have you tried running this mesh with just a Diffusion kernel and Dirichlet boundary conditions? That'll tell us whether it's a mesh-related problem. Then we can start adding more variables and models to find the problem. Even with a stack trace we may not find the smoking gun.
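
A minimal diffusion-only input along those lines might look like this (just a sketch: the mesh file and sideset IDs come from the journal above, while the variable name and solver settings are placeholders):

[Mesh]
  file = sphere_fine_medium_coarse_exodus.e
[]

[Variables]
  [./u]
    order = FIRST
    family = LAGRANGE
  [../]
[]

[Kernels]
  [./diff]
    type = Diffusion
    variable = u
  [../]
[]

[BCs]
  [./bottom]
    type = DirichletBC
    variable = u
    boundary = '1'
    value = 0
  [../]
  [./top]
    type = DirichletBC
    variable = u
    boundary = '2'
    value = 1
  [../]
[]

[Executioner]
  type = Steady
  solve_type = PJFNK
[]

[Outputs]
  exodus = true
[]

If that runs cleanly on 16 processes, the mesh itself is probably fine and the Ferret kernels are the place to start bisecting.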

Cody

mangerij commented 9 years ago

Good call, Cody.

Seems like it isn't a mesh-related issue: using this mesh file in the diffusion example worked for 1 and 16 processes.

I'm not sure why I'm not getting a backtrace from gdb or valgrind, however; it just points to this error in MooseArray.h.

permcody commented 9 years ago

MooseArray is aborting. Try adding a breakpoint at MPI_Abort and then running. You should be able to get a stack trace.
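
Since the crash only shows up with more than 8 ranks, a serial gdb session may never reproduce it. One way to set that breakpoint in parallel (a sketch, assuming an X display with xterm is available; adjust the executable name, input file, and rank count) is to launch one gdb per rank:

$ mpiexec -np 16 xterm -e gdb -ex 'break MPI_Abort' -ex run --args ./ferret-dbg -i input_file.i

Whichever rank trips the assertion will stop at the breakpoint in its own xterm; type bt there to get the stack.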

Cody

mangerij commented 9 years ago

Yeah, I'm not sure what is going on. I tried that and it just says 'No stack.'

permcody commented 9 years ago

Well then you didn't hit the right breakpoint just yet. Try breaking on the error in MooseArray so you can halt the execution before it exits.

mangerij commented 9 years ago

I tried:

break Assertion `i < _size' failed Access out of bounds in MooseArray (i: 0 size: 0)
break Access out of bounds in MooseArray (i: 0 size: 0)
break  mooseAssert
break /home/john/projects/moose/framework/include/utils/MooseArray.h:289
break /home/john/projects/moose/framework/include/utils/MooseArray.h, line 289
break MPI_Abort
break MPI_abort
break MPI_ABORT

and nothing gave me a trace. I'm sorry, I'm a bit confused.

andrsd commented 9 years ago

Let's be explicit here. If you do:

$ gdb --args ./your_app-method -i input_file.i
(gdb) break MPI_Abort
(gdb) run
<observed a crash>
(gdb) bt 

you do not get any stack?

mangerij commented 9 years ago

Yup. I just checked again just to be safe. No stack.

andrsd commented 9 years ago

I forgot to mention: you used METHOD=dbg (i.e., debug mode), right? Not opt...
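
For reference, with the standard MOOSE application makefile that would be roughly (a sketch; the binary name follows the usual app-dbg convention, and the paths and rank count are placeholders):

$ cd <ferret>
$ METHOD=dbg make -j4
$ mpiexec -np 16 ./ferret-dbg -i input_file.i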

andrsd commented 9 years ago

So, I compiled ferret master, generated the mesh using your script, used your input file, and ran the thing with mpiexec -np 16 in both devel and opt, and it just works...

permcody commented 9 years ago

While that may be true, you wouldn't hit the assertion in those modes, so there could still be a problem. By the way, I double-checked and was wrong about the breakpoint. It should be MPI_abort with a lower case "a". Try that instead.

Cody

dkarpeyev commented 9 years ago

@mangerij I can't reproduce this either. I'm guessing here, but perhaps there is a broken build that causes memory corruption and triggers the assert? That sounds convoluted, and it probably is, but we need to figure out a way to reproduce it. Is the error message at the top of this thread complete? Is there any more information on where it fails?

andrsd commented 9 years ago

It should be MPI_abort with a lower case "a".

Uppercase A as far as I can tell: http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html

permcody commented 9 years ago

Oops, wrong one

andrsd commented 9 years ago

Just verified that dbg mode works on my machine as well... I'd recommend trying a clean build first (as @karpeev suggested).

mangerij commented 9 years ago

Yeah, I'll be doing that next week: a fresh install of PETSc/libMesh/MOOSE on a new laptop.

I'm a bit confused as to why the mesh in question runs with the diffusion kernel but not with the added kernels in Ferret, even though those kernels work on 16 MPI processes with different meshes on my machine, and why the debugger is not giving a stack at MPI_Abort (I've tried all the suggestions above).

permcody commented 9 years ago

Have you verified that your application is valgrind clean? If you have access to clang, you might try AddressSanitizer.
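
For example (a sketch; the executable and input names are placeholders, and a small rank count keeps valgrind's slowdown manageable while still exposing most memory errors):

$ mpiexec -np 2 valgrind --track-origins=yes ./ferret-dbg -i input_file.i

For AddressSanitizer, rebuild with clang and -fsanitize=address on both the compile and link lines (how the flag gets injected depends on your build setup), then run normally and read the report printed at the first bad access.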

andrsd commented 9 years ago

Can you try a quick:

$ cd <ferret>
$ make cleanall
$ make

and see what you get (unless you already tried that)? It might also be worth re-running the update_and_rebuild_libmesh.sh script.
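
Something along these lines, assuming the ~/projects layout from the traceback above (a sketch):

$ cd ~/projects/moose
$ ./scripts/update_and_rebuild_libmesh.sh
$ cd <ferret>
$ make cleanall
$ make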

andrsd commented 9 years ago

If you have access to clang you might try the Address Sanitizer.

I have AddressSanitizer active for devel mode, and I did not see anything yesterday when I tried it for the first time. I still suspect a broken build...

dkarpeyev commented 9 years ago

Another possibility is a broken CUBIT, but I would start by rebuilding everything and making sure Ferret is valgrind-clean.

permcody commented 9 years ago

Reopen if you are able to reproduce this error or can send us a test case. Thanks

mangerij commented 9 years ago

Yeah, I plan on it. This is on my to-do list. Thanks :)
