idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1

Test Failures on Falcon #7387

Closed friedmud closed 5 years ago

friedmud commented 8 years ago

Description of the enhancement or error report

I'm seeing 4 test failures on Falcon:

time_steppers/timesequence_stepper.timesequence_failed...................................... FAILED (EXODIFF)
materials/boundary_material.bnd_coupling_vol................................................ FAILED (EXODIFF)
vectorpostprocessors/elements_along_plane.2d................................................ FAILED (CSVDIFF)
vectorpostprocessors/elements_along_plane.3d................................................ FAILED (CSVDIFF)

I'm using the very latest libmesh (from today), and I'm disabling TBB (for OpenMP), but neither of those should be affecting these tests (I'm not running the tests with threads).

Can anyone else confirm / deny this?

Rationale for the enhancement or information for reproducing the error

Run the tests on Falcon

Identified impact

Fix the tests

brianmoose commented 8 years ago

Can confirm. Those same 4 tests fail for me as well.

dschwen commented 8 years ago

The exact same tests also fail on Mac with clang and -Ofast -march=haswell.

bwspenc commented 7 years ago

I'm going to take the liberty of hijacking this issue and expanding its scope to include fixing the modules tests that are also broken on Falcon.

This issue is over a year old, and 3 of those tests are still broken:

materials/boundary_material.bnd_coupling_vol................................................ FAILED (EXODIFF)
vectorpostprocessors/elements_along_plane.3d................................................ FAILED (CSVDIFF)
vectorpostprocessors/elements_along_plane.2d................................................ FAILED (CSVDIFF)

Also, there's a much larger set of broken modules tests (46):

tensor_mechanics/test:2D_geometries.plane_strain............................................ FAILED (EXODIFF)
tensor_mechanics/test:capped_weak_plane.pull_and_shear_1step................................ FAILED (EXODIFF)
tensor_mechanics/test:CylindricalRankTwoAux.test............................................ FAILED (EXODIFF)
tensor_mechanics/test:material_limit_time_step/elas_plas.nl1_lim............................ FAILED (EXODIFF)
tensor_mechanics/test:smeared_cracking.cracking_function.................................... FAILED (EXODIFF)
solid_mechanics/test:material_limit_time_step/elas_plas.nl1_lim............................. FAILED (EXODIFF)
heat_conduction/test:meshed_gap_thermal_contact.annulus..................................... FAILED (EXODIFF)
combined/test:gap_heat_transfer_htonly.RZ................................................... FAILED (EXODIFF)
combined/test:gap_heat_transfer_htonly.RSpherical........................................... FAILED (EXODIFF)
combined/test:inelastic_strain/elas_plas.elastic_plastic_sm................................. FAILED (EXODIFF)
phase_field/test:mobility_derivative.mobility_derivative_direct_coupled_test................ FAILED (EXODIFF)
combined/test:eigenstrain.inclusion......................................................... FAILED (EXODIFF)
phase_field/test:actions.conserved_split_1var_variable_mob.................................. FAILED (EXODIFF)
richards/test:excav.ex01.................................................................... FAILED (EXODIFF)
phase_field/test:mobility_derivative.mobility_derivative_direct_test........................ FAILED (EXODIFF)
porous_flow/test:mass_conservation.mass06.................................................... FAILED (ERRMSG)
porous_flow/test:mass_conservation.mass05.................................................... FAILED (ERRMSG)
porous_flow/test:aux_kernels.darcy_velocity.................................................. FAILED (ERRMSG)
porous_flow/test:dirackernels.squarepules.................................................... FAILED (ERRMSG)
porous_flow/test:energy_conservation.heat05.................................................. FAILED (ERRMSG)
porous_flow/test:dirackernels.theis3......................................................... FAILED (ERRMSG)
porous_flow/test:fluidstate.theis............................................................ FAILED (ERRMSG)
combined/test:internal_volume.rz_cone........................................................ FAILED (ERRMSG)
tensor_mechanics/test:tensile.small_deform2_update.......................................... FAILED (CSVDIFF)
tensor_mechanics/test:capped_mohr_coulomb.small12........................................... FAILED (CSVDIFF)
tensor_mechanics/test:capped_mohr_coulomb.small2............................................ FAILED (CSVDIFF)
tensor_mechanics/test:capped_mohr_coulomb.small24........................................... FAILED (CSVDIFF)
porous_flow/test:relperm.unity.............................................................. FAILED (CSVDIFF)
porous_flow/test:relperm.corey2............................................................. FAILED (CSVDIFF)
porous_flow/test:relperm.brookscorey2....................................................... FAILED (CSVDIFF)
porous_flow/test:relperm.corey4............................................................. FAILED (CSVDIFF)
porous_flow/test:relperm.corey1............................................................. FAILED (CSVDIFF)
porous_flow/test:relperm.brookscorey1....................................................... FAILED (CSVDIFF)
porous_flow/test:relperm.vangenuchten2...................................................... FAILED (CSVDIFF)
porous_flow/test:relperm.corey3............................................................. FAILED (CSVDIFF)
porous_flow/test:relperm.vangenuchten1...................................................... FAILED (CSVDIFF)
porous_flow/test:dirackernels.bh05.......................................................... FAILED (CSVDIFF)
porous_flow/test:capillary_pressure.vangenuchten2........................................... FAILED (CSVDIFF)
porous_flow/test:capillary_pressure.brookscorey1............................................ FAILED (CSVDIFF)
porous_flow/test:capillary_pressure.brookscorey2............................................ FAILED (CSVDIFF)
porous_flow/test:capillary_pressure.vangenuchten1........................................... FAILED (CSVDIFF)
porous_flow/test:capillary_pressure.vangenuchten3........................................... FAILED (CSVDIFF)
tensor_mechanics/test:static_deformations.beam_cosserat_01.................................. FAILED (CSVDIFF)
solid_mechanics/test:line_material_symm_tensor_sampler.test................................. FAILED (CSVDIFF)
tensor_mechanics/test:line_material_rank_two_sampler.rank_two_sampler....................... FAILED (CSVDIFF)
tensor_mechanics/test:line_material_rank_two_sampler.rank_two_scalar_sampler................ FAILED (CSVDIFF)
bwspenc commented 7 years ago

Just like we've been fixing the BISON tests on falcon, we should fix these as well.

@brianmoose Could you help us add an HPC test target for moose (if one doesn't already exist)?

Tagging developers: @acasagran @gardnerru @giopastor @jasondhales @dschwen @sapitts @jiangwen84 @novasr @hoffwm @permcody @sveerara

@WilkAndy: A number of these tests are yours. I don't know whether you have access to this machine, so I'll send you the log file; maybe just by looking at that you can make a pretty good guess at what needs to change to make these tests a little more robust to platform differences.

Not sure what it would take to do this, but I'd love to see some of our build boxes replicate the falcon environment, and add that to our set of test targets that get run on every PR.

brianmoose commented 7 years ago

I added an optional "Test hpc" for moose PRs. I also added it to the moose nightly testing so we can keep track of it.

WilkAndy commented 7 years ago

Wow, that list is pretty disastrous for me! I don't have access to Falcon, so you'll have to send me the log file.

WilkAndy commented 7 years ago

Oh, I just got the email with the log file in it - thanks.

WilkAndy commented 7 years ago

After inspecting the log file, there are no real problems here. It'll be a bit harder to fix without access to FALCON, so I'll just have to trial-and-error it on the test boxes.

One thing I don't understand: when a test is known to have a residual of zero, all other computers give |R|=1E-15 or exactly |R|=0, but FALCON sometimes gives |R|=1E-8 or something "large" like that. Why?!? Is it an error we can't control (e.g., something being compiled in single precision) or an error in my code?

For my future reference, here's a rundown of MY tests (I didn't look at other people's):

Failed due to tolerance problems, but probably nothing difficult to fix: all tensormechanics tests; dirac bh05; richards excav ex01

Failed due to initial residual not being zero or virtually zero (why? is FALCON doing something crazy?): all porousflow relperm; all mass conservation; all energy conservation; all capillary pressure; dirac kernels squarepulse; aux kernels darcy velocity

Failed due to snes tolerances not being set well: dirackernels theis3; fluidstate theis.

permcody commented 7 years ago

@WilkAndy - thanks for taking a look. I'm not sure I have an answer for you. We use MVAPICH and GCC/4.9.2, which is of course different from our build boxes. Just about everything is a little different, but it shouldn't be so different that it causes us to pull our hair out. I'm happy to see Ben kicking this around a bit; if it's a priority and we really have to dig in, we'll learn something and can hopefully make recommendations to others on how to fix these problems.

I know I've offered this before, but if you'd like a token so that you can access our machines, I'm more than happy to send one to you. @brianmoose made an optional Falcon test target you can turn on if you decide you want to look.

WilkAndy commented 7 years ago

@bwspenc and @permcody (and anyone else who's keen!): before changing my tests, do you think you could explore this good example:

porous_flow/test/tests/relperm/unity.i

This has two variables, p0 and s1, with p0 = 1E6 and s1 = x initially. They both use only the Diffusion kernel. FALCON's first nonlinear residual is 6.69E-09!! On my other machines, the first nonlinear residual is O(1E-15), as you'd expect. I get the feeling I'm missing something, but to me FALCON's result is suggestive of reduced-precision arithmetic, or of FALCON somehow mixing the p = O(1E6) scale with the s = O(1) scale. (The rest of the test just uses postprocessors to check that various material properties are correct.)
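
To illustrate the scale-mixing point, here's a tiny standalone C++ snippet (made-up numbers, nothing to do with the actual MOOSE residual assembly): once O(1E6) terms are in play, double precision can't resolve contributions much below ~1E-10, and the order in which the compiler evaluates things decides what's left over.

// Standalone illustration (not MOOSE code): with terms of size O(1e6) in
// the mix, roundoff at the ~1e-10 level is normal, and the evaluation
// order chosen by the compiler decides what survives.
#include <cstdio>

int main()
{
  const double big   = 1.0e6;   // hypothetical "pressure-scale" term
  const double small = 5.0e-11; // below half an ulp of 1e6 (~5.8e-11)

  // The same three terms, summed in two different orders:
  const double order1 = (big + small) - big; // 'small' is absorbed -> 0
  const double order2 = (big - big) + small; // cancellation first  -> 5e-11

  std::printf("order1 = %g, order2 = %g\n", order1, order2);
  return 0;
}

That said, 6.69E-09 is still quite a bit bigger than the ~1E-10 level you'd expect from roundoff at these scales, so this may not be the whole story.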

WilkAndy commented 7 years ago

@permcody, thanks for the offer of the token, but I know you guys take security quite seriously and I don't really want to have that responsibility. It's easier to just use the testing system.

bwspenc commented 7 years ago

@dschwen commented above that the -Ofast -march=haswell options caused those same tests to fail on his Mac. It's been so long since I've messed with compile options that I can't remember how to set them off the top of my head, but I'd be interested to see whether we get similar results with these modules tests. Does anyone know if we are using more aggressive optimization options on Falcon than on other machines?
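
For what it's worth, one concrete way a -march flag alone can change floating-point results even at -O2 is FMA contraction: on FMA-capable hardware (Haswell and later) the compiler is allowed to fuse a*b + c into a single instruction that rounds once instead of twice. I don't know whether that's actually what's happening on Falcon, but here's a small non-MOOSE snippet that calls std::fma explicitly to mimic the contraction:

// Sketch of FMA contraction changing a result that "should" be zero.
#include <cmath>
#include <cstdio>

int main()
{
  const double a = 1.0 / 3.0;
  const double b = 3.0;

  // With two separate multiplies, both products round identically,
  // so the difference is exactly 0.
  const double two_roundings = a * b - a * b;

  // What a contracted version effectively computes: the product a*b
  // feeding the fma is not rounded first, so the result is ~ -5.6e-17.
  const double fused = std::fma(a, b, -(a * b));

  std::printf("two roundings: %g   fused: %g\n", two_roundings, fused);
  return 0;
}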

permcody commented 6 years ago

@milljm - Do you know what optimization options are in use for moose-dev-gcc on Falcon?

milljm commented 6 years ago
[falcon1][~]> which mpicc
/apps/local/easybuild/software/MVAPICH2/2.0.1-GCC-4.9.2/bin/mpicc
[falcon1][~]> mpicc -show
gcc -O2 -march=native -I/apps/local/easybuild/software/MVAPICH2/2.0.1-GCC-4.9.2/include -L/apps/local/easybuild/software/MVAPICH2/2.0.1-GCC-4.9.2/lib -Wl,-rpath -Wl,/apps/local/easybuild/software/MVAPICH2/2.0.1-GCC-4.9.2/lib -lmpich -lopa -lmpl

I didn't really build this target, so it's hard to pull up that sort of information...

permcody commented 6 years ago

So the MPICH wrapper isn't doing anything crazy, but that doesn't mean that hypre and petsc don't have extra flags tossed in...

bwspenc commented 6 years ago

Is there an option to use when building to show what the actual compile commands are?

permcody commented 6 years ago

"make -n" will show you what is being run. The output is enormous but you can search through it.

milljm commented 6 years ago

PETSc was configured as follows:

./configure \
--prefix=$PACKAGES_DIR/petsc/petsc-3.6.3-hypre \
--with-hypre-dir=$PACKAGES_DIR/hypre/hypre-2.10.0b-p4 \
--with-ssl=0 \
--with-debugging=no \
--with-pic=1 \
--with-shared-libraries=1 \
--with-cc=mpicc \
--with-cxx=mpicxx \
--with-fc=mpif90 \
--download-fblaslapack=1 \
--download-metis=1 \
--download-parmetis=1 \
--download-superlu_dist=1 \
--download-scalapack=1 \
--download-mumps=1 \
CC=mpicc CXX=mpicxx FC=mpif90 F77=mpif77 F90=mpif90 \
CFLAGS='-fPIC -fopenmp' \
CXXFLAGS='-fPIC -fopenmp' \
FFLAGS='-fPIC -fopenmp' \
FCFLAGS='-fPIC -fopenmp' \
F90FLAGS='-fPIC -fopenmp' \
F77FLAGS='-fPIC -fopenmp' \
PETSC_DIR=`pwd`

HYPRE: Unfortunately, I do not have the exact build recipe I used (I am not seeing an old hypre-2.10.0b-p4 build directory). However, the configuration arguments I most likely used were (excluding the prefix, of course):

./configure --prefix=/apps/moose/gnu/hypre/hypre-2.10.1 \
--with-blas-libs=  \
--with-blas-lib-dir= \
--with-lapack-libs= \
--with-lapack-lib-dir= \
--with-blas=yes  \
--with-lapack=yes \
--with-LDFLAGS=-fopenmp \
--with-openmp \
--enable-bigint \
CC=mpicc CXX=mpicxx FC=mpif90 F77=mpif77 F90=mpif90 \
CFLAGS='-O3 -fopenmp -fPIC' \
CXXFLAGS='-O3 -fopenmp -fPIC' \
FFLAGS='-O3 -fopenmp -fPIC' \
FCFLAGS='-O3 -fopenmp -fPIC' \
F90FLAGS='-O3 -fopenmp -fPIC' \
F77FLAGS='-O3 -fopenmp -fPIC'

Back when I was building HYPRE manually, there was really only one way to do it. I could look into this a bit more. This target is a bit odd in that I built HYPRE separately; normally we let PETSc build it. I believe @YaqiWang asked for this specific build when we found that the HYPRE version PETSc 3.6.3 was pulling in was failing.

bwspenc commented 6 years ago

Here's a section of the compile flags I get on Falcon (moose-dev-gcc):

libtool  --tag=CXX  --mode=compile --quiet mpicxx -DNDEBUG  -DTENSOR_MECHANICS_ENABLED -std=gnu++11 -O2 -felide-constructors -funroll-loops -fstrict-aliasing -Wdisabled-optimization -fopenmp -DMETHOD=opt -Werror=return-type -Werror=reorder -Woverlength-strings

and here's what I get for the same file on my Mac with clang:

libtool  --tag=CXX  --mode=compile --quiet ccache clang++ -DNDEBUG  -DTENSOR_MECHANICS_ENABLED -std=gnu++11 -O2 -felide-constructors -Qunused-arguments -Wunused-parameter -Wunused -fopenmp -DMETHOD=opt -Werror=return-type -Werror=reorder -Woverlength-strings

I see two extra flags on Falcon that might be doing something: -funroll-loops and -fstrict-aliasing. My guess is that the loop unrolling is probably not affecting things. I don't have any experience with strict aliasing, but from a quick web search it seems controversial. Does anyone know anything about it?
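
For reference, from what I've read, the kind of code strict aliasing affects is type punning through incompatible pointer types. I have no idea whether MOOSE or any of its dependencies actually does this anywhere; a minimal hypothetical example would be something like:

// Hypothetical aliasing violation: under -fstrict-aliasing the optimizer
// may assume a uint32_t* never points at a float, so the cast-based read
// can be reordered or otherwise miscompiled.
#include <cstdint>
#include <cstdio>
#include <cstring>

std::uint32_t bits_via_cast(float f)
{
  return *reinterpret_cast<std::uint32_t*>(&f); // undefined behaviour
}

std::uint32_t bits_via_memcpy(float f)
{
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u); // well-defined way to inspect the bits
  return u;
}

int main()
{
  std::printf("%08x %08x\n", (unsigned)bits_via_cast(1.0f), (unsigned)bits_via_memcpy(1.0f));
  return 0;
}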

bwspenc commented 6 years ago

Actually, it looks like GCC already enables -fstrict-aliasing at -O2, so passing it separately shouldn't even have an effect, since we're already using -O2.

brianmoose commented 6 years ago

Another data point: compiling MOOSE on Falcon with the PETSc/3.7.5-intel-2017.01 and Python/2.7.13-intel-2017.01 modules, the framework tests pass, the unit tests pass, and the following modules tests fail:

tensor_mechanics/test:smeared_cracking.cracking_function.................................... FAILED (EXODIFF)
richards/test:pressure_pulse.pp_fu_lumped_22................................................ FAILED (EXODIFF)
combined/test:gap_heat_transfer_htonly.RSpherical........................................... FAILED (EXODIFF)

I couldn't get water_stream_eos to compile due to some Fortran errors, so that was not run (and was removed from the combined executable).

bwspenc commented 5 years ago

I think we can almost close this issue. With the environment we are using for our testing (PETSc 3.7.6), all of the tests pass. However, with the moose-dev-gcc module (PETSc 3.6.3), this one test still fails:

functional_expansion_tools/test:standard_use.interface_coupling .... [min_threads=2,FINISHED] FAILED (TIMEOUT)

I'm curious how much breakage we have with the 3.8.3 and 3.9.4 versions of PETSc...

bwspenc commented 5 years ago

I ran the tests with PETSc 3.8.3 and 3.9.4 on Falcon, and they all pass. I'm just setting a minimum PETSc version on this test so we can close this issue.