Open WilkAndy opened 6 years ago
Just prompting someone to look at this...
I wonder whether this is related to Issue #11335 . When PR #11459 gets merged, i'll investigate a bit further
Nope, i just tried the line_sink03.i again. 9 times out of 10 the Jacobian was OK, but 1 out of 10 it wasn't. Bummer
I "valgrinded" porous flow using that line_sink03.i and didn't get anything useful. Just ten zillion valgrind messages but none of them had PorousFlow anywhere.
Not related to this problem but some valgrind help: Make sure that you use the suppression file when running valgrind so that you don't get any false positives. You can see how that's done by looking at the command the test harness runs when you use the --valgrind
flag. You really should have zero warnings or errors.
@jwpeterson could you take a look at this when you get some time? I believe you were in here last so maybe you'll see something quickly...
I can try... seems that the problem would be related to having more Dirac points in a given element than the number of quadrature points in the quadrature rule for a given element? @WilkAndy just curious if the code is more likely to fail the more Dirac points you have in a given element?
Yea, i think that more Dirac points = more chance of something going wrong.
If you have 10 mins to look at that line_sink03.i quickly then i'd appreciate it. If you don't see anything obvious then at least we know it's not super trivial.
Definitely something weird is going on. After modifying the test in question a bit as on this branch I find that on repeated runs I get at least three distinct solve histories with more or less equal frequency:
History 1
Time Step 0, time = 0
dt = 0
Time Step 1, time = 1
dt = 1
0 Nonlinear |R| = 6.218387e+01
0 Linear |R| = 6.218387e+01
1 Linear |R| = 6.212737e+01
2 Linear |R| = 2.167926e+03
3 Linear |R| = 1.082510e+04
4 Linear |R| = 1.072475e+04
5 Linear |R| = 9.606824e+03
6 Linear |R| = 9.665973e+03
7 Linear |R| = 9.019453e+03
8 Linear |R| = 9.816942e+03
9 Linear |R| = 1.011788e+04
10 Linear |R| = 9.923565e+03
1 Nonlinear |R| = 6.218124e+01
0 Linear |R| = 6.218124e+01
1 Linear |R| = 6.215966e+01
2 Linear |R| = 1.043944e+03
3 Linear |R| = 3.813611e+03
4 Linear |R| = 4.486814e+03
5 Linear |R| = 5.430237e+03
6 Linear |R| = 2.814007e+04
7 Linear |R| = 4.104297e+04
8 Linear |R| = 4.281316e+04
9 Linear |R| = 4.847741e+04
10 Linear |R| = 5.138161e+04
Solve Did NOT Converge!
Time Step 1, time = 0.5
dt = 0.5
0 Nonlinear |R| = 6.218387e+01
0 Linear |R| = 6.218387e+01
1 Linear |R| = 6.203097e+01
2 Linear |R| = 6.203097e+01
3 Linear |R| = 1.975839e+02
4 Linear |R| = 8.606747e+02
5 Linear |R| = 8.000082e+02
6 Linear |R| = 1.079152e+03
7 Linear |R| = 1.348130e+03
8 Linear |R| = 1.653580e+03
9 Linear |R| = 1.900038e+03
10 Linear |R| = 2.292019e+03
1 Nonlinear |R| = 6.204138e+01
0 Linear |R| = 6.204138e+01
1 Linear |R| = 6.202529e+01
2 Linear |R| = 4.664244e+03
3 Linear |R| = 5.844073e+03
4 Linear |R| = 1.228287e+04
5 Linear |R| = 7.853182e+04
6 Linear |R| = 1.118292e+05
7 Linear |R| = 1.232677e+05
8 Linear |R| = 2.788981e+05
9 Linear |R| = 4.490579e+05
10 Linear |R| = 2.855994e+05
Solve Did NOT Converge!
History 2
Time Step 0, time = 0
dt = 0
Time Step 1, time = 1
dt = 1
0 Nonlinear |R| = 6.218387e+01
0 Linear |R| = 6.218387e+01
1 Linear |R| = 2.799012e-26
1 Nonlinear |R| = 6.192867e+01
0 Linear |R| = 6.192867e+01
1 Linear |R| = 8.031469e-26
2 Nonlinear |R| = 6.184053e+01
0 Linear |R| = 6.184053e+01
1 Linear |R| = 1.237857e-26
3 Nonlinear |R| = 6.173510e+01
0 Linear |R| = 6.173510e+01
1 Linear |R| = 5.290771e-27
4 Nonlinear |R| = 6.161104e+01
0 Linear |R| = 6.161104e+01
1 Linear |R| = 3.311840e-26
5 Nonlinear |R| = 6.147000e+01
0 Linear |R| = 6.147000e+01
1 Linear |R| = 5.900796e-26
6 Nonlinear |R| = 6.131480e+01
0 Linear |R| = 6.131480e+01
1 Linear |R| = 2.024754e-26
7 Nonlinear |R| = 6.114873e+01
0 Linear |R| = 6.114873e+01
1 Linear |R| = 3.149196e-26
8 Nonlinear |R| = 6.097491e+01
0 Linear |R| = 6.097491e+01
1 Linear |R| = 3.001229e-26
9 Nonlinear |R| = 6.079675e+01
0 Linear |R| = 6.079675e+01
1 Linear |R| = 1.145862e-25
10 Nonlinear |R| = 6.061728e+01
Solve Did NOT Converge!
Time Step 1, time = 0.5
dt = 0.5
0 Nonlinear |R| = 6.218387e+01
0 Linear |R| = 6.218387e+01
1 Linear |R| = 1.565441e-26
1 Nonlinear |R| = 6.178824e+01
0 Linear |R| = 6.178824e+01
1 Linear |R| = 3.027155e-26
2 Nonlinear |R| = 6.161490e+01
0 Linear |R| = 6.161490e+01
1 Linear |R| = 1.332055e-25
3 Nonlinear |R| = 6.142760e+01
0 Linear |R| = 6.142760e+01
1 Linear |R| = 2.747005e-25
4 Nonlinear |R| = 6.122558e+01
0 Linear |R| = 6.122558e+01
1 Linear |R| = 6.339966e-25
5 Nonlinear |R| = 6.101088e+01
0 Linear |R| = 6.101088e+01
1 Linear |R| = 2.778841e-24
6 Nonlinear |R| = 6.078607e+01
0 Linear |R| = 6.078607e+01
1 Linear |R| = 2.724387e-26
7 Nonlinear |R| = 6.055365e+01
0 Linear |R| = 6.055365e+01
1 Linear |R| = 8.258255e-26
8 Nonlinear |R| = 6.031611e+01
0 Linear |R| = 6.031611e+01
1 Linear |R| = 6.396192e-26
9 Nonlinear |R| = 6.007548e+01
0 Linear |R| = 6.007548e+01
1 Linear |R| = 3.379630e-26
10 Nonlinear |R| = 5.983571e+01
Solve Did NOT Converge!
History 3
Time Step 0, time = 0
dt = 0
Time Step 1, time = 1
dt = 1
0 Nonlinear |R| = 6.218387e+01
0 Linear |R| = 6.218387e+01
1 Linear |R| = 1.457812e+03
2 Linear |R| = nan
Solve Did NOT Converge!
Time Step 1, time = 0.5
dt = 0.5
0 Nonlinear |R| = 6.218387e+01
0 Linear |R| = 6.218387e+01
1 Linear |R| = 8.851223e+02
2 Linear |R| = nan
Solve Did NOT Converge!
Seems like there's definitely something going on here that valgrind would find, so I'm testing that next...
OK, i'm glad that you can see the non-repeatable strangeness!
@WilkAndy, running this particular test in dbg mode actually produces a lot of floating point exceptions. After investigating, these are apparently caused by multiplying 0 * Inf
, as that produces a NaN
by definition, and any math operation that produces a NaN throws a floating point exception. This seems to happen fairly regularly on the line of SimpleFluidProperties.C marked below:
115 {
116 rho = this->rho(pressure, temperature);
117 drho_dp = rho / _bulk_modulus;
-> 118 drho_dT = -_thermal_expansion * rho;
119 }
120
121 Real
since
(lldb) p _thermal_expansion
(libMesh::Real) $3 = 0
(lldb) p ::rho
(libMesh::Real) $4 = +Inf
Now, I suspect this happens because of some previous unrelated error which has caused there to be non-physical values of the pressure and temperature:
(lldb) p pressure
(libMesh::Real) $5 = 1617.4578675891707
(lldb) p temperature
(libMesh::Real) $6 = 1.8451729058342861
but it could also be because I'm using this input file in a way that was not intended (i.e. not as a -snes_type test
). If there are some changes I should make to the input file to generate a more physically realistic problem, please let me know...
You probably already figured it out by now, but to run this particular example through valgrind "manually", you can cd to the test directory and do:
valgrind --suppressions=$MOOSE_DIR/python/TestHarness/suppressions/errors.supp --leak-check=full --tool=memcheck --dsymutil=yes --track-origins=yes --demangle=yes -v ../../../porous_flow-dbg -i line_sink03.i
When I run this test through valgrind, I get lots of "Conditional jump or move depends on uninitialised value(s)" errors that look about like the following:
==21358== 2 errors in context 7 of 731:
==21358== Conditional jump or move depends on uninitialised value(s)
==21358== at 0x11B09E8D: PetscIsInfOrNanReal (mathinf.c:52)
==21358== by 0x123D0C69: KSPConvergedDefault (iterativ.c:778)
==21358== by 0xB9FD56A: Moose::PetscSupport::petscConverged(_p_KSP*, int, double, KSPConvergedReason*, void*) (PetscSupport.C:314)
==21358== by 0x12424D09: KSPSolve_BCGS (bcgs.c:140)
==21358== by 0x123C19EE: KSPSolve (itfunc.c:656)
==21358== by 0x124A7B19: SNESSolve_NEWTONLS (ls.c:230)
==21358== by 0x124D1808: SNESSolve (snes.c:4005)
==21358== by 0xFE5C171: libMesh::PetscNonlinearSolver<double>::solve(libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&, libMesh::NumericVector<double>&, double, unsigned int) (petsc_nonlinear_solver.C:930)
==21358== by 0x100E5B6D: libMesh::NonlinearImplicitSystem::solve() (nonlinear_implicit_system.C:183)
==21358== by 0xADD66C9: TimeIntegrator::solve() (TimeIntegrator.C:52)
==21358== by 0xB79380C: NonlinearSystem::solve() (NonlinearSystem.C:184)
==21358== by 0xA1084EE: FEProblemBase::solve() (FEProblemBase.C:3782)
==21358== Uninitialised value was created by a heap allocation
==21358== at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21358== by 0x53EEF7B: MooseArray<double>::resize(unsigned int) (MooseArray.h:206)
==21358== by 0x548496B: MaterialProperty<double>::resize(int) (MaterialProperty.h:223)
==21358== by 0xA5E8C2E: MaterialProperties::resizeItems(unsigned int) (MaterialProperty.h:298)
==21358== by 0xA59C4B5: MaterialData::resize(unsigned int) (MaterialData.C:34)
==21358== by 0xA13286D: FEProblemBase::reinitMaterials(unsigned short, unsigned int, bool) (FEProblemBase.C:2351)
==21358== by 0xB9746E1: ComputeDiracThread::onElement(libMesh::Elem const*) (ComputeDiracThread.C:86)
==21358== by 0xB83CB93: ThreadedElementLoopBase<libMesh::StoredRange<std::_Rb_tree_const_iterator<libMesh::Elem const*>, libMesh::Elem const*> >::operator()(libMesh::StoredRange<std::_Rb_tree_const_iterator<libMesh::Elem const*>, libMesh::Elem const*> const&, bool) (ThreadedElementLoopBase.h:197)
==21358== by 0xB791ECA: NonlinearSystemBase::computeDiracContributions(bool) (NonlinearSystemBase.C:2466)
==21358== by 0xB78EFF9: NonlinearSystemBase::computeResidualInternal(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (NonlinearSystemBase.C:1273)
==21358== by 0xB77D5DC: NonlinearSystemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (NonlinearSystemBase.C:570)
==21358== by 0xA1498B4: FEProblemBase::computeResidualTags(std::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (FEProblemBase.C:4179)
The bottom part of each error, which shows where the uninitialised variable is created on the heap within a MooseArray
, is basically the same in each case, and points to the culprit being FEProblemBase::reinitMaterials()
as called from NonlinearSystemBase::computeDiracContributions()
. The top part of the error is usually something down in PETSc corresponding to where the uninitialized values are actually used.
Anyway, just some information that I was able to gather... I'm still not really sure how to fix the problem so if anyone has ideas I'm open to suggestions.
I agree with you that the 0*Inf is probably a manifestation of something going wrong prior to this. The "0" is correct, but the "Inf" is almost definitely due to rho = exp(P/bulk), where P=1617 according to your debugger, and bulk=1.5 or 0.5 according to the input file.
Re-looking at this input file i can see a number of Dirac Kernels. dirac0 to dirac5 have only 1 Dirac Point each, and i've written that MOOSE produces no error with only them being active. However, as soon as dirac6 or dirac7 are made active MOOSE goes crazy. dirac6 has 9 Dirac Points and dirac7 has 10 Dirac Points. So perhaps this gives you some more to go on.
==29672== Conditional jump or move depends on uninitialised value(s)
==29672== at 0x1DB63302: __printf_fp (in /lib64/libc-2.22.so)
==29672== by 0x1DB5EE38: printf_positional (in /lib64/libc-2.22.so)
==29672== by 0x1DB60022: vfprintf (in /lib64/libc-2.22.so)
==29672== by 0x1DB89408: vsnprintf (in /lib64/libc-2.22.so)
==29672== by 0x123943AD: PetscVSNPrintf (in /apps/petsc/3.7.7-gcc-mpi-moose/lib/libpetsc.so.3.7.7)
==29672== by 0x12394730: PetscVFPrintfDefault (in /apps/petsc/3.7.7-gcc-mpi-moose/lib/libpetsc.so.3.7.7)
==29672== by 0x12396066: PetscPrintf (in /apps/petsc/3.7.7-gcc-mpi-moose/lib/libpetsc.so.3.7.7)
==29672== by 0x12AEAC3A: SNESSolve_Test (in /apps/petsc/3.7.7-gcc-mpi-moose/lib/libpetsc.so.3.7.7)
==29672== by 0x12B098B5: SNESSolve (in /apps/petsc/3.7.7-gcc-mpi-moose/lib/libpetsc.so.3.7.7)
==29672== by 0xD2B7E5B: libMesh::PetscNonlinearSolver<double>::solve(libMesh::SparseMatrix<double>&, libMesh::NumericVector<double>&, libMesh::NumericVector<double>&, double, unsigned int) (petsc_nonlinear_solver.C:930)
==29672== by 0xD331EFA: libMesh::NonlinearImplicitSystem::solve() (nonlinear_implicit_system.C:183)
==29672== by 0x9E4C229: TimeIntegrator::solve() (TimeIntegrator.C:46)
==29672== Uninitialised value was created by a heap allocation
==29672== at 0x4C29CDA: operator new[](unsigned long) (vg_replace_malloc.c:422)
==29672== by 0x54C031C: MooseArray<double>::resize(unsigned int) (MooseArray.h:206)
==29672== by 0x54BEB19: MaterialProperty<double>::resize(int) (MaterialProperty.h:223)
==29672== by 0x973E9AA: MaterialProperties::resizeItems(unsigned int) (MaterialProperty.h:298)
==29672== by 0x9734F07: MaterialData::resize(unsigned int) (MaterialData.C:34)
==29672== by 0x98515C5: FEProblemBase::reinitMaterials(unsigned short, unsigned int, bool) (FEProblemBase.C:2351)
==29672== by 0x93C91F3: ComputeDiracThread::onElement(libMesh::Elem const*) (ComputeDiracThread.C:86)
==29672== by 0x9D35406: ThreadedElementLoopBase<libMesh::StoredRange<__gnu_debug::_Safe_iterator<std::_Rb_tree_const_iterator<libMesh::Elem const*>, std::__debug::set<libMesh::Elem const*, std::less<libMesh::Elem const*>, std::allocator<libMesh::Elem const*> > >, libMesh::Elem const*> >::operator()(libMesh::StoredRange<__gnu_debug::_Safe_iterator<std::_Rb_tree_const_iterator<libMesh::Elem const*>, std::__debug::set<libMesh::Elem const*, std::less<libMesh::Elem const*>, std::allocator<libMesh::Elem const*> > >, libMesh::Elem const*> const&, bool) (ThreadedElementLoopBase.h:197)
==29672== by 0x9D1150E: NonlinearSystemBase::computeDiracContributions(bool) (NonlinearSystemBase.C:2466)
==29672== by 0x9D09E7D: NonlinearSystemBase::computeResidualInternal(std::__debug::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (NonlinearSystemBase.C:1273)
==29672== by 0x9D06466: NonlinearSystemBase::computeResidualTags(std::__debug::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (NonlinearSystemBase.C:570)
==29672== by 0x985D118: FEProblemBase::computeResidualTags(std::__debug::set<unsigned int, std::less<unsigned int>, std::allocator<unsigned int> > const&) (FEProblemBase.C:4179)
==29672==
So i'm getting the same error as you. But i don't really understand what it means and certainly not how to fix it. Is it that a Material property is not being correctly initialised?
Important? : There is no valgrind error if i just have dirac0 to dirac5 active. The valgrind error only occurs for dirac6 and dirac7 (with 9 and 10 Dirac Points repsectively).
Aside: i'm pretty sure i used to get the same problem in the "Richards" module where there were no such things as PorousFlow's bizarre "nodal Materials", which might point the finger back at the framework rather than PorousFlow itself.
What if an element has 2 DiracKernels active in it, and each have different number of Dirac points? Does this screw up the MaterialData resize?
To answer my question: i think having 2 differently-sized DiracKernels is fine as i think MOOSE just sums the number of DiracPoints in each element to find the number of quadpoints to use. So the reinit should be fine.
I have to work on something else for the rest of today. In summary, in the DiracKernels block:
if active = 'dirac0 dirac1 dirac2 dirac3 dirac4 dirac5'
then no valgrind errors and correct jacobian is obtained
if active = dirac7
(this has 10 Dirac points) then valgrind error as above (and incorrect jacobian)
Another fact:
type = FDP
then no valgrind errorsThe problem could lie in the lines https://github.com/idaholab/moose/blob/256b2a0c10679ed36ba780f36743bf4ae153a56b/modules/porous_flow/src/dirackernels/PorousFlowLineSink.C#L283-L329 . The reason for my suspecting this is:
type = FDP
; return
statement before them and suddenly i don't get any valgrind errors;use_*
parameters in my DiracKernel those lines don't get activated and similarly i don't get any valgrind errors.But i wonder why?
(*_relative_permeability)[_i][_qp])
is OK. It just seems things like (*_drelative_permeability_dvar)[_i][_ph][pvar])
are not working properly.Is the DerivativeMaterial stuff not playing well with Dirac Kernels with lots of Dirac points? Eg, in the constructor, this produces no problems:
_relative_permeability(
((_use_mobility && _has_mobility) ||
(_use_relative_permeability && _has_relative_permeability))
? &getMaterialProperty<std::vector<Real>>("PorousFlow_relative_permeability_nodal")
: nullptr),
but what i'm leaning towards is that this may cause a problem:
_drelative_permeability_dvar(((_use_mobility && _has_mobility) ||
(_use_relative_permeability && _has_relative_permeability))
? &getMaterialProperty<std::vector<std::vector<Real>>>(
"dPorousFlow_relative_permeability_nodal_dvar")
: nullptr),
Yep, the drelative_permeability_dvar
(and similar) is the problem.
It is constructed: https://github.com/idaholab/moose/blob/256b2a0c10679ed36ba780f36743bf4ae153a56b/modules/porous_flow/src/dirackernels/PorousFlowLineSink.C#L133-L137
The Material Property is declared in a rather roundabout way. First: https://github.com/idaholab/moose/blob/256b2a0c10679ed36ba780f36743bf4ae153a56b/modules/porous_flow/src/materials/PorousFlowRelativePermeabilityBase.C#L44-L49 Second: https://github.com/idaholab/moose/blob/256b2a0c10679ed36ba780f36743bf4ae153a56b/modules/porous_flow/src/materials/PorousFlowJoiner.C#L51
Nice detective work! BTW - I'm going to answer your post that's quite a ways up so we are clear on what Valgrind is telling you:
So i'm getting the same error as you. But i don't really understand what it means and certainly not how to fix it. Is it that a Material property is not being correctly initialized?
Valgrind can only tell you so much since the problem is the absence of something happening (e.g. writing to memory before attempting to read back from it). The error report tells you two things: where a particular chunk of memory was allocated (it doesn't have anything to do with the size that's going into the allocation request). and where that memory was first read (in this case finally down in PETSc). Since there is no write in between, Valgrind knows there is a problem.
Without really diving into this ticket (since John is working on it), my guess is that the number of Dirac points is the critical piece here. You've said several times that a small number works but a big number doesn't. It appears that MOOSE is allocating enough space for the larger number of points, but the problem is that we aren't ever filling in that space. This means that there's a loop somewhere that's not indexing to the right number of values. Maybe as you've already figured out, it's the material reinit loop? Perhaps we are only covering the "normal" number of quadrature points and not hitting the extra ones due to the Dirac kernel being present?
Yep, the drelative_permeability_dvar (and similar) is the problem.
So by "is the problem" I guess you mean you somehow commented this line out and the valgrind errors went away?
Regarding this line of code
I don't think we have any other instances in the framework where declarePropertyDerivative()
is used in a ternary operator. I don't see any problem with it in principle, but it might be simple to try running the code with one branch or the other hard-coded and see if that changes anything? I also don't really see how _drelative_permeability_ds
is related to _drelative_permeability_dvar
...
Yes, by "the problem" i mean that valgrind reports no error when i simply replace all appearances of _drelative_permeability_dvar
(and the other _dXXX_dvar
Material Properties) with _relative_permeability
(and XXX
) in this chunk of code
https://github.com/idaholab/moose/blob/256b2a0c10679ed36ba780f36743bf4ae153a56b/modules/porous_flow/src/dirackernels/PorousFlowLineSink.C#L283-L329
Today i will attempt to get rid of that ternary operator in the constructor and see what happens, but doubt it's the problem as we're using ternary operators all over the place in PorousFlow and have no other problems. If it is the ternary it must be that it must be a problem with the DerivativeMaterials. Getting rid of the ternary will be a large chore because there are many of them: don't try it yourself!
Rationale
Calling @jwpeterson , @andrsd , @rwcarlsen (or anyone else who's keen!).
My DiracKernels don't produce the correct Jacobian when there are too many Dirac points. Can you help, please?
Description
For instance, see
https://github.com/idaholab/moose/blob/devel/modules/porous_flow/test/tests/jacobian/line_sink03.i
That is a very old test and has been skipped since its inception with the comment
skip = 'Only works sporadically -- too many qps?'
.This has not really effected me much as i just try to keep the number of Dirac points to a minimum in each element. But it's starting to annoy me and i wonder if there's something quite fundamentally wrong somewhere.
As the "skip" comment mentions: this is a sporadic failure - sometimes the Jacobian is fine, and sometimes it's not!
Impact
Correct DiracKernel implementation in PorousFlow = good.