sxs-collaboration / spectre

SpECTRE is a code for multi-scale, multi-physics problems in astrophysics and gravitational physics.
https://spectre-code.org

Large memory usage + FPE in InitializeTimeStepperHistory Phase for GhMhd #5629

Open isaaclegred opened 11 months ago

isaaclegred commented 11 months ago

Bug reports:

Expected behavior:

A TOV star should be initializable on a sphere domain for a reasonable range of EoSs and central densities.

Current behavior:

For many combinations of EoS and central density, the InitializeTimeStepperHistory phase fails after substantial memory usage. However, after expanding to multiple nodes (>=6?), the failure appears to actually be due to a floating point exception. Example traceback:

Terminated due to an uncaught exception:

############ ERROR ############
Stack trace:

  0. [error handling]
  1. /panfs/ds09/sxs/isaaclegred/spectre/build/bin/EvolveGhValenciaDivCleanTovStar() [0x4a97509] - Resolve source file and line with: addr2line -fCpe /panfs/ds09/sxs/isaaclegred/spectre/build/bin/EvolveGhValenciaDivCleanTovStar 0x4a97509
  2. [error handling]
  3. /usr/lib64/libpthread.so.0(+0xf100) [0x7f83e062d100] - Resolve source file and line with: addr2line -fCpe /usr/lib64/libpthread.so.0 0xf100
  4. void hydro::relativistic_specific_enthalpy<DataVector>(gsl::not_null<Tensor<DataVector, brigand::list<>, brigand::list<> >*>, Tensor<DataVector, brigand::list<>, brigand::list<> > const&, Tensor<DataVector, brigand::list<>, brigand::list<> > const&, Tensor<DataVector, brigand::list<>, brigand::list<> > const&) in /panfs/ds09/sxs/isaaclegred/spectre/src/PointwiseFunctions/Hydro/SpecificEnthalpy.cpp:18
  5. _ZN5grmhd16ValenciaDivClean2fd40compute_conservatives_for_reconstructionIN7brigand4listIJN2gr4Tags15SpacetimeMetricI10DataVectorLm3EN5Frame8InertialEEEN2gh4Tags2PiIS8_Lm3ESA_EENSD_3PhiIS8_Lm3ESA_EENS0_4Tags6TildeDENSI_7TildeYeENSI_8TildeTauENSI_6TildeSISA_EENSI_6TildeBISA_EENSI_8TildePhiEN5hydro4Tag [...] rILm3EEEEEELm1EEEvN3gsl8not_nullIP9VariablesIT_EEERKN16EquationsOfState15EquationOfStateILb1EXT0_EEE in /panfs/ds09/sxs/isaaclegred/spectre/src/Evolution/Systems/GrMhd/ValenciaDivClean/FiniteDifference/ReconstructWork.tpp:126
  6. _ZN5grmhd18GhValenciaDivClean2fd22reconstruct_prims_workIN7brigand4listIJN2gr4Tags15SpacetimeMetricI10DataVectorLm3EN5Frame8InertialEEEEEENS4_IJN5hydro4Tags15RestMassDensityIS8_EENSE_16ElectronFractionIS8_EENSE_11TemperatureIS8_EENSE_33LorentzFactorTimesSpatialVelocityIS8_Lm3ESA_EENSE_13MagneticFiel [...] T5_RKT6_RKT7_RKS2G_IS3N_ERKS2G_IS3Q_ERKNS2T_ILb1EXT4_EEES30_RKS31_ILm24ES37_S2G_IT8_ES3C_S3E_ES3L_mb in /panfs/ds09/sxs/isaaclegred/spectre/src/Evolution/Systems/GrMhd/GhValenciaDivClean/FiniteDifference/ReconstructWork.tpp:230
  7. _ZNK5grmhd18GhValenciaDivClean2fd22MonotonisedCentralPrim11reconstructILm1EN7brigand4listIJN2gr4Tags15SpacetimeMetricI10DataVectorLm3EN5Frame8InertialEEEN2gh4Tags2PiIS9_Lm3ESB_EENSE_3PhiIS9_Lm3ESB_EENS_16ValenciaDivClean4Tags6TildeDENSK_7TildeYeENSK_8TildeTauENSK_6TildeSISB_EENSK_6TildeBISB_EENSK_8T [...] ionILm3EE9ElementIdILm3EEENS22_7subcell9GhostDataEN5boost4hashIS34_EESt8equal_toIS34_EERK4MeshILm3EE in /panfs/ds09/sxs/isaaclegred/spectre/src/Evolution/Systems/GrMhd/GhValenciaDivClean/FiniteDifference/MonotonisedCentral.cpp:92
  8. _Z22call_with_dynamic_typeIvN7brigand4listIJN5grmhd18GhValenciaDivClean2fd22MonotonisedCentralPrimENS4_37PositivityPreservingAdaptiveOrderPrimENS4_10Wcns5zPrimEEEEKNS4_13ReconstructorEZZNS3_7subcell14TimeDerivative5applyINS1_IJN8Parallel4Tags17MetavariablesImplI17EvolutionMetavarsIN2gh9Solutions9Wra [...] orrections13UpwindPenaltyILm3EEENS1V_19BoundaryCorrections7RusanovEEEEEDaSAS_EUlRSAR_E_ESAN_PT1_OT2_ in /panfs/ds09/sxs/isaaclegred/spectre/src/Utilities/CallWithDynamicType.hpp:26
  9. _ZZN5grmhd18GhValenciaDivClean7subcell14TimeDerivative5applyIN7brigand4listIJN8Parallel4Tags17MetavariablesImplI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEEENS7_14ArrayIndexImplI9ElementIdILm3EEEENS7_16GlobalCacheProxyISI_EEN4Tags4TimeEN1 [...] ectionsINSA_19BoundaryCorrections13UpwindPenaltyILm3EEENS1N_19BoundaryCorrections7RusanovEEEEEDaSAL_ in /panfs/ds09/sxs/isaaclegred/spectre/src/Evolution/Systems/GrMhd/GhValenciaDivClean/Subcell/TimeDerivative.hpp:574
 10. _Z22call_with_dynamic_typeIvN7brigand4listIJN5grmhd18GhValenciaDivClean19BoundaryCorrections20ProductOfCorrectionsIN2gh19BoundaryCorrections13UpwindPenaltyILm3EEENS2_16ValenciaDivClean19BoundaryCorrections3HllEEENS5_IS9_NSB_7RusanovEEEEEEKNS4_18BoundaryCorrectionEZNS3_7subcell14TimeDerivative5applyI [...] patialMetricComputeIS1T_Lm3ES1V_EEEEEEEvN3gsl8not_nullIPN2db7DataBoxIT_EEEEEUlPKSAU_E0_ESAU_PT1_OT2_ in /panfs/ds09/sxs/isaaclegred/spectre/src/Utilities/CallWithDynamicType.hpp:41
 11. _ZN5grmhd18GhValenciaDivClean7subcell14TimeDerivative5applyIN7brigand4listIJN8Parallel4Tags17MetavariablesImplI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEEENS7_14ArrayIndexImplI9ElementIdILm3EEEENS7_16GlobalCacheProxyISI_EEN4Tags4TimeEN14 [...] Lm3ES2G_EENS1C_27SqrtDetSpatialMetricComputeIS1E_Lm3ES1G_EEEEEEEvN3gsl8not_nullIPN2db7DataBoxIT_EEEE in /panfs/ds09/sxs/isaaclegred/spectre/src/Evolution/Systems/GrMhd/GhValenciaDivClean/Subcell/TimeDerivative.hpp:813
 12. _ZN8Parallel17DistributedObjectI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJNS_12PhaseActionsILNS_5PhaseE9ENSD_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISB_Lb0EEEN9evolution2dg14Initi [...] _4TimeEEES1Z_EEEEEEEEES6F_E22invoke_iterable_actionIS5J_St17integral_constantImLm2EES6J_ImLm24EEEEbv in /panfs/ds09/sxs/isaaclegred/spectre/src/Parallel/DistributedObject.hpp:1101
 13. _ZN8Parallel17DistributedObjectI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJNS_12PhaseActionsILNS_5PhaseE9ENSD_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISB_Lb0EEEN9evolution2dg14Initi [...] m28ELm29ELm30ELm31ELm32ELm33ELm34ELm35ELm36ELm37ELm38ELm39ELm40EEEEbSt16integer_sequenceImJXspT0_EEE in /panfs/ds09/sxs/isaaclegred/spectre/src/Parallel/DistributedObject.hpp:1040
 14. _ZN8Parallel17DistributedObjectI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJNS_12PhaseActionsILNS_5PhaseE9ENSD_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISB_Lb0EEEN9evolution2dg14Initi [...] EEEENSE_ILSF_13ENSD_IJNS25_18RunEventsOnFailureINS1F_4TimeEEES1Z_EEEEEEEEES6F_E17perform_algorithmEv in /panfs/ds09/sxs/isaaclegred/spectre/src/Parallel/DistributedObject.hpp:914
 15. _ZN28CProxyElement_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJN8Parallel12PhaseActionsILNSD_5PhaseE9ENSC_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISA_Lb0EEEN9evolution [...] 5tupleIJ4MeshILm3EES6T_ILm2EESt8optionalIS2C_ES6X_10TimeStepIdiEEES6Y_EEvOT1_OT0_bPK14CkEntryOptions in /panfs/ds09/sxs/isaaclegred/spectre/build/src/Parallel/Algorithms/AlgorithmArray.def.h:598
 16. _ZN9evolution2dg7subcell7Actions25SendDataForReconstructionILm3EN5grmhd18GhValenciaDivClean7subcell23PrimitiveGhostVariablesELb0EE5applyIN7brigand4listIJN8Parallel4Tags17MetavariablesImplI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEEENSD_1 [...] optionalImEEERN2db7DataBoxIT_EERN6tuples11TaggedTupleIJDpT0_EEERNSC_11GlobalCacheIT4_EERKT1_T2_PKT3_ in /panfs/ds09/sxs/isaaclegred/spectre/src/Evolution/DgSubcell/Actions/ReconstructionCommunication.hpp:250
 17. _ZN8Parallel17DistributedObjectI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJNS_12PhaseActionsILNS_5PhaseE9ENSD_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISB_Lb0EEEN9evolution2dg14Initi [...] _4TimeEEES1Z_EEEEEEEEES6F_E22invoke_iterable_actionIS57_St17integral_constantImLm2EES6J_ImLm20EEEEbv in /panfs/ds09/sxs/isaaclegred/spectre/src/Parallel/DistributedObject.hpp:1101
 18. _ZN8Parallel17DistributedObjectI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJNS_12PhaseActionsILNS_5PhaseE9ENSD_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISB_Lb0EEEN9evolution2dg14Initi [...] m28ELm29ELm30ELm31ELm32ELm33ELm34ELm35ELm36ELm37ELm38ELm39ELm40EEEEbSt16integer_sequenceImJXspT0_EEE in /panfs/ds09/sxs/isaaclegred/spectre/src/Parallel/DistributedObject.hpp:1040
 19. _ZN8Parallel17DistributedObjectI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJNS_12PhaseActionsILNS_5PhaseE9ENSD_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISB_Lb0EEEN9evolution2dg14Initi [...] EEEENSE_ILSF_13ENSD_IJNS25_18RunEventsOnFailureINS1F_4TimeEEES1Z_EEEEEEEEES6F_E17perform_algorithmEv in /panfs/ds09/sxs/isaaclegred/spectre/src/Parallel/DistributedObject.hpp:914
 20. _ZN8Parallel17DistributedObjectI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJNS_12PhaseActionsILNS_5PhaseE9ENSD_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISB_Lb0EEEN9evolution2dg14Initi [...] geEEEEEENSE_ILSF_13ENSD_IJNS25_18RunEventsOnFailureINS1F_4TimeEEES1Z_EEEEEEEEES6F_E11start_phaseESF_ in /panfs/ds09/sxs/isaaclegred/spectre/src/Parallel/DistributedObject.hpp:976
 21. _ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN2gh9Solutions9WrappedGrIN17RelativisticEuler9Solutions7TovStarEEELb0EJ10BondiSachsEEN7brigand4listIJN8Parallel12PhaseActionsILNSD_5PhaseE9ENSC_IJN14Initialization7Actions15InitializeItemsIJNSG_12TimeSteppingISA_Lb0EEEN9evolution2dg14I [...] 18RunEventsOnFailureINS1F_4TimeEEES1Z_EEEEEEEEE9ElementIdILm3EEE27_call_start_phase_marshall8EPvS6K_ in /panfs/ds09/sxs/isaaclegred/spectre/build/src/Parallel/Algorithms/AlgorithmArray.def.h:1347
 22. CkDeliverMessageReadonly in /usr/local/charm/7.0.0-intelmpi/src/ck-core/ck.C:587
 23. CkLocRec::invokeEntry(CkMigratable*, void*, int, bool) in /usr/local/charm/7.0.0-intelmpi/src/ck-core/cklocation.C:2263
 24. CkArrayBroadcaster::deliver(CkArrayMessage*, ArrayElement*, bool) in /usr/local/charm/7.0.0-intelmpi/src/ck-core/ckarray.C:1371
 25. CkArray::recvBroadcast(CkMessage*) in /usr/local/charm/7.0.0-intelmpi/src/ck-core/ckarray.C:1683
 26. CkDeliverMessageFree in /usr/local/charm/7.0.0-intelmpi/src/ck-core/ck.C:553
 27. _processHandler(void*, CkCoreState*) in /usr/local/charm/7.0.0-intelmpi/src/ck-core/ck.C:1250
 28. CsdScheduleForever in /usr/local/charm/7.0.0-intelmpi/src/conv-core/convcore.C:1943
 29. CsdScheduler in /usr/local/charm/7.0.0-intelmpi/src/conv-core/convcore.C:1888
 30. ConverseRunPE in /usr/local/charm/7.0.0-intelmpi/src/arch/util/machine-common-core.C:1615
 31. call_startfn in /usr/local/charm/7.0.0-intelmpi/src/arch/util/machine-smp.C:372
 32. /usr/lib64/libpthread.so.0(+0x7dc5) [0x7f83e0625dc5] - Resolve source file and line with: addr2line -fCpe /usr/lib64/libpthread.so.0 0x7dc5
 33. clone - Resolve source file and line with: addr2line -fCpe /usr/lib64/libc.so.6 0xf6ced

Wall time: 00:03:12
Node: 5 Proc: 117
void {anonymous}::fpe_signal_handler(int) in /panfs/ds09/sxs/isaaclegred/spectre/src/Utilities/ErrorHandling/FloatingPointExceptions.cpp:34

Floating point exception!
############ ERROR ############

Environment:

develop at commit e621697ef, using the wheeler_gcc.sh environment.

Detailed discussion:

The input file needed to reproduce this is just the GhMhd TOV star test input file (tests/InputFiles/GrMhd/GhValenciaDivClean/GhMhdTovStar.yaml) with the domain replaced by a sphere:

AnalyticSolution: &InitialData
  GeneralizedHarmonic(TovStar):
    CentralDensity: 1.38e-3
    EquationOfState:
      PolytropicFluid:
        PolytropicConstant: 100.0
        PolytropicExponent: 2.0
    Coordinates: Schwarzschild

DomainCreator:
  Sphere:
    InnerRadius: 34.6410161514
    OuterRadius: 100
    Interior:
      FillWithSphericity: 0.0
    InitialRefinement: 4
    InitialGridPoints: [6, 6, 6]
    EquatorialCompression: None
    WhichWedges: All
    RadialPartitioning: []
    RadialDistribution: [Linear]
    UseEquiangularMap: true
    TimeDependentMaps: None
    OuterBoundaryCondition:
      DirichletAnalytic:
        AnalyticPrescription: *InitialData

wthrowe commented 11 months ago

Do you have a similar configuration that works, for comparison?

isaaclegred commented 11 months ago

Actually, I'm not sure anything is working, including with other time steppers. I think this problem is happening across hydro, and it may have nothing to do with memory if it turns out that the FPE is somehow causing the ridiculous memory usage. It seems every configuration I can generate, with both brick and sphere domains, fails on develop with the default settings in the input file. Each one fails with the same error as above, and always after using up a large amount of memory.

The one example I thought was working is on a branch, and it uses a runtime EoS different from the initial data EoS, which is not currently supported on develop. Now it's not clear whether that was working either or just failed more slowly. One thing that characterized that run was that it wasn't using atmosphere anywhere. Maybe the default hydro settings from the input file just don't work, but I've been playing around with them and can't get the problem to go away.

Maybe @nilsdeppe could provide an example of an input file that really should work?

wthrowe commented 11 months ago

I can reproduce an FPE on my desktop at lower resolution. (Initial refinement 3 instead of 4. I don't have enough RAM for 4.)

It doesn't seem to have anything to do with self-start. I get it even doing a single Euler step. The run reached about 13GB of RAM before crashing. I don't have enough experience with this system to know if that's unreasonable or not.
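
For a rough sense of how the problem size grows between refinement levels 3 and 4, here is a back-of-the-envelope sketch. It assumes the Sphere domain above has 7 blocks (6 wedges plus a central cube, since RadialPartitioning is empty), that each refinement level splits every element in two along each dimension, and that memory scales linearly with element count. None of this is measured, and the 13GB figure is only what the refinement-3 run reached before crashing.

```python
def element_count(initial_refinement: int, num_blocks: int = 7, dim: int = 3) -> int:
    """Elements after uniform refinement: each level doubles the count per dimension.

    num_blocks = 7 is an assumption about the Sphere domain (6 wedges + 1 cube).
    """
    return num_blocks * 2 ** (dim * initial_refinement)


def grid_point_count(initial_refinement: int, points_per_dim: int = 6) -> int:
    """Total DG grid points for InitialGridPoints: [6, 6, 6]."""
    return element_count(initial_refinement) * points_per_dim ** 3


def scaled_memory_gb(observed_gb: float, observed_refinement: int,
                     target_refinement: int) -> float:
    """Naive extrapolation that assumes memory is linear in the element count."""
    ratio = element_count(target_refinement) / element_count(observed_refinement)
    return observed_gb * ratio


if __name__ == "__main__":
    for level in (3, 4):
        print(level, element_count(level), grid_point_count(level))
    print(scaled_memory_gb(13.0, 3, 4))
```

Under these assumptions, refinement 4 has 8 times as many elements as refinement 3, so a 13GB footprint at level 3 would extrapolate to roughly 104GB at level 4.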

wthrowe commented 11 months ago

Best guess based on the backtrace is the density is zero somewhere, so hydro::relativistic_specific_enthalpy FPEs.
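
To illustrate why zero density is a problem: the relativistic specific enthalpy is $h = 1 + \epsilon + p/\rho$, which divides by the rest mass density. The sketch below is plain Python, not SpECTRE code. The function names are chosen for illustration, the polytrope uses the constants from the input file above, and Python's ZeroDivisionError stands in for the trapped floating point exception.

```python
def polytropic_pressure(rho: float, K: float = 100.0, gamma: float = 2.0) -> float:
    """Polytropic EoS: p = K * rho^gamma."""
    return K * rho ** gamma


def polytropic_specific_internal_energy(rho: float, K: float = 100.0,
                                        gamma: float = 2.0) -> float:
    """Polytropic EoS: eps = K * rho^(gamma - 1) / (gamma - 1)."""
    return K * rho ** (gamma - 1.0) / (gamma - 1.0)


def relativistic_specific_enthalpy(rho: float, eps: float, p: float) -> float:
    """h = 1 + eps + p / rho. The division fails when rho == 0."""
    return 1.0 + eps + p / rho


def enthalpy_from_density(rho: float) -> float:
    """Evaluate h for the polytrope at a given rest mass density."""
    eps = polytropic_specific_internal_energy(rho)
    p = polytropic_pressure(rho)
    return relativistic_specific_enthalpy(rho, eps, p)


if __name__ == "__main__":
    print(enthalpy_from_density(1.38e-3))  # finite at the central density
    try:
        enthalpy_from_density(0.0)  # 0.0 / 0.0
    except ZeroDivisionError as error:
        print("zero density:", error)
```

At the central density everything is finite, while at zero density the term p / rho becomes 0/0.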

wthrowe commented 11 months ago

The FPE cause appears to be that the analytic solution is zero at the domain boundary, and the DirichletAnalytic boundary condition can't handle that. (This is a documented limitation of the class.)
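
As a rough cross-check that the outer boundary lies in vacuum, here is a Newtonian estimate rather than the TOV solution itself. It assumes geometrized units ($G = c = 1$) and uses the Lane-Emden solution for a $\Gamma = 2$ polytrope, whose radius $R = \sqrt{\pi K / 2}$ is independent of the central density. The relativistic star is more compact than this estimate, so the conclusion only gets stronger. Function names are hypothetical.

```python
import math


def newtonian_radius_gamma2(polytropic_constant: float) -> float:
    """Newtonian radius of a Gamma = 2 polytrope with G = 1: R = sqrt(pi * K / 2)."""
    return math.sqrt(math.pi * polytropic_constant / 2.0)


def newtonian_density_gamma2(r: float, central_density: float,
                             polytropic_constant: float) -> float:
    """Lane-Emden n = 1 profile: rho = rho_c * sin(xi) / xi inside, 0 outside."""
    radius = newtonian_radius_gamma2(polytropic_constant)
    if r >= radius:
        return 0.0  # vacuum exterior
    if r == 0.0:
        return central_density  # limit of sin(xi) / xi as xi -> 0
    xi = math.pi * r / radius
    return central_density * math.sin(xi) / xi


if __name__ == "__main__":
    K, rho_c = 100.0, 1.38e-3
    print("estimated radius:", newtonian_radius_gamma2(K))
    for r in (0.0, 5.0, 34.6410161514, 100.0):
        print(r, newtonian_density_gamma2(r, rho_c, K))
```

With PolytropicConstant: 100.0 the estimated radius is about 12.5, well inside OuterRadius: 100, so the density that the DirichletAnalytic condition evaluates at the outer boundary is exactly zero.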

nilsdeppe commented 11 months ago

I will see how easily I can replace $h$ with $\rho h$ as the primitive. I think this should be fairly trivial. That'll just fix the origin of the FPE. We don't actually use $h$ anywhere.
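
A minimal sketch of why $\rho h$ is safer than $h$: since $h = 1 + \epsilon + p/\rho$, the product is $\rho h = \rho (1 + \epsilon) + p$, which involves no division and is well defined at $\rho = 0$. This is illustrative Python with a hypothetical function name, not the change that was actually made in SpECTRE.

```python
def density_times_specific_enthalpy(rho: float, eps: float, p: float) -> float:
    """rho * h = rho * (1 + eps) + p, computed without dividing by rho."""
    return rho * (1.0 + eps) + p


if __name__ == "__main__":
    # Polytrope from the input file above: K = 100, Gamma = 2,
    # so p = K * rho^2 and eps = K * rho.
    K = 100.0
    for rho in (1.38e-3, 0.0):
        eps = K * rho
        p = K * rho ** 2
        print(rho, density_times_specific_enthalpy(rho, eps, p))
```

At zero density this simply evaluates to zero instead of raising a floating point exception.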

wthrowe commented 5 months ago

The FPE should be fixed by #5631. Is there still a problem with memory, or does that resolve this issue?