sxs-collaboration / spectre

SpECTRE is a code for multi-scale, multi-physics problems in astrophysics and gravitational physics.
https://spectre-code.org

GH+CCE runs on wheeler #3782

Open Sizheng-Ma opened 2 years ago

Sizheng-Ma commented 2 years ago

Bug reports:

Expected behavior:

Current behavior:

I'm trying to run the CCE-GH executable (#2323) on wheeler. The GH domain is as follows:

DomainCreator:
  Shell:
    InnerRadius: 1.9
    OuterRadius: 200.
    InitialGridPoints: [8,8]
    InitialRefinement: 2
    UseEquiangularMap: true
    AspectRatio: 1.0
    WhichWedges: All
    TimeDependence: None
    RadialDistribution: [Logarithmic,Logarithmic,Logarithmic,Logarithmic,Logarithmic,Logarithmic,Logarithmic,Logarithmic,Logarithmic,Logarithmic,Logarithmic]
    RadialPartitioning: [3.0, 5.0, 7.0, 10.0, 20.0,40.0, 60.0,80,120,160]
    BoundaryConditions:
      OuterBoundary:
        ConstraintPreservingBjorhus:
                Type: ConstraintPreservingPhysical
      InnerBoundary:
        Outflow

and the CCE grid is

Cce:
  LMax: 14
  ExtractionRadius: 198
  NumberOfRadialPoints: 58

Running on 4 nodes, the run advances at least one time step per second. However, if I add one more radial or angular grid point to each element (namely InitialGridPoints: [9,8] or InitialGridPoints: [8,9]), the run doesn't take even a single time step within five minutes. This doesn't happen every time, but it happens frequently, and it goes away when I request fewer nodes. The CCE grid has no effect on this issue.

Environment:

Using all modules in wheeler_clang.sh

Detailed discussion:

Sizheng-Ma commented 2 years ago

Using 3 nodes, the run dies after 25M. The error message is as follows:

Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946d58] Address for addr2line: 0x2946d58
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946b16] Address for addr2line: 0x2946b16
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8efe6) [0x7f317bc44fe6] Address for addr2line: 0x8efe6
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8f031) [0x7f317bc45031] Address for addr2line: 0x8f031
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x1d2ca0b] Address for addr2line: 0x1d2ca0b
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE26reg_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEiv+0) [0x1f723f0] Address for addr2line: 0x1f723f0

/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE28_call_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEvPvS5C_+0x193) [0x1f72a33] Address for addr2line: 0x1f72a33
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(CkDeliverMessageFree+0x21) [0x4930481] Address for addr2line: 0x4930481
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x41) [0x4955f21] Address for addr2line: 0x4955f21
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_Z15_processHandlerPvP11CkCoreState+0x359) [0x4937e49] Address for addr2line: 0x4937e49
End shortened stack trace.

Node: 2 Proc: 46
Line: 17 of /home/sma/spectre/src/Parallel/InitializationFunctions.cpp
Function: auto setup_error_handling()::(anonymous class)::operator()() const
Terminated due to an uncaught exception: vector::_M_default_append

kidder commented 2 years ago

The GR domain you are using seems to be overkill in resolution (44 radial elements by 96 angular elements, each with 512 grid points is over 2 million grid points...)
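For reference, the numbers quoted above can be reproduced from the input file with a short sketch. It rests on assumptions that the thread does not spell out: each radial layer of the Shell is made of 6 wedges, InitialRefinement: 2 splits every wedge into 2^2 elements per dimension, and InitialGridPoints: [8,8] means 8 radial points and 8 points in each of the two angular directions. The variable names are illustrative only.

```python
# Values copied from the DomainCreator block in the first comment.
radial_partitioning = [3.0, 5.0, 7.0, 10.0, 20.0, 40.0, 60.0, 80, 120, 160]
initial_refinement = 2
radial_points, angular_points = 8, 8

# Assumption: N partition radii give N + 1 radial layers.
radial_layers = len(radial_partitioning) + 1
# Assumption: refinement level r means 2**r elements per dimension per wedge.
elements_per_dim = 2 ** initial_refinement

radial_elements = radial_layers * elements_per_dim
# Assumption: 6 wedges cover the sphere, each refined in two angular dimensions.
angular_elements = 6 * elements_per_dim ** 2
points_per_element = radial_points * angular_points ** 2

total_points = radial_elements * angular_elements * points_per_element
print(radial_elements, angular_elements, points_per_element, total_points)
```

Under these assumptions the sketch gives 44 radial elements, 96 angular elements, 512 points per element, and 2162688 grid points in total, matching the figures in the comment.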

kidder commented 2 years ago

Also, have you run the problem with an executable compiled with build type Debug instead of Release?

Sizheng-Ma commented 2 years ago

> The GR domain you are using seems to be overkill in resolution (44 radial elements by 96 angular elements, each with 512 grid points is over 2 million grid points...)

I do need more than 2 million grid points; otherwise the ringdown is pretty noisy.

Sizheng-Ma commented 2 years ago

The run died with a floating point exception after I switched to debug mode.

Sizheng-Ma commented 2 years ago

############ ERROR ############
Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/GH1/EvolveGhCceKerrSchild() [0xb05978d] Address for addr2line: 0xb05978d
/usr/lib64/libc.so.6(+0x35670) [0x7f945c726670] Address for addr2line: 0x35670
/usr/local/openblas/0.2.18/lib/libopenblas.so.0(dgemm_kernel+0x19b8) [0x7f945dc1e5b8] Address for addr2line: 0x2dd5b8
End shortened stack trace.

Node: 2 Proc: 67
Line: 23 of /home/sma/spectre/src/Utilities/ErrorHandling/FloatingPointExceptions.cpp
Function: void (anonymous namespace)::fpe_signal_handler(int)
Floating point exception!
############ ERROR ############

Sizheng-Ma commented 2 years ago

This is the error message when the address sanitizer is on:

==14984==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==24349==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==6129==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
srun: error: wheeler085: task 0: Aborted
==14984==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
srun: error: wheeler087: task 2: Aborted
srun: error: wheeler086: task 1: Aborted
==24349==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
==6129==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v

wthrowe commented 2 years ago

> failed while trying to map 0xdfff0001000 bytes

That's almost 14 terabytes.
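The size can be checked directly from the hexadecimal value in the AddressSanitizer message. The sketch below is plain arithmetic; the only choice made here is to count in binary units (1 TiB = 2^40 bytes).

```python
# Size that AddressSanitizer tried to reserve, copied from the log above.
requested_bytes = 0xdfff0001000

# The log also prints the decimal value; confirm that the two agree.
assert requested_bytes == 15392894357504

TIB = 2 ** 40
size_in_tib = requested_bytes / TIB
# How far the request falls short of exactly 14 TiB, in MiB.
shortfall_mib = (14 * TIB - requested_bytes) / 2 ** 20

print(f"{size_in_tib:.4f} TiB, {shortfall_mib:.1f} MiB short of 14 TiB")
```

The request falls short of 14 TiB by roughly 256 MiB, so "almost 14 terabytes" is accurate.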

Something is weird here. Every test seems to give a completely different error.

wthrowe commented 2 years ago

Does the run still fail if you reduce the number of elements enough that you can run on one node? It may not be a useful run, but single-node runs are easier to debug.

Sizheng-Ma commented 2 years ago

Yes, it still fails after I reduce the resolution and run it on a single node.

Evolution:
  InitialTime: 0.0
  InitialTimeStep: 0.002
  TimeStepper:
    AdamsBashforthN:
      Order: 3

DomainCreator:
  Shell:
    InnerRadius: 1.9
    OuterRadius: 200.
    InitialGridPoints: [2,2]
    InitialRefinement: 0
    UseEquiangularMap: true
    AspectRatio: 1.0
    WhichWedges: All
    TimeDependence: None
    RadialDistribution: [Logarithmic,Logarithmic]
    RadialPartitioning: [3.0]
    BoundaryConditions:
      OuterBoundary:
        ConstraintPreservingBjorhus:
                Type: ConstraintPreservingPhysical
      InnerBoundary:
        Outflow

AnalyticSolution:
  KerrSchild:
    Mass: 1.0
    Spin: [0.0, 0.0, 0.0]
    Center: [0.0, 0.0, 0.0]

EvolutionSystem:
  GeneralizedHarmonic:
    # The parameter choices here come from our experience with the Spectral
    # Einstein Code (SpEC). They should be suitable for evolutions of a
    # perturbation of a Kerr-Schild black hole.
    DhGaugeParameters:
      RollOnStartTime: 100000.0
      RollOnTimeWindow: 100.0
      SpatialDecayWidth: 50.0
      Amplitudes: [1.0, 1.0, 1.0]
      Exponents: [4, 4, 4]
    DampingFunctionGamma0:
      GaussianPlusConstant:
        Constant: 0.001
        Amplitude: 3.0
        Width: 11.313708499
        Center: [0.0, 0.0, 0.0]
    DampingFunctionGamma1:
      GaussianPlusConstant:
        Constant: -1.0
        Amplitude: 0.0
        Width: 11.313708499
        Center: [0.0, 0.0, 0.0]
    DampingFunctionGamma2:
      GaussianPlusConstant:
        Constant: 0.001
        Amplitude: 1.0
        Width: 11.313708499
        Center: [0.0, 0.0, 0.0]

SpatialDiscretization:
  DiscontinuousGalerkin:
    Formulation: StrongInertial
    Quadrature: GaussLobatto
  BoundaryCorrection:
    UpwindPenalty:

EventsAndTriggers:
  ? Slabs:
      EvenlySpaced:
        Interval: 2000
        Offset: 0
  : - ObserveErrorNorms:
        SubfileName: Errors
  ? Slabs:
      EvenlySpaced:
        Interval: 300000
        Offset: 0
  : - ObserveFields:
        SubfileName: VolumeData
        VariablesToObserve:
          #- SpacetimeMetric
          #- Pi
          #- Phi
          - PointwiseL2Norm(GaugeConstraint)
          #- PointwiseL2Norm(ThreeIndexConstraint)
          #- PointwiseL2Norm(FourIndexConstraint)
        InterpolateToMesh: None
        CoordinatesFloatingPointType: Double
        FloatingPointTypes: [Double]
  #? Slabs:
  #    EvenlySpaced:
  #      Interval: 5
  #      Offset: 2
  #: - AhA
  ? Slabs:
      Specified:
        Values: [30000000000000000]
  : - Completion

Observers:
  VolumeFileName: "GhKerrSchildVolume"
  ReductionFileName: "GhKerrSchildReductions"
ApparentHorizons:
  AhA:
    InitialGuess:
      Lmax: 12
      Radius: 2.0
      Center: [0.0, 0.0, 0.0]
    FastFlow:
      Flow: Fast
      Alpha: 1.0
      Beta: 0.5
      AbsTol: 1e-12
      TruncationTol: 1e-2
      DivergenceTol: 1.2
      DivergenceIter: 5
      MaxIts: 100
    Verbosity: Verbose

Cce:
  Evolution:
    TimeStepper:
      AdamsBashforthN:
        Order: 3
    InitialSlabSize: 0.002
    StepChoosers:
      - Constant: 1.0
      - Increase:
          Factor: 2
      - ErrorControl(SwshVars):
          AbsoluteTolerance: 1e-8
          RelativeTolerance: 1e-6
          MaxFactor: 2
          MinFactor: 0.25
          SafetyFactor: 0.9
      - ErrorControl(CoordVars):
          AbsoluteTolerance: 1e-8
          RelativeTolerance: 1e-7
          MaxFactor: 2
          MinFactor: 0.25
          SafetyFactor: 0.9
    StepController:
      BinaryFraction

  LMax: 14
  ExtractionRadius: 198
  NumberOfRadialPoints: 14
  ObservationLMax: 4

  InitializeJ:
    InverseCubic

  StartTime: 0.0

  Filtering:
    RadialFilterHalfPower: 24
    RadialFilterAlpha: 35.0
    FilterLMax: 12

  GhInterfaceManager:
    #GhLocalTimeStepping:
    #  AdamsBashforthOrder: 3
    GhLockstep:

  ScriInterpOrder: 5
  ScriOutputDensity: 1

InterpolationTargets:
  CceWorldtubeTarget:
        Lmax: 14
        Center: [0.0, 0.0, 0.0]
        DimensionlessSpin: [0.0, 0.0, 0.0]
        Mass: 99
        ThetaVariesFastest: false

PhaseChangeAndTriggers:
  - - Slabs:
        Specified:
          Values: [5,55,105]
    - - VisitAndReturn(LoadBalancing)
  - - Slabs:
       EvenlySpaced:
         Interval: 100
         Offset: 0
    - - CheckpointAndExitAfterWallclock:
          WallclockHours: 23.4

Filtering:
  ExpFilter0:
    Alpha: 36.0
    HalfPower: 150
    DisableForDebugging: true

The submission script is:

#!/bin/bash -
#SBATCH -o spectre.out
#SBATCH -e spectre.err
#SBATCH --ntasks-per-node 24
#SBATCH -A sxs
#SBATCH --no-requeue
#SBATCH -J tt_ccm_GH1
#SBATCH --nodes 1
#SBATCH -t 24:00:00
#SBATCH --mem=50GB

# Distributed under the MIT License.
# See LICENSE.txt for details.

# To run a job on Wheeler:
# - Set the -J, --nodes, and -t options above, which correspond to job name,
#   number of nodes, and wall time limit in HH:MM:SS, respectively.
# - Set the build directory, run directory, executable name,
#   and input file below. The input file path is relative to ${RUN_DIR}.
#
# NOTE: The executable will not be copied from the build directory, so if you
#       update your build directory this file will use the updated executable.
#
# Optionally, if you need more control over how SpECTRE is launched on
# Wheeler you can edit the launch command at the end of this file directly.
#
# To submit the script to the queue run:
#   sbatch Wheeler.sh

export SPECTRE_BUILD_DIR=/home/sma/spectre/build/
export RUN_DIR=${PWD}/Run
export SPECTRE_EXECUTABLE=${PWD}/EvolveGhCceKerrSchild
export SPECTRE_INPUT_FILE=${PWD}/KerrSchildWithCce.yaml

############################################################################
# Set desired permissions for files created with this script
umask 0022

export PATH=${SPECTRE_BUILD_DIR}/bin:$PATH
mkdir ${RUN_DIR}
cd ${RUN_DIR}

# The 23 is there because Charm++ uses one thread per node for communication
srun -n ${SLURM_JOB_NUM_NODES} -c 24 \
     ${SPECTRE_EXECUTABLE} ++ppn 23 \
     --input-file ${SPECTRE_INPUT_FILE}

wthrowe commented 2 years ago

OK, good. If you are running on one node, you should be able to log into that node and attach gdb to the process (gdb -p <PID>) and let it run until it fails. Don't forget to give the gdb command catch throw before continuing the execution so you will see C++ exceptions.

I don't know how familiar you are with gdb, but you shouldn't need anything esoteric for this. Googling for a basic gdb guide should get you what you need if you're not familiar.

Sizheng-Ma commented 2 years ago

The run failed immediately with the address sanitizer on. I didn't have a chance to try gdb.

wthrowe commented 2 years ago

What if you try it with the address sanitizer off?

Sizheng-Ma commented 2 years ago

The run fails immediately if I use debug mode.

The error message is as follows:

############ ERROR ############
Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/GH1/EvolveGhCceKerrSchild() [0xb05978d] Address for addr2line: 0xb05978d
/usr/lib64/libc.so.6(+0x35670) [0x7f945c726670] Address for addr2line: 0x35670
/usr/local/openblas/0.2.18/lib/libopenblas.so.0(dgemm_kernel+0x19b8) [0x7f945dc1e5b8] Address for addr2line: 0x2dd5b8
End shortened stack trace.

Node: 2 Proc: 67
Line: 23 of /home/sma/spectre/src/Utilities/ErrorHandling/FloatingPointExceptions.cpp
Function: void (anonymous namespace)::fpe_signal_handler(int)
Floating point exception!
############ ERROR ############

The run starts normally if I use release mode, but it fails after 25M. The error message is:

Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946d58] Address for addr2line: 0x2946d58
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946b16] Address for addr2line: 0x2946b16
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8efe6) [0x7f317bc44fe6] Address for addr2line: 0x8efe6
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8f031) [0x7f317bc45031] Address for addr2line: 0x8f031
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x1d2ca0b] Address for addr2line: 0x1d2ca0b
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE26reg_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEiv+0) [0x1f723f0] Address for addr2line: 0x1f723f0

/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE28_call_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEvPvS5C_+0x193) [0x1f72a33] Address for addr2line: 0x1f72a33
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(CkDeliverMessageFree+0x21) [0x4930481] Address for addr2line: 0x4930481
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x41) [0x4955f21] Address for addr2line: 0x4955f21
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_Z15_processHandlerPvP11CkCoreState+0x359) [0x4937e49] Address for addr2line: 0x4937e49
End shortened stack trace.

Node: 2 Proc: 46
Line: 17 of /home/sma/spectre/src/Parallel/InitializationFunctions.cpp
Function: auto setup_error_handling()::(anonymous class)::operator()() const
Terminated due to an uncaught exception: vector::_M_default_append

Sizheng-Ma commented 2 years ago

The same run proceeds without failure on Frontera.

kidder commented 2 years ago

Can you do a run on wheeler using EvolveGhKerrSchild with the same input file (commenting out the CCE parts)?

Sizheng-Ma commented 2 years ago

The EvolveGhKerrSchild run proceeds normally (debug mode).

kidder commented 2 years ago

The floating point exceptions are being generated by interpolating NaNs for the time derivatives of the GH variables to the world tube. This will happen in debug mode (and not release mode) if a DataMesh/Variables/Tensor is allocated but not initialized. (In debug mode the allocation initializes the DataMesh to NaN in order to catch this problem. In release mode the executable will use whatever values happen to be in memory, leading to random behavior.) My suspicion is that the culprit is the "change the order of step_actions" commit on your branch, which moved computing the time derivative to after sending the next time to CCE and interpolating to the target.
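The mechanism described above can be illustrated with a toy sketch. Nothing in it is SpECTRE code: ToyElement, compute_time_derivative, interpolate_to_target and step are hypothetical names, the right-hand side du/dt = -u is arbitrary, the "interpolation" is just an average, and the value 0.123 merely stands in for whatever stale memory a release build would read.

```python
import math


class ToyElement:
    """Toy stand-in for an evolved element; not a SpECTRE class."""

    def __init__(self, size, debug):
        self.u = [1.0] * size
        # Debug build: poison the unset buffer with NaN so misuse is caught.
        # "Release" build: 0.123 stands in for arbitrary stale memory.
        fill = math.nan if debug else 0.123
        self.dt_u = [fill] * size


def compute_time_derivative(element):
    # Hypothetical right-hand side du/dt = -u, used only for illustration.
    element.dt_u = [-value for value in element.u]


def interpolate_to_target(element):
    # Toy "interpolation": average everything that is sent to the target.
    values = element.u + element.dt_u
    result = sum(values) / len(values)
    if math.isnan(result):
        raise FloatingPointError("NaN reached the interpolation")
    return result


def step(debug, derivative_first):
    """Run one toy step with the two actions in the chosen order."""
    element = ToyElement(4, debug)
    if derivative_first:
        compute_time_derivative(element)
    result = interpolate_to_target(element)
    if not derivative_first:
        compute_time_derivative(element)
    return result


print(step(debug=False, derivative_first=True))   # derivative set before use
print(step(debug=False, derivative_first=False))  # silently uses stale values
try:
    step(debug=True, derivative_first=False)
except FloatingPointError as error:
    print("debug build caught it:", error)
```

With the derivative computed first, both builds agree. With the order swapped, the toy release build returns a value contaminated by the stale buffer without any warning, while the toy debug build fails immediately, which is the pattern reported in this thread.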

Sizheng-Ma commented 2 years ago

Indeed, the FPE is gone after I switch the order back. Since the Bjorhus boundary condition is applied within ComputeTimeDerivative, all CCE calculations need to be done before it so that I can use the Weyl scalar psi0 to complete the boundary condition. In practice, the time derivatives of the GH variables are not used by the CCE component so their random values shouldn't matter.

kidder commented 2 years ago

So if the time derivatives of the GH variables are not used by CCE, why are they being passed to CCE?