Open Sizheng-Ma opened 2 years ago
Using 3 nodes, the run dies after 25M. The error message is as follows:
Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946d58] Address for addr2line: 0x2946d58
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946b16] Address for addr2line: 0x2946b16
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8efe6) [0x7f317bc44fe6] Address for addr2line: 0x8efe6
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8f031) [0x7f317bc45031] Address for addr2line: 0x8f031
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x1d2ca0b] Address for addr2line: 0x1d2ca0b
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE26reg_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEiv+0) [0x1f723f0] Address for addr2line: 0x1f723f0
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE28_call_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEvPvS5C_+0x193) [0x1f72a33] Address for addr2line: 0x1f72a33
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(CkDeliverMessageFree+0x21) [0x4930481] Address for addr2line: 0x4930481
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x41) [0x4955f21] Address for addr2line: 0x4955f21
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_Z15_processHandlerPvP11CkCoreState+0x359) [0x4937e49] Address for addr2line: 0x4937e49
End shortened stack trace.
Node: 2 Proc: 46
Line: 17 of /home/sma/spectre/src/Parallel/InitializationFunctions.cpp
Function: auto setup_error_handling()::(anonymous class)::operator()() const
Terminated due to an uncaught exception: vector::_M_default_append
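Each frame in the trace ends with an "Address for addr2line" value. As a rough sketch (not part of SpECTRE), the snippet below pulls the binary path and address out of two frames copied from the trace and builds the matching `addr2line` invocations. The helper name `addr2line_commands` and the regular expression are made up for illustration. Whether `addr2line` resolves anything useful depends on the binary having debug symbols, which a Release build may lack.

```python
import re

# Two frames copied verbatim from the shortened stack trace above.
trace = """\
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946d58] Address for addr2line: 0x2946d58
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8efe6) [0x7f317bc44fe6] Address for addr2line: 0x8efe6
"""

# Hypothetical pattern: "<binary>(<symbol+offset>) [<runtime address>] Address for addr2line: <address>"
FRAME = re.compile(
    r"^(?P<binary>[^\s(]+)\(.*\)\s+\[0x[0-9a-f]+\]\s+"
    r"Address for addr2line: (?P<addr>0x[0-9a-f]+)$"
)


def addr2line_commands(text):
    """Return one addr2line argument list per recognised stack frame."""
    commands = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            # -f prints the function name, -C demangles C++ symbols, -e names the binary.
            commands.append(
                ["addr2line", "-f", "-C", "-e", match["binary"], match["addr"]]
            )
    return commands


commands = addr2line_commands(trace)
for command in commands:
    print(" ".join(command))
```

The commands are only printed here; running them (for example with `subprocess.run`) needs the same binaries to be present at those paths.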
The GR domain you are using seems to be overkill in resolution (44 radial elements by 96 angular elements, each with 512 grid points is over 2 million grid points...)
Also have you run the problem with an executable compiled with build type Debug instead of Release?
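For reference, the grid-point estimate above works out as follows. The element counts and points per element are the ones quoted in the comment. Reading 512 as 8 x 8 x 8, i.e. `InitialGridPoints: [8, 8]`, is an assumption on my part.

```python
# Numbers quoted in the comment above.
radial_elements = 44
angular_elements = 96
points_per_element = 512  # equals 8**3; assumed to come from InitialGridPoints: [8, 8]

total_elements = radial_elements * angular_elements
total_grid_points = total_elements * points_per_element

print(f"{total_elements} elements, {total_grid_points:,} grid points")
# prints "4224 elements, 2,162,688 grid points"
```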
I do need more than 2 million grid points; otherwise the ringdown is pretty noisy.
The run died with a floating point exception after I switched to debug mode:
############ ERROR ############
Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/GH1/EvolveGhCceKerrSchild() [0xb05978d] Address for addr2line: 0xb05978d
/usr/lib64/libc.so.6(+0x35670) [0x7f945c726670] Address for addr2line: 0x35670
/usr/local/openblas/0.2.18/lib/libopenblas.so.0(dgemm_kernel+0x19b8) [0x7f945dc1e5b8] Address for addr2line: 0x2dd5b8
End shortened stack trace.
Node: 2 Proc: 67
Line: 23 of /home/sma/spectre/src/Utilities/ErrorHandling/FloatingPointExceptions.cpp
Function: void (anonymous namespace)::fpe_signal_handler(int)
Floating point exception!
############ ERROR ############
The error message with the address sanitizer on is:
==14984==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==24349==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==6129==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
srun: error: wheeler085: task 0: Aborted
==14984==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
srun: error: wheeler087: task 2: Aborted
srun: error: wheeler086: task 1: Aborted
==24349==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
==6129==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
failed while trying to map 0xdfff0001000 bytes
That's almost 14 terabytes.
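A quick check of that figure, converting the size from the AddressSanitizer message into binary (TiB) and decimal (TB) units:

```python
requested_bytes = 0xdfff0001000  # size reported by AddressSanitizer

tib = requested_bytes / 2**40   # binary terabytes (TiB)
tb = requested_bytes / 10**12   # decimal terabytes (TB)

print(f"{requested_bytes} bytes = {tib:.2f} TiB = {tb:.2f} TB")
# prints "15392894357504 bytes = 14.00 TiB = 15.39 TB"
```

The byte count matches the decimal value in the log. The request is for virtual address space (shadow memory), and the log itself points at a possible `ulimit -v` limit.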
Something is weird here. Every test seems to give a completely different error.
Does the run still fail if you reduce the number of elements enough that you can run on one node? It may not be a useful run, but single-node runs are easier to debug.
Yes, it still fails after I reduce the resolution and run it on a single node.
Evolution:
  InitialTime: 0.0
  InitialTimeStep: 0.002
  TimeStepper:
    AdamsBashforthN:
      Order: 3
DomainCreator:
  Shell:
    InnerRadius: 1.9
    OuterRadius: 200.
    InitialGridPoints: [2,2]
    InitialRefinement: 0
    UseEquiangularMap: true
    AspectRatio: 1.0
    WhichWedges: All
    TimeDependence: None
    RadialDistribution: [Logarithmic,Logarithmic]
    RadialPartitioning: [3.0]
    BoundaryConditions:
      OuterBoundary:
        ConstraintPreservingBjorhus:
          Type: ConstraintPreservingPhysical
      InnerBoundary:
        Outflow
AnalyticSolution:
  KerrSchild:
    Mass: 1.0
    Spin: [0.0, 0.0, 0.0]
    Center: [0.0, 0.0, 0.0]
EvolutionSystem:
  GeneralizedHarmonic:
    # The parameter choices here come from our experience with the Spectral
    # Einstein Code (SpEC). They should be suitable for evolutions of a
    # perturbation of a Kerr-Schild black hole.
    DhGaugeParameters:
      RollOnStartTime: 100000.0
      RollOnTimeWindow: 100.0
      SpatialDecayWidth: 50.0
      Amplitudes: [1.0, 1.0, 1.0]
      Exponents: [4, 4, 4]
    DampingFunctionGamma0:
      GaussianPlusConstant:
        Constant: 0.001
        Amplitude: 3.0
        Width: 11.313708499
        Center: [0.0, 0.0, 0.0]
    DampingFunctionGamma1:
      GaussianPlusConstant:
        Constant: -1.0
        Amplitude: 0.0
        Width: 11.313708499
        Center: [0.0, 0.0, 0.0]
    DampingFunctionGamma2:
      GaussianPlusConstant:
        Constant: 0.001
        Amplitude: 1.0
        Width: 11.313708499
        Center: [0.0, 0.0, 0.0]
SpatialDiscretization:
  DiscontinuousGalerkin:
    Formulation: StrongInertial
    Quadrature: GaussLobatto
  BoundaryCorrection:
    UpwindPenalty:
EventsAndTriggers:
  ? Slabs:
      EvenlySpaced:
        Interval: 2000
        Offset: 0
  : - ObserveErrorNorms:
        SubfileName: Errors
  ? Slabs:
      EvenlySpaced:
        Interval: 300000
        Offset: 0
  : - ObserveFields:
        SubfileName: VolumeData
        VariablesToObserve:
          #- SpacetimeMetric
          #- Pi
          #- Phi
          - PointwiseL2Norm(GaugeConstraint)
          #- PointwiseL2Norm(ThreeIndexConstraint)
          #- PointwiseL2Norm(FourIndexConstraint)
        InterpolateToMesh: None
        CoordinatesFloatingPointType: Double
        FloatingPointTypes: [Double]
  #? Slabs:
  #    EvenlySpaced:
  #      Interval: 5
  #      Offset: 2
  #: - AhA
  ? Slabs:
      Specified:
        Values: [30000000000000000]
  : - Completion
Observers:
  VolumeFileName: "GhKerrSchildVolume"
  ReductionFileName: "GhKerrSchildReductions"
ApparentHorizons:
  AhA:
    InitialGuess:
      Lmax: 12
      Radius: 2.0
      Center: [0.0, 0.0, 0.0]
    FastFlow:
      Flow: Fast
      Alpha: 1.0
      Beta: 0.5
      AbsTol: 1e-12
      TruncationTol: 1e-2
      DivergenceTol: 1.2
      DivergenceIter: 5
      MaxIts: 100
    Verbosity: Verbose
Cce:
  Evolution:
    TimeStepper:
      AdamsBashforthN:
        Order: 3
    InitialSlabSize: 0.002
    StepChoosers:
      - Constant: 1.0
      - Increase:
          Factor: 2
      - ErrorControl(SwshVars):
          AbsoluteTolerance: 1e-8
          RelativeTolerance: 1e-6
          MaxFactor: 2
          MinFactor: 0.25
          SafetyFactor: 0.9
      - ErrorControl(CoordVars):
          AbsoluteTolerance: 1e-8
          RelativeTolerance: 1e-7
          MaxFactor: 2
          MinFactor: 0.25
          SafetyFactor: 0.9
    StepController:
      BinaryFraction
  LMax: 14
  ExtractionRadius: 198
  NumberOfRadialPoints: 14
  ObservationLMax: 4
  InitializeJ:
    InverseCubic
  StartTime: 0.0
  Filtering:
    RadialFilterHalfPower: 24
    RadialFilterAlpha: 35.0
    FilterLMax: 12
  GhInterfaceManager:
    #GhLocalTimeStepping:
    #  AdamsBashforthOrder: 3
    GhLockstep:
  ScriInterpOrder: 5
  ScriOutputDensity: 1
InterpolationTargets:
  CceWorldtubeTarget:
    Lmax: 14
    Center: [0.0, 0.0, 0.0]
    DimensionlessSpin: [0.0, 0.0, 0.0]
    Mass: 99
    ThetaVariesFastest: false
PhaseChangeAndTriggers:
  - - Slabs:
        Specified:
          Values: [5,55,105]
    - - VisitAndReturn(LoadBalancing)
  - - Slabs:
        EvenlySpaced:
          Interval: 100
          Offset: 0
    - - CheckpointAndExitAfterWallclock:
          WallclockHours: 23.4
Filtering:
  ExpFilter0:
    Alpha: 36.0
    HalfPower: 150
    DisableForDebugging: true
The submission script is
#!/bin/bash -
#SBATCH -o spectre.out
#SBATCH -e spectre.err
#SBATCH --ntasks-per-node 24
#SBATCH -A sxs
#SBATCH --no-requeue
#SBATCH -J tt_ccm_GH1
#SBATCH --nodes 1
#SBATCH -t 24:00:00
#SBATCH --mem=50GB
# Distributed under the MIT License.
# See LICENSE.txt for details.
# To run a job on Wheeler:
# - Set the -J, --nodes, and -t options above, which correspond to job name,
# number of nodes, and wall time limit in HH:MM:SS, respectively.
# - Set the build directory, run directory, executable name,
# and input file below. The input file path is relative to ${RUN_DIR}.
#
# NOTE: The executable will not be copied from the build directory, so if you
# update your build directory this file will use the updated executable.
#
# Optionally, if you need more control over how SpECTRE is launched on
# Wheeler you can edit the launch command at the end of this file directly.
#
# To submit the script to the queue run:
# sbatch Wheeler.sh
export SPECTRE_BUILD_DIR=/home/sma/spectre/build/
export RUN_DIR=${PWD}/Run
export SPECTRE_EXECUTABLE=${PWD}/EvolveGhCceKerrSchild
export SPECTRE_INPUT_FILE=${PWD}/KerrSchildWithCce.yaml
############################################################################
# Set desired permissions for files created with this script
umask 0022
export PATH=${SPECTRE_BUILD_DIR}/bin:$PATH
mkdir ${RUN_DIR}
cd ${RUN_DIR}
# The 23 is there because Charm++ uses one thread per node for communication
srun -n ${SLURM_JOB_NUM_NODES} -c 24 \
${SPECTRE_EXECUTABLE} ++ppn 23 \
--input-file ${SPECTRE_INPUT_FILE}
OK, good. If you are running on one node, you should be able to log into that node, attach gdb to the process (gdb -p <PID>), and let it run until it fails. Don't forget to give the gdb command catch throw before continuing the execution so you will see C++ exceptions.
I don't know how familiar you are with gdb, but you shouldn't need anything esoteric for this. Googling for a basic gdb guide should get you what you need if you're not familiar.
The run failed immediately with the address sanitizer on. I didn't have a chance to try gdb.
What if you try it with the address sanitizer off?
The run fails immediately if I use debug mode. The error message is as follows:
############ ERROR ############
Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/GH1/EvolveGhCceKerrSchild() [0xb05978d] Address for addr2line: 0xb05978d
/usr/lib64/libc.so.6(+0x35670) [0x7f945c726670] Address for addr2line: 0x35670
/usr/local/openblas/0.2.18/lib/libopenblas.so.0(dgemm_kernel+0x19b8) [0x7f945dc1e5b8] Address for addr2line: 0x2dd5b8
End shortened stack trace.
Node: 2 Proc: 67
Line: 23 of /home/sma/spectre/src/Utilities/ErrorHandling/FloatingPointExceptions.cpp
Function: void (anonymous namespace)::fpe_signal_handler(int)
Floating point exception!
############ ERROR ############
The run starts normally if I use release mode, but it fails after 25M. The error message is:
Shortened stack trace is:
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946d58] Address for addr2line: 0x2946d58
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x2946b16] Address for addr2line: 0x2946b16
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8efe6) [0x7f317bc44fe6] Address for addr2line: 0x8efe6
/usr/local/gcc/7.3.0/lib64/libstdc++.so.6(+0x8f031) [0x7f317bc45031] Address for addr2line: 0x8f031
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild() [0x1d2ca0b] Address for addr2line: 0x1d2ca0b
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE26reg_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEiv+0) [0x1f723f0] Address for addr2line: 0x1f723f0
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN22CkIndex_AlgorithmArrayI14DgElementArrayI17EvolutionMetavarsIN19GeneralizedHarmonic9Solutions9WrappedGrIN2gr9Solutions10KerrSchildEEES8_EN7brigand4listIJN8Parallel12PhaseActionsIN27GeneralizedHarmonicDefaults5PhaseELSF_0ENSB_IJN7Actions12SetupDataBoxEN14Initialization7Actions15TimeAndTimeStepIS9_EEN9evolution2dg14Initialization6DomainILm3ELb0EEENSJ_21NonconservativeSystemINS2_6SystemILm3EEEEENSM_14Initialization7Actions12SetVariablesIN6domain4Tags11CoordinatesILm3EN5Frame7LogicalEEEEENSJ_18TimeStepperHistoryIS9_EENSJ_17InitializeCcmTagsIS9_EENSJ_22InitializeCcmOtherTagsIS9_EENS2_7Actions30InitializeGhAnd3Plus1VariablesILm3EEENSJ_14AddComputeTagsINSB_IJNSZ_25MinimumGridSpacingComputeILm3ENS11_8InertialEEENS2_4Tags33ComputeLargestCharacteristicSpeedILm3ES1G_EENSZ_20SizeOfElementComputeILm3EEENSM_4Tags15AnalyticComputeILm3EN4Tags16AnalyticSolutionIS8_EENSB_IJNS5_4Tags15SpacetimeMetricILm3ES1G_10DataVectorEENS1I_2PiILm3ES1G_EENS1I_3PhiILm3ES1G_EEEEEEEEEEEENSO_7MortarsILm3EST_EEN5intrp7Actions23ElementInitInterpPointsINS26_4Tags15InterpPointInfoIS9_EEEENSJ_30RemoveOptionsAndTerminatePhaseEEEEEENSD_ISF_LSF_3ENSB_IJNS2_6gauges7Actions24InitializeDampedHarmonicILm3ELb1EEENS1B_21InitializeConstraintsILm3EEENSC_7Actions14TerminatePhaseEEEEEENSD_ISF_LSF_4ENSB_IJN9SelfStart7Actions10InitializeIST_EENSG_5LabelINS2Q_6detail10PhaseStartEEENS2R_18CheckForCompletionINS2V_8PhaseEndEST_EENSG_11AdvanceTimeENS2R_21CheckForOrderIncreaseEN3Cce7Actions17SendNextTimeToCceINS9_18CceWorldtubeTargetEEENS27_19InterpolateToTargetIS36_EENS1B_14ReceiveCCEDataIS9_EENSN_7Actions21ComputeTimeDerivativeIS9_EENS3C_24ApplyBoundaryCorrectionsIS9_EENSG_21RecordTimeStepperDataI10NoSuchTypeEENSG_7UpdateUIS3I_EEN2dg7Actions6FilterIN7Filters11ExponentialILm0EEES20_EENSG_4GotoIS2W_EENS2U_IS2Z_EENS2R_7CleanupES31_S2N_EEEEENSD_ISF_LSF_5ENSB_IJNSB_IJN9observers7Actions27RegisterEventsWithObserversENS27_31RegisterElementWithInterpolatorEEEES2N_EEEEENSD_ISF
_LSF_8ENSB_IJNSG_20RunEventsAndTriggersENSG_14ChangeSlabSizeENSB_IJS37_S39_S3B_S3E_S3G_NSB_IJS3J_S3L_EEES3S_EEES31_N12PhaseControl7Actions18ExecutePhaseChangeINSB_IJNS4A_10Registrars14VisitAndReturnI31GeneralizedHarmonicTemplateBaseIS9_ELSF_6EEENS4D_31CheckpointAndExitAfterWallclockIS4G_EEEEEEEEEEEEEEEE9ElementIdILm3EEE28_call_receive_data_marshall8INSN_4Tags36BoundaryCorrectionAndGhostCellsInboxILm3EEESt4pairIS4X_I9DirectionILm3EES4R_ESt5tupleIJ4MeshILm2EESt8optionalISt6vectorIdSaIdEEES58_10TimeStepIdEEEEEvPvS5C_+0x193) [0x1f72a33] Address for addr2line: 0x1f72a33
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(CkDeliverMessageFree+0x21) [0x4930481] Address for addr2line: 0x4930481
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x41) [0x4955f21] Address for addr2line: 0x4955f21
/panfs/ds09/sxs/sma/restart/test/EvolveGhCceKerrSchild(_Z15_processHandlerPvP11CkCoreState+0x359) [0x4937e49] Address for addr2line: 0x4937e49
End shortened stack trace.
Node: 2 Proc: 46
Line: 17 of /home/sma/spectre/src/Parallel/InitializationFunctions.cpp
Function: auto setup_error_handling()::(anonymous class)::operator()() const
Terminated due to an uncaught exception: vector::_M_default_append
The same run proceeds normally on Frontera without failure
Can you do a run on wheeler using EvolveGhKerrSchild with the same input file (commenting out the CCE parts)?
The EvolveGhKerrSchild one goes normally (debug mode).
The floating point exceptions are being generated by interpolating NaNs for the time derivatives of the GH variables to the world tube. This will happen in debug mode (and not release mode) if a DataMesh/Variables/Tensor is allocated but not initialized. (In debug mode the allocation initializes the DataMesh to NaN in order to catch this problem. In release mode the executable will use whatever values happen to be in memory, leading to random behavior.) My suspicion is that the culprit is the "change the order of step_actions" commit on your branch, which moved computing the time derivative to after sending the next time to CCE and interpolating to the target.
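To illustrate the mechanism, here is a minimal Python sketch (not SpECTRE code) of how a NaN-filled buffer poisons an interpolation. The Lagrange interpolation helper and the sample data are invented for the example. Plain Python does not trap floating point exceptions, so this only shows the NaN propagating through the weighted sum; in SpECTRE's debug build that is where the FPE handler fires.

```python
import math


def lagrange_weights(nodes, x):
    """Lagrange interpolation weights for `nodes`, evaluated at `x`."""
    weights = []
    for j, xj in enumerate(nodes):
        w = 1.0
        for k, xk in enumerate(nodes):
            if k != j:
                w *= (x - xk) / (xj - xk)
        weights.append(w)
    return weights


def interpolate(nodes, values, x):
    """Interpolate `values` given at `nodes` to the point `x`."""
    return sum(w * v for w, v in zip(lagrange_weights(nodes, x), values))


nodes = [0.0, 1.0, 2.0]
initialized = [1.0, 3.0, 5.0]          # samples of f(x) = 2x + 1
debug_fill = [math.nan] * len(nodes)   # stands in for an allocated but never-written buffer

good = interpolate(nodes, initialized, 0.5)
bad = interpolate(nodes, debug_fill, 0.5)

print(good, bad)
```

With initialized data the interpolant reproduces f(0.5) = 2. With the NaN fill every term of the sum is NaN, so the result is NaN regardless of the weights.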
Indeed, the FPE is gone after I switch the order back. Since the Bjorhus boundary condition is applied within ComputeTimeDerivative, all CCE calculations need to be done before it so that I can use the Weyl scalar psi0 to complete the boundary condition. In practice, the time derivatives of the GH variables are not used by the CCE component, so their random values shouldn't matter.
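The ordering conflict can be sketched as a read-before-write check over an action list. This is a toy model, not the SpECTRE scheduler: the action names come from the stack trace, but the read/write sets, the quantity names `dt_gh_vars` and `psi0`, and the helper `reads_before_writes` are assumptions made for illustration.

```python
# Hypothetical read/write sets; the real actions touch many more quantities.
ACTIONS = {
    "SendNextTimeToCce": {"reads": set(), "writes": set()},
    "InterpolateToTarget": {"reads": {"dt_gh_vars"}, "writes": set()},
    "ReceiveCCEData": {"reads": set(), "writes": {"psi0"}},
    "ComputeTimeDerivative": {"reads": {"psi0"}, "writes": {"dt_gh_vars"}},
}


def reads_before_writes(order, actions):
    """List (action, quantity) pairs where a quantity is read before any action wrote it."""
    written = set()
    problems = []
    for name in order:
        for quantity in sorted(actions[name]["reads"] - written):
            problems.append((name, quantity))
        written |= actions[name]["writes"]
    return problems


# Order after the reordering commit: CCE actions first, time derivative last.
reordered = ["SendNextTimeToCce", "InterpolateToTarget",
             "ReceiveCCEData", "ComputeTimeDerivative"]
# Order with the time derivative computed before the CCE actions.
time_derivative_first = ["ComputeTimeDerivative", "SendNextTimeToCce",
                         "InterpolateToTarget", "ReceiveCCEData"]

print(reads_before_writes(reordered, ACTIONS))
print(reads_before_writes(time_derivative_first, ACTIONS))
```

Under these assumed dependencies, each ordering reads one quantity before it is written on the first pass: the reordered list reads the time derivatives too early, and the other reads psi0 too early.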
So if the time derivatives of the GH variables are not used by CCE, why are they being passed to CCE?
Bug reports:
Expected behavior:
Current behavior:
I'm trying to run the CCE-GH executable #2323 on wheeler. The GH domain is as follows
and the CCE grid is
Running on 4 nodes, the system advances at least one time step per second. However, if I add one more radial or angular grid point for each element (namely InitialGridPoints: [9,8] or InitialGridPoints: [8,9]), the system won't take even a single time step within five minutes. This issue doesn't happen all the time, but pretty frequently, and it goes away when I request fewer nodes. The CCE grid doesn't have an impact on this issue.
Environment:
Using all modules in wheeler_clang.sh
Detailed discussion: