parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

add runtime option to enable FPE traps #953

Open BenWibking opened 1 year ago

BenWibking commented 1 year ago

It would be nice to have a runtime option to enable FPE traps.

There is no portable way to do this, but all of the common cases should be covered by something like this: https://github.com/AMReX-Codes/amrex/blob/77d4d1fe5ce68a1e71095093ce856e061f24fc07/Src/Base/AMReX.cpp#L543
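A minimal sketch of what such a switch might look like, in the spirit of the linked AMReX code. `EnableFpeTraps` is a hypothetical name, not an existing Parthenon function. The sketch assumes only two platforms need handling: Linux with glibc (through the `feenableexcept` extension) and Apple Silicon (by setting the trap-enable bits in the `__fpcr` field of `fenv_t`). On anything else it simply reports failure.

```cpp
#include <cfenv>

// Hypothetical helper (not part of Parthenon): try to turn on hardware traps
// for invalid operations, division by zero, and overflow.
// Returns true if traps were enabled, false if the platform is unsupported
// or the request was rejected.
inline bool EnableFpeTraps() {
#if defined(__linux__) && defined(__GLIBC__)
  // feenableexcept is a glibc extension; it returns -1 on failure.
  return feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW) != -1;
#elif defined(__APPLE__) && defined(__aarch64__)
  // Apple Silicon has no feenableexcept; set the trap-enable bits in FPCR.
  std::fenv_t env;
  if (std::fegetenv(&env) != 0) return false;
  env.__fpcr |= __fpcr_trap_invalid | __fpcr_trap_divbyzero | __fpcr_trap_overflow;
  return std::fesetenv(&env) == 0;
#else
  return false;  // no known mechanism on this platform
#endif
}
```

A caller would invoke this only when the user asks for it, for example behind a command-line or input-file flag (the flag itself is left open here). Note that on Apple Silicon a trapped exception is reported as an illegal instruction (SIGILL) rather than SIGFPE.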

Yurlungur commented 1 year ago

:+1: this would be a useful feature to have.

BenWibking commented 1 year ago

I can probably create a PR for this tomorrow.

BenWibking commented 12 months ago

My prototype code doesn't work on my Apple Silicon device due to an FPE trap in an Apple-provided library:

* thread #1, queue = 'com.Metal.DeviceDispatch', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e220800)
    frame #0: 0x00000001ef7a7f78 AGXMetalG13X`AGX::SamplerStateEncoderGen4<AGX::G13::TextureFormatTable>::SamplerStateFields::SamplerStateFields(AGX::SamplerDescriptor const&) + 128
AGXMetalG13X`AGX::SamplerStateEncoderGen4<AGX::G13::TextureFormatTable>::SamplerStateFields::SamplerStateFields:
->  0x1ef7a7f78 <+128>: fmul   s0, s0, s2
    0x1ef7a7f7c <+132>: mov    w14, #0x44600000          ; =1147142144 
    0x1ef7a7f80 <+136>: fmov   s1, w14
    0x1ef7a7f84 <+140>: fmin   s1, s0, s1
(lldb) bt
* thread #1, queue = 'com.Metal.DeviceDispatch', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e220800)
  * frame #0: 0x00000001ef7a7f78 AGXMetalG13X`AGX::SamplerStateEncoderGen4<AGX::G13::TextureFormatTable>::SamplerStateFields::SamplerStateFields(AGX::SamplerDescriptor const&) + 128
    frame #1: 0x00000001ef797cc4 AGXMetalG13X`-[AGXG13XFamilyDevice initWithAcceleratorPort:simultaneousInstances:] + 2516
    frame #2: 0x00000001ef79bff8 AGXMetalG13X`-[AGXG13XDevice initWithAcceleratorPort:] + 52
    frame #3: 0x000000019358b358 Metal`-[MTLIOAccelService initWithAcceleratorPort:] + 368
    frame #4: 0x000000019358b1b8 Metal`+[MTLIOAccelService registerService:] + 128
    frame #5: 0x00000001892cd910 libdispatch.dylib`_dispatch_client_callout + 20
    frame #6: 0x00000001892dccc4 libdispatch.dylib`_dispatch_lane_barrier_sync_invoke_and_complete + 56
    frame #7: 0x00000001936d5dd4 Metal`MTLRegisterDevices + 284
    frame #8: 0x00000001935b4290 Metal`invocation function for block in MTLDeviceArrayInitialize() + 1300
    frame #9: 0x00000001892cd910 libdispatch.dylib`_dispatch_client_callout + 20
    frame #10: 0x00000001892cf14c libdispatch.dylib`_dispatch_once_callout + 32
    frame #11: 0x000000019358af2c Metal`MTLCopyAllDevices + 244
    frame #12: 0x0000000101e321c4 AppleMetalOpenGLRenderer`GLDDeviceRec::initWithDisplayMask(unsigned int) + 140
    frame #13: 0x0000000101e37a50 AppleMetalOpenGLRenderer`gldCreateDevice + 72
    frame #14: 0x00000001f11023b0 libGFXShared.dylib`gfxInitializeLibrary + 1900
    frame #15: 0x00000001f14a1ff8 OpenCL`___lldb_unnamed_symbol1212 + 440
    frame #16: 0x0000000189476dfc libsystem_pthread.dylib`__pthread_once_handler + 76
    frame #17: 0x00000001894a6ea0 libsystem_platform.dylib`_os_once_callout + 32
    frame #18: 0x0000000189476d94 libsystem_pthread.dylib`pthread_once + 100
    frame #19: 0x00000001f14a1dbc OpenCL`___lldb_unnamed_symbol1209 + 116
    frame #20: 0x00000001f146bdc4 OpenCL`clGetDeviceIDs + 216
    frame #21: 0x0000000100d4aff0 libhwloc.15.dylib`hwloc_opencl_discover + 220
    frame #22: 0x0000000100d2bc9c libhwloc.15.dylib`hwloc_discover_by_phase + 68
    frame #23: 0x0000000100d2b728 libhwloc.15.dylib`hwloc_topology_load + 1592
    frame #24: 0x000000010100f224 libopen-pal.40.dylib`opal_hwloc_base_get_topology + 4220
    frame #25: 0x0000000100f58d38 libopen-rte.40.dylib`orte_ess_base_proc_binding + 3468
    frame #26: 0x000000010093735c mca_ess_singleton.so`rte_init + 5036
    frame #27: 0x0000000100f8a9d0 libopen-rte.40.dylib`orte_init + 676
    frame #28: 0x0000000100ea0670 libmpi.40.dylib`ompi_mpi_init + 912
    frame #29: 0x0000000100e1d720 libmpi.40.dylib`MPI_Init + 120
    frame #30: 0x000000010031e290 athenaPK`parthenon::ParthenonManager::ParthenonInitEnv(this=0x000000016fdfe510, argc=3, argv=0x000000016fdfe808) at parthenon_manager.cpp:51:22 [opt]
    frame #31: 0x00000001000040b8 athenaPK`main(argc=<unavailable>, argv=<unavailable>) at main.cpp:111:30 [opt]
    frame #32: 0x0000000189101058 dyld`start + 2224

I assume this cannot be my fault...?
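The backtrace shows the trap firing underneath `MPI_Init` (frames 29 and 30), inside vendor libraries. One possible way to keep traps from firing in third-party initialization code is to suspend them for the duration of the call. The sketch below uses only the standard `feholdexcept` and `fesetenv` functions; `ScopedFpeTrapSuspend` is a hypothetical name. Because the floating-point environment is per-thread, this only covers code running on the calling thread, and whether it would be enough for the Metal case above is untested.

```cpp
#include <cfenv>

// Hypothetical RAII guard (not part of Parthenon): while an instance is alive,
// floating-point traps are switched off on the current thread. The previous
// environment, including any enabled traps, is restored on destruction.
class ScopedFpeTrapSuspend {
 public:
  // feholdexcept saves the environment, clears the status flags, and installs
  // non-stop (no trap) mode where the platform supports it.
  ScopedFpeTrapSuspend() { held_ = (std::feholdexcept(&saved_) == 0); }

  // fesetenv restores the saved environment without re-raising exceptions
  // that were flagged while the guard was active.
  ~ScopedFpeTrapSuspend() {
    if (held_) std::fesetenv(&saved_);
  }

  ScopedFpeTrapSuspend(const ScopedFpeTrapSuspend &) = delete;
  ScopedFpeTrapSuspend &operator=(const ScopedFpeTrapSuspend &) = delete;

 private:
  std::fenv_t saved_;
  bool held_ = false;
};

// Intended use (MPI call shown only as a comment to keep the sketch standalone):
//   {
//     ScopedFpeTrapSuspend guard;
//     // MPI_Init(&argc, &argv);
//   }  // traps are active again from here on
```

A simpler alternative with the same effect for this particular trace would be to enable traps only after `MPI_Init` has returned.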

BenWibking commented 12 months ago

I think this is a hwloc bug. A workaround is described here: https://kirija.github.io/blog-post-1/

The workaround gets past the hwloc/OpenMPI bug, but the process is then killed by signal 4 (illegal instruction) and I cannot examine the program state:

cycle=0 time=0.0000000000000000e+00 dt=4.7253248290644695e-01 zone-cycles/wsec_step=0.00e+00 wsec_total=2.63e+00 wsec_step=2.74e+02
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node Bens-MacBook-Pro exited on signal 4 (Illegal instruction: 4).
--------------------------------------------------------------------------
Process 44268 exited with status = 132 (0x00000084)
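In the log above, signal 4 is SIGILL, and exit status 132 is 128 + 4. One way to get at least some state out of the failing rank is to install a handler that prints a backtrace before the process dies. This is a rough sketch, not what Parthenon does: `FpeTrapHandler` and `InstallFpeTrapHandlers` are hypothetical names, and it assumes `<execinfo.h>` is available (true for glibc and macOS, not for every libc).

```cpp
#include <csignal>
#include <execinfo.h>
#include <unistd.h>

// Hypothetical handler: dump the call stack to stderr, then exit with the
// conventional 128 + signal status (132 for SIGILL, 136 for SIGFPE).
// Caveat: backtrace() is not strictly async-signal-safe, so this is a
// best-effort debugging aid rather than a robust mechanism.
extern "C" void FpeTrapHandler(int sig) {
  void *frames[64];
  const int depth = backtrace(frames, 64);
  backtrace_symbols_fd(frames, depth, STDERR_FILENO);
  // Returning would re-execute the faulting instruction, so leave immediately.
  _exit(128 + sig);
}

// Hypothetical installer: register the handler for both SIGFPE (typical on
// Linux) and SIGILL (how trapped exceptions show up on Apple Silicon).
// Returns true if both registrations succeeded.
inline bool InstallFpeTrapHandlers() {
  const bool fpe_ok = std::signal(SIGFPE, FpeTrapHandler) != SIG_ERR;
  const bool ill_ok = std::signal(SIGILL, FpeTrapHandler) != SIG_ERR;
  return fpe_ok && ill_ok;
}
```

Calling `InstallFpeTrapHandlers()` right after traps are enabled would make each rank print its own stack, independent of whether a debugger is attached to that rank.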