Segmentation fault -> EXC_BAD_ACCESS

devinvanelburg commented 2 years ago

A stack trace when running egs_brachy through a debugger gives the following error (occurs on the order of 10e6 histories into the simulation):

This version of LLDB has no plugin for the language "fortran90". Inspection of frame variables will be limited.
Process 82358 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfffffffd008c5250)
    frame #0: 0x00000001001172b6 egs_brachy`electr_ at egsnrc_debug.F:11929:72
   11926                                  END IF
   11927                                  aux = aux*(1+2*aux)*(fedep/(2-fedep))**2/6
   11928                                  tuss = fedep*eke*dedxmid*(1+aux)
-> 11929                                 ekei = E_array(lelkef+1,medium)
   11930                                  elkei = (lelkef + 1 - eke0(medium))/eke1(medium)
   11931                                  fedep = 1 - ekef/ekei
   11932                                  elktmp = 0.5*(elkei+elkef+0.25*fedep*fedep*(1+fede
Target 0: (egs_brachy) stopped.

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xfffffffd008c5250)
  * frame #0: 0x00000001001172b6 egs_brachy`electr_ at egsnrc_debug.F:11929:72
    frame #1: 0x0000000100154a37 egs_brachy`::egs_shower_() at egs_interface2.c:95:29
    frame #2: 0x00000001000e09d2 egs_brachy`EGS_AdvancedApplication::shower(this=0x00007ff7bfefe9f8) at egs_advanced_application.cpp:873:5
    frame #3: 0x0000000100018495 egs_brachy`EB_Application::simulateSingleShower(this=0x00007ff7bfefe9f8) at egs_brachy.cpp:2117:15
    frame #4: 0x0000000105ccffe2 libegspp.dylib`EGS_Application::runSimulation(this=0x00007ff7bfefe9f8) at egs_application.cpp:894:21
    frame #5: 0x0000000100017fd6 egs_brachy`EB_Application::runSimulation(this=0x00007ff7bfefe9f8) at egs_brachy.cpp:2038:35
    frame #6: 0x000000010001acaa egs_brachy`main(argc=3, argv=0x00007ff7bfeff420) at egs_brachy.cpp:2554:1
    frame #7: 0x000000010588551e dyld`start + 462

This occurs for Ir192 HDR simulations with huge patient CTs (1024 x 1024 x ~200), with the egsphant and egsinp files generated in eb_GUI. Google Drive link to the necessary input files (patient is anonymized): https://drive.google.com/file/d/1__mFJ0-W2YkqDA666_1qMyS2UfqztLCZ/view?usp=sharing

mchamberland commented 2 years ago

@rtownson @ftessier I've helped Devin with this. I can reproduce the crash on my personal laptop. Here is a screenshot of the stack trace:

[Screenshot: stack trace, "Screen Shot 2022-07-06 at 22 53 51"]

Looks like a bug in the electron routine. I cannot get gdb working on my laptop, so unfortunately, I can't inspect the variables. Happy to assist with troubleshooting.

mchamberland commented 2 years ago

@devinvanelburg What system are you running your simulations on, by the way? Linux?

devinvanelburg commented 2 years ago

Both systems I tried were Linux-based: one is a university cluster (Red Hat, I believe) and the other is Ubuntu through WSL.

ftessier commented 2 years ago

How many media are there in the simulation? I am wondering if this error simply stems from a $MXMED macro that is not set high enough. Trying the simple ideas first!

devinvanelburg commented 2 years ago

Should be seven in the patient CT, then the seed materials if that matters.

Patient: bladder, rectum, urethra and applicator material, then the rest is the F tissue ramp in eb_GUI, so air, soft tissue, cortical bone.

mchamberland commented 2 years ago

@ftessier oooh! My egs++ $MXMED is set to 50, but the one in egsnrc.macros is set to 10. It's been so long since I've dealt with this, I totally forgot about it. Let me try again with the egsnrc.macros increased to 50.

mchamberland commented 2 years ago

@ftessier nope, that wasn't it: there are only 8 media defined in the simulation.

devinvanelburg commented 2 years ago

I also tried setting MXMED to 50 and recompiled egs_brachy. Same error.
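
For anyone repeating this check, a minimal Python sketch of comparing the media count against $MXMED is shown below. The sample macro text is only an illustrative excerpt that assumes the `REPLACE {$NAME} WITH {value}` style of egsnrc.macros; the helper names `macro_value` and `enough_media` are hypothetical, and the values 10 and 2000 are the ones quoted in this thread.

```python
import re

# Illustrative excerpt only: assumed to mirror the REPLACE macro style used in
# egsnrc.macros; the real file may differ in spacing and trailing comments.
SAMPLE_MACROS = """
REPLACE {$MXMED} WITH {10}
REPLACE {$MXREG} WITH {2000}
"""


def macro_value(text, name):
    """Return the integer value of a REPLACE {$name} WITH {value} macro, or None."""
    pattern = r"REPLACE\s*\{\$" + re.escape(name) + r"\}\s*WITH\s*\{(\d+)\}"
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


def enough_media(text, n_media):
    """True if $MXMED is defined in the text and is at least n_media."""
    mxmed = macro_value(text, "MXMED")
    return mxmed is not None and mxmed >= n_media


# 8 media are defined in the simulation discussed here.
print(enough_media(SAMPLE_MACROS, 8))
```

With 8 media and a limit of 10, the check passes, which is consistent with $MXMED not being the cause here.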

rfmthoms commented 2 years ago

Is the issue related to the enormous size of the phantom + memory? Are you only scoring dose via collision kerma approximation (default) or other quantities (e.g. interaction scoring dose) as well? You might try default scoring only and see if it works. Also, you could try cropping your egsphant down to a much smaller size and seeing if it runs (eb_gui may allow cropping, too).

mchamberland commented 2 years ago

@rfmthoms it does use an enormous amount of memory (24 GB on my laptop!), but I wouldn't expect it to crash in this way during the simulation if it ran out of memory. Yes, this is only scoring collision kerma. Most options are pretty standard. It's running in egs_brachy's superposition mode, but I wouldn't expect it to cause an issue to show up in the electron transport routine...

ftessier commented 2 years ago

Thanks for checking. I don't have egs_brachy running here, so can't help at the moment. It would be useful to know if this is an out-of-bounds access, and whether it is on lelkef+1 or medium. Since the lelkef index has to do with energy interpolation, also check that energy thresholds and cutoff AE, UE, ECUT are properly set; in particular ensure that UE is strictly above the source energy. Beyond that I need to inspect in the debugger...

mchamberland commented 2 years ago

@ftessier Yes, I actually suspect AE/UE shenanigans... Let me look closer at the input file.

Also, I thought there were warnings added when something was not consistent in those parameters or am I misremembering?

ftessier commented 2 years ago

Yes, but there can be a trap: e.g., the energy test falls on the good side of a floating-point comparison for the warning, then on the wrong side for the array index? I am wildly speculating here! 😄
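
As a purely illustrative sketch of that kind of trap (not a claim about what EGSnrc actually does), the Python snippet below shows a comparison that passes in single precision but fails in double precision. The value `energy` and the helper `to_single` are hypothetical; only UE = 2.011 comes from this thread.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


UE = 2.011             # upper energy limit quoted in this thread (MeV)
energy = UE + 1e-10    # hypothetical energy just above the limit

# A range check done in single precision cannot see the difference...
passes_single_check = to_single(energy) <= to_single(UE)
# ...while the same comparison in double precision does.
exceeds_in_double = energy > UE

print(passes_single_check, exceeds_in_double)
```

If a sanity check and an index calculation were carried out at different precisions or through different expressions, they could disagree in this way for energies very close to a limit.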

devinvanelburg commented 2 years ago

The AE and UE are both set to 2.010. That should be fine considering the 192Ir max energy is 1.377 MeV (+0.511 MeV for electrons), right?

However, I noticed that the MC transport parameters were set to the low energy default, so Global ECUT and Source ECUT were set to 1.512. Could this be the cause of the problem? I have now switched to the high energy default (2.012 ECUTs); we'll see if this crashes too.

ftessier commented 2 years ago

Unlikely, given that ECUT is normally reset to AE if ECUT < AE. I am more worried about AE set the same as UE, in terms of the interpolation tables that EGSnrc builds...
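
To make the interpolation-table concern concrete, here is a minimal Python sketch. The index relation `index = eke1*ln(E) + eke0` is inferred by inverting the line `elkei = (lelkef + 1 - eke0(medium))/eke1(medium)` from the trace above; how `eke0` and `eke1` are derived from AE and UE, the table length `N_BINS`, and the 1.0 MeV test energy are all assumptions made for illustration, not the actual EGSnrc or PEGS implementation.

```python
import math

REST_MASS = 0.511   # MeV, electron rest energy as used in this thread
N_BINS = 150        # hypothetical table length


def log_grid(ke_min, ke_max, n_bins):
    """Coefficients (eke0, eke1) mapping ln(ke_min) -> 1 and ln(ke_max) -> n_bins."""
    eke1 = (n_bins - 1) / math.log(ke_max / ke_min)
    eke0 = 1.0 - eke1 * math.log(ke_min)
    return eke0, eke1


def table_index(ke, eke0, eke1):
    """Table index for kinetic energy ke (MeV), truncated toward zero like Fortran INT()."""
    return int(eke1 * math.log(ke) + eke0)


def index_for(ae, ue, ke):
    """Index for kinetic energy ke in a table spanning total energies AE..UE."""
    eke0, eke1 = log_grid(ae - REST_MASS, ue - REST_MASS, N_BINS)
    return table_index(ke, eke0, eke1)


ke = 1.0  # MeV, hypothetical electron kinetic energy
wide = index_for(0.512, 2.012, ke)     # AE, UE spanning a broad range
narrow = index_for(2.010, 2.011, ke)   # AE, UE almost equal

# The +1 access seen in the trace needs 1 <= index < N_BINS.
print(wide, 1 <= wide < N_BINS)
print(narrow, 1 <= narrow < N_BINS)
```

In this toy model, a table that spans only 1 keV makes the slope `eke1` very large, so any energy slightly outside the AE..UE range maps to an index far outside the table. That would be consistent with an out-of-bounds read at `E_array(lelkef+1,medium)`, but it remains a sketch under the stated assumptions.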

devinvanelburg commented 2 years ago

Oh oops, I should clarify the settings are:

AE = 2.010 UE = 2.011

mxxo commented 2 years ago

> How many media are there in the simulation? I am wondering if this error simply stems from a $MXMED macro that is not set high enough. Trying the simple ideas first!

In the same vein, has $MXREG been adjusted accordingly? I don't know if it's the cause of this error but the default is fairly low (2000).

https://github.com/nrc-cnrc/EGSnrc/blob/a6fc389c6465c949645380976f38a013a639759c/HEN_HOUSE/src/egsnrc.macros#L180

mchamberland commented 2 years ago

@mxxo that's an interesting one. I've run hundreds of egs++ simulations with 250,000+ regions and was never aware of this $MXREG, so I've always used this default.

Worth looking into, though!

ftessier commented 2 years ago

Me too (15e6 regions recently). I can only guess that $MXREG is only effective in mortran applications? But now I am curious about that too!

devinvanelburg commented 2 years ago

> Unlikely, given that ECUT is normally reset to AE if ECUT < AE. I am more worried about AE set the same as UE, in terms of the interpolation tables that EGSnrc builds...

Hmm, I switched the Global and Source ECUTs to the high energy default (2.012) and it appears to be working. Currently running the simulation and it's completed 2/10 of the 10e7 batches so far, whereas before it was crashing near the end of the first batch.

mchamberland commented 2 years ago

@ftessier 🥇for you. Setting AE and UE to more reasonable values (0.512, 2.012) seems to resolve the issue. Thanks for the help!

ftessier commented 2 years ago

Thanks, happy that we figured this out!

nrc-cnrc / EGSnrc

Segmentation fault -> EXC_BAD_ACCESS #892