Open devinvanelburg opened 2 years ago
@rtownson @ftessier I've helped Devin with this. I can reproduce the crash on my personal laptop. Here is a screenshot of the stack trace:
Looks like a bug in the electron routine. I cannot get gdb working on my laptop, so unfortunately, I can't inspect the variables. Happy to assist with troubleshooting.
@devinvanelburg What system are you running your simulations on, by the way? Linux?
Both systems I tried were Linux based. One a university cluster (Redhat I believe) and the other Ubuntu through WSL.
How many media are there in the simulation? I am wondering if this error simply stems from a `$MXMED` macro that is not set high enough. Trying the simple ideas first!
Should be seven in the patient CT, then the seed materials if that matters.
Patient: bladder, rectum, urethra and applicator material, then the rest is the F tissue ramp in eb_GUI, so air, soft tissue, cortical bone.
@ftessier oooh! My egs++ $MXMED is set to 50, but the one in egsnrc.macros is set to 10. It's been so long since I've dealt with this, I totally forgot about it. Let me try again with the egsnrc.macros increased to 50.
@ftessier nope, that wasn't it: there are only 8 media defined in the simulation.
I also tried setting MXMED to 50 and recompiled egs_brachy. Same error.
Is the issue related to the enormous size of the phantom + memory? Are you only scoring dose via collision kerma approximation (default) or other quantities (e.g. interaction scoring dose) as well? You might only do default scoring and see if it works. Also, you could try cropping your egsphant down to a much smaller size and seeing if it runs (eb_gui may allow cropping, too)
@rfmthoms it does use an enormous amount of memory (24 GB on my laptop!), but I wouldn't expect it to crash in this way during the simulation if it ran out of memory. Yes, this is only scoring collision kerma. Most options are pretty standard. It's running in egs_brachy's superposition mode, but I wouldn't expect it to cause an issue to show up in the electron transport routine...
Thanks for checking. I don't have egs_brachy running here, so I can't help at the moment. It would be useful to know if this is an out-of-bounds access, and whether it is on `lelkef+1` or `medium`. Since the `lelkef` index has to do with energy interpolation, also check that the energy thresholds and cutoffs `AE`, `UE`, `ECUT` are properly set; in particular ensure that `UE` is strictly above the source energy. Beyond that I need to inspect in the debugger...
@ftessier Yes, I actually suspect AE/UE shenanigans... Let me look closer at the input file.
Also, I thought there were warnings added when something was not consistent in those parameters or am I misremembering?
Yes, but there can be a trap: e.g., a floating-point energy comparison falls on the passing side for the warning check, but on the wrong side when the array index is computed. I am wildly speculating here! 😄
The AE and UE are both set to 2.010. That should be fine considering the 192Ir max energy is 1.377 MeV (+0.511 MeV for electrons), right?
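The arithmetic in the comment above can be checked with a short Python sketch. The helper names are hypothetical; the numbers are the ones quoted in the thread (1.377 MeV maximum 192Ir energy, 0.511 MeV electron rest mass, UE of 2.010 MeV).

```python
ELECTRON_REST_MASS_MEV = 0.511


def total_energy(kinetic_mev):
    """Total electron energy (kinetic plus rest mass), in MeV."""
    return kinetic_mev + ELECTRON_REST_MASS_MEV


def ue_covers_source(ue_total_mev, max_source_kinetic_mev):
    """True if UE (a total energy) is strictly above the source maximum."""
    return ue_total_mev > total_energy(max_source_kinetic_mev)


max_total = total_energy(1.377)
covered = ue_covers_source(2.010, 1.377)
print(f"max total energy = {max_total:.3f} MeV, UE covers source: {covered}")
```

This only confirms that UE sits above the source maximum; it says nothing about whether AE is sensible relative to UE.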
However, I noticed that the MC transport parameters were set to the low energy default, so Global ECUT and Source ECUT were set to 1.512. Could this be the cause of the problem? I have now switched it to the high energy default (2.012 ECUTs); we'll see if this crashes too.
Unlikely, given that `ECUT` is normally reset to `AE` if `ECUT < AE`. I am more worried about `AE` set the same as `UE`, in terms of the interpolation tables that EGSnrc builds...
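As a sketch of the reset rule described in the comment above (a paraphrase of that comment, not code taken from EGSnrc or egs_brachy), the effective cutoff is the larger of `ECUT` and `AE`. The function name is hypothetical.

```python
def effective_ecut(ecut, ae):
    """Cutoff actually used if ECUT is raised to AE whenever ECUT < AE."""
    return max(ecut, ae)


# Values quoted in the thread (total energies, MeV).
low_default = effective_ecut(1.512, 2.010)   # ECUT below AE: raised to AE
high_default = effective_ecut(2.012, 2.010)  # ECUT above AE: unchanged
print(low_default, high_default)
```

Under that rule, the low energy default of 1.512 would be raised to 2.010 rather than used as-is.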
Oh oops, I should clarify the settings are:
AE = 2.010 UE = 2.011
> How many media are there in the simulation? I am wondering if this error simply stems from a $MXMED macro that is not set high enough. Trying the simple ideas first!
In the same vein, has `$MXREG` been adjusted accordingly? I don't know if it's the cause of this error but the default is fairly low (2000).
@mxxo that's an interesting one. I've run hundreds of egs++ simulations with 250,000+ regions and was never aware of this `$MXREG`, so I've always used this default.
Worth looking into, though!
Me too (15e6 regions recently). I can only guess that `$MXREG` is only effective in Mortran applications? But now I am curious about that too!
> Unlikely, given that ECUT is normally reset to AE if ECUT < AE. I am more worried about AE set the same as UE, in terms of the interpolation tables that EGSnrc builds...
Hmm, I switched the Global and Source ECUTs to the high energy default (2.012) and it appears to be working. Currently running the simulation and it's completed 2/10 of the 10e7 batches so far, whereas before it was crashing near the end of the first batch.
@ftessier 🥇for you. Setting AE and UE to more reasonable values (0.512, 2.012) seems to resolve the issue. Thanks for the help!
Thanks, happy that we figured this out!
A stack trace when running egs_brachy through a debugger gives the following error (occurs on the order of 10e6 histories into the simulation):
This occurs for Ir192 HDR simulations with huge patient CTs (1024 x 1024 x ~200), generated egsphants and egsinp files in eb_GUI. Google Drive link to necessary input files (patient is anonymized): https://drive.google.com/file/d/1__mFJ0-W2YkqDA666_1qMyS2UfqztLCZ/view?usp=sharing