egs_chamber unable to initialize more than 30 BEAM sources if they share a common parent phase space

vahx129 commented 7 years ago

I've been struggling with the Varian above-jaws IAEA phase spaces in the hope of running TrueBeam VMAT simulations in egs_chamber. This pursuit led me to what I believe is an egs_chamber issue.

To give some background, my simulations are meant to represent VMAT treatments from our TrueBeam StX linac with HDMLC. In egs_chamber, I'm simulating dose to an A1SL model volume embedded in a cylindrical phantom. The TrueBeam StX BEAMnrc model includes the SYNCJAWS and SYNCHDMLC components downstream of the Varian-provided 6x IAEA phase space. The SYNCHDMLC module is "modulated" to create the VMAT dose distributions. Since egs_chamber doesn't support a dynamic VMAT source, I'm simply turning each control point into its own static transformed source and then using an egs_chamber composite source to simulate the arc.

I've created NumCP BEAMnrc input files, where NumCP is the number of control points in the arc and each file contains one MLC pattern. Every BEAMnrc input file points to the same above-jaws phase space, in the IAEA format that Varian provides for the TrueBeam (I'm specifically using 6x). I then create a composite source in egs_chamber using the control point weights from the plan's DICOM header. The issue seems to depend on the number of sources being initialized in egs_chamber: without fail, the simulation crashes after the 30th BEAM source is initialized.
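A minimal Python sketch of the weighting step, assuming the per-control-point weights are derived from the DICOM Cumulative Meterset Weight of each control point. The half-interval convention used here (each static control point receives half of the meterset delivered in the segments on either side of it) is an assumption, not necessarily the convention used in this thread, and the helper names are hypothetical. The output follows the egs++ `egs_source_collection` keys (`source names`, `weights`) and assumes BEAM sources named cp001, cp002 and so on are already defined in the same input.

```python
"""Hypothetical helpers: turn DICOM cumulative meterset weights into
egs_source_collection weights, one per static control-point source."""


def control_point_weights(cumulative):
    """Give each control point half the meterset of its neighbouring segments
    (assumed half-interval convention), normalized to sum to 1."""
    n = len(cumulative)
    if n < 2:
        raise ValueError("need at least two control points")
    weights = []
    for i in range(n):
        # Meterset delivered between the previous control point and this one.
        left = cumulative[i] - cumulative[i - 1] if i > 0 else 0.0
        # Meterset delivered between this control point and the next one.
        right = cumulative[i + 1] - cumulative[i] if i < n - 1 else 0.0
        weights.append(0.5 * (left + right))
    total = sum(weights)
    if total <= 0:
        raise ValueError("cumulative meterset weights must increase")
    return [w / total for w in weights]


def source_collection_block(names, weights, collection_name="arc"):
    """Format an egs++ egs_source_collection block over already-defined sources."""
    if len(names) != len(weights):
        raise ValueError("need exactly one weight per source name")
    lines = [
        ":start source:",
        "    library = egs_source_collection",
        f"    name = {collection_name}",
        "    source names = " + " ".join(names),
        "    weights = " + " ".join(f"{w:.6f}" for w in weights),
        ":stop source:",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy arc with four control points (illustrative values only).
    w = control_point_weights([0.0, 0.25, 0.5, 1.0])
    names = [f"cp{i + 1:03d}" for i in range(len(w))]
    print(source_collection_block(names, w))
```

Since `egs_source_collection` samples each source in proportion to its weight, only the relative weights matter; normalizing them just makes the generated input easier to check by eye.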

I specifically tested a patient whose plan consisted of 5 arcs. The first 3 arcs had over 100 control points each and the last 2 had only 10. The last two arcs simulated without issue, and the tracks qualitatively agreed with the TPS. None of the arcs with more than 30 control points worked. By cutting out control points, I was able to run arcs 1, 2, and 3 for that patient using only their first 29 control points. To be clear, the simulation crashes while initializing the BEAM input files if there are more than 30 of them. Otherwise it continues normally and finishes with reasonable-looking results.

Looking in the BEAM_TrueBeam folder, I see run folders created for the first 30 sources. In the 30th run folder, the .egslog file ends right before the point where the phase space information would normally appear. This leads me to believe the issue has something to do with multiple BEAM sources all accessing the same phase space file. I'm curious why it seems able to handle up to 30 simultaneous instances without issue, but not more.

Raising $max_unit in the egs_utilities.mortran file to successively higher values had no effect. I've also reproduced the issue with a different BEAM_accelerator (a 21EX model). With the 21EX model we don't need phase spaces at all, since we have the full specifications. If I give egs_chamber hundreds of BEAM input sources that make no use of a phase space, I get very reasonable-looking VMAT approximations.

Unfortunately, an Odds/Evens fix I previously used to get around this issue also made no difference, so I'm not totally sure whether this is strictly an I/O issue or something else.

vahx129 commented 7 years ago

I also wanted to add an image showing how the 10-control-point plans had tracks that closely resembled the TPS.

crcrewso commented 7 years ago

If possible, I'd like a copy of some sample input files and your module file. I just sorted out some huge issues with our new cluster, and now we have no problem running 10B+ histories divided over 36 jobs from 1 source through shared libraries.

vahx129 commented 7 years ago

Absolutely! Thanks!

I've created a zip containing all the relevant files I think you would need. The BEAMnrc inputs need their phase space source location updated. I create them in batch using Matlab, so if you tell me the directory of the phase space you'd like them to point to, I can send you updated files and no manual renaming will be necessary. Otherwise, I've included the linac module, the egs_chamber inputs (which I don't think should need any adjusting), and the PEGS file I use.

The patient file naming scheme is TB_XXX_YYY_NNN_MMM where XXX and YYY are just anonymized patient and plan identifiers so they don't really matter. NNN is the arc and MMM is the control point within the arc. Each BEAM input file has a static MLC configuration and should be self-contained. If this works, this is how I would ultimately try to approximate VMAT plans in egs_chamber anyway.
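For batch scripting, the naming scheme can be parsed and grouped by arc. This Python sketch assumes that the fields are separated by single underscores, that NNN and MMM are integers, and that an optional extension such as `.egsinp` may follow; the helper names are hypothetical. It also flags arcs whose source count exceeds the 30-source limit reported in this thread, which is an empirical observation rather than a documented EGSnrc constant.

```python
import re
from collections import defaultdict

# Assumed pattern: TB_<patient>_<plan>_<arc>_<control point>[.<extension>]
NAME_RE = re.compile(r"^TB_([^_]+)_([^_]+)_(\d+)_(\d+)(?:\.\w+)?$")

# Empirical limit observed in this thread (not a documented EGSnrc constant).
OBSERVED_LIMIT = 30


def parse_name(filename):
    """Split a file name into (patient, plan, arc number, control point number)."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    patient, plan, arc, cp = m.groups()
    return patient, plan, int(arc), int(cp)


def group_by_arc(filenames):
    """Map arc number -> file names sorted by control point number."""
    arcs = defaultdict(list)
    for name in filenames:
        _, _, arc, cp = parse_name(name)
        arcs[arc].append((cp, name))
    return {arc: [name for _, name in sorted(items)]
            for arc, items in sorted(arcs.items())}


def arcs_over_limit(groups, limit=OBSERVED_LIMIT):
    """Return the arcs that would need more than `limit` BEAM sources."""
    return [arc for arc, names in groups.items() if len(names) > limit]
```

For example, `group_by_arc(["TB_001_002_001_002.egsinp", "TB_001_002_001_001.egsinp"])` returns `{1: ["TB_001_002_001_001.egsinp", "TB_001_002_001_002.egsinp"]}`, so the control points of each arc come out in delivery order even if the directory listing does not.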

It probably doesn't matter whether you use the TrueBeam IAEA phase space or some other one, but in case it's relevant, I'll mention again that I used the v2 IAEA TrueBeam 6x phase space from Varian for these simulations.

The link to the files is here: https://drive.google.com/open?id=0B6LPx-cahe65NDZyVWRmVlhfZTQ

Thanks so much for your help and if there's anything else you need please let me know!

crcrewso commented 7 years ago

I'm running through your test files but I had a thought related to scheduling and collisions. Which queuing system are you using to run this on and are you the admin?

vahx129 commented 7 years ago

Our cluster uses TORQUE. I'm not the admin myself, but we have dedicated staff who manage the cluster and know much more about it than I do.

It might be worth emphasizing though that this issue I described here specifically was not encountered in a cluster simulation. This was just a regular Windows box.

In general, though, my life would be far easier if shared-library BEAM sources with a common parent phase space could run without collisions in a parallel computing environment; specifically, if egs_chamber could accept multiple BEAMnrc input sources that all reference the same above-jaws phase space. I mentioned this in a separate thread, so I didn't want to derail this one, which deals specifically with a hard limit of 30 sources in a non-parallel environment.

crcrewso commented 7 years ago

How are you launching multiple simulations on a single windows box?

vahx129 commented 7 years ago

Sorry for not making that clearer. I'm only submitting a single egs_chamber simulation per beam, so each of those inputs is independent. Within the egs_chamber input file, multiple BEAM simulations are called via shared libraries, and all of those BEAM simulations draw particles from the same above-jaws phase space. So any of the first 3 egs_chamber simulations I sent you should fail if you run them on a Windows box, while the last two should work. But I'm only running one at a time.

Does that answer your question? Thanks!

crcrewso commented 7 years ago

Yes it does, thank you. I had thought there were collisions from too many simultaneous accesses of the same source, but egs_chamber only gets one particle at a time, so there shouldn't be any I/O issues there.

vahx129 commented 7 years ago

I would be very interested to see whether the issue can be reproduced with the inputs I sent you. I got it to happen on 2 Windows desktops and 1 Windows laptop. Other than Windows, I can't think of any common denominator: one box was running EGS_2016 and the other two were using EGS_2017.

Ultimately I would want this to work on a cluster so if it can be made compatible with our cluster, that would be ideal.

vahx129 commented 7 years ago

Ahh, sorry about that. TJM is the PEGS file for the BEAM simulations only. Here's the other PEGS file for those materials:

https://drive.google.com/file/d/0B6LPx-cahe65ODRCODZiTXgyQVE/view?usp=drivesdk

crcrewso commented 7 years ago

My results so far:

I cut the file down to only the first 10 inputs, and this worked.

I recompiled everything with $MAX_unit=9999, and it successfully loaded 30 inputs before segfaulting.

I'm going to have to leave this for the Dev team. If someone would like me to run any tests, as I have a working set on a cluster, please let me know.

One piece of weirdness: when I ran the test file (first 10 inputs) in parallel, it left the lock file and ptracks around after completion.

vahx129 commented 7 years ago

Good to know the issue is reproducible, at least! In the past I've crashed our cluster by running files like this, even when I tried to terminate them almost immediately, so I can't comment on how 30 or fewer sources would work on our cluster; I've been scared to try.

You mentioned that you're able to run multiple shared libraries that share a parent source without IO collisions. On our cluster, even a single BEAM source present in an egs_chamber input with nothing else will fail.

I don't want to get too off topic with this specific bug but could you tell me how you got shared libraries working on your cluster without collision issues?

crcrewso commented 7 years ago

I'm thinking we should take this whole conversation offline. Could you post either to my branch or to Google+? I think there might be better ways to get the information you need than using egs_chamber for something it wasn't designed for.

vahx129 commented 7 years ago

Sure, could you reach out to me on Hangouts? My email is rayjesconstruct@gmail.com. I could close this entire thread, but should it be left open for its original issue regarding the 30 sources?

I'd be interested in alternative ways to obtain chamber-specific IMRT corrections if they exist. I'm wary of using DOSXYZ because of voxel size limitations, and I like the speed and geometric flexibility of egs_chamber.

vahx129 commented 7 years ago

I wanted to add another, perhaps related, piece of information to this thread: I'm also having issues running egs_chamber simulations that contain multiple VMAT arcs and therefore several hundred static control points.

For whatever reason, the magic number I appear to be running into is 202 BEAM sources. Most arcs contain fewer than 200 control points, and these simulations work fine as long as no phase space is involved anywhere in the process. However, if I try to simulate two or more arcs simultaneously, the simulation terminates after initializing the 202nd BEAM source, without any explicit error message.

The only value I've tried adjusting to compensate is $max_unit, without any luck. I wonder whether there are more fundamental issues with handling so many shared-library sources at once?

nrc-cnrc / EGSnrc, issue #311