metrumresearchgroup / bbi

Next generation modeling platform
11 stars 2 forks source link

bbi sometimes errors when submitting with `--overwrite` while model is running #322

Open seth127 opened 1 month ago

seth127 commented 1 month ago

I'm not fully clear what's happening here, but I noticed a strange behavior where submitting a model with --overwrite while that same model is currently running while cause bbi to error out sometimes. Some examples are below, but this seems to be unreliable/flaky, because sometimes I was able to reproduce it and other times it just seemed to work as expected.

I'm not sure if there's anything we can really do here, at least on the bbi side, but I wanted to document so we can look into a little deeper at some point.

As a side comment: this may be an argument for making submit_models() check the bbr::submit_models(..., .overwrite) arg first and deleting directories up front. There is some discussion related to this in comments on bbr#691.

truncated error output from bbr This is where I originally saw, trying to re-run models that had failed or were stuck from a `bbr` bootstrap (working with [unreleased `bbr 1.10.0.8005`](https://github.com/metrumresearchgroup/bbr/releases/tag/1.10.0.8005)). ```r boot_mods <- get_boot_models(boot_run) unfinished <- which(!get_model_status(boot_run)$finished) submit_models( boot_mods[unfinished], .overwrite = TRUE, .config_path = file.path(MODEL_DIR, "bbi.yaml") # needed for manually submitting sub-models ) Submitting 71 models with 1 unique configurations. Error in `map()` at bbr/R/submit-models.R:147:3: ℹ In index: 1. Caused by error in `check_status_code()`: ! `bbi nonmem run sge models/pk/106-boot/032.ctl models/pk/106-boot/036.ctl ... ``` It went on longer than this, but got cut off eventually. From here, I ran the printed `bbi` command directly in the terminal and posted the full output below.
full error output from bbi run in terminal ``` $ qstat -f queuename qtype resv/used/tot. load_avg arch states --------------------------------------------------------------------------------- all.q@ip-10-128-0-166.ec2.inte BIP 0/3/4 1.92 lx-amd64 703 0.55500 Run_117 sethg r 05/31/2024 13:40:09 1 716 0.55500 Run_167 sethg r 05/31/2024 13:40:09 1 731 0.55500 Run_189 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-1-204.ec2.inte BIP 0/3/4 2.14 lx-amd64 711 0.55500 Run_157 sethg r 05/31/2024 13:40:09 1 725 0.55500 Run_183 sethg r 05/31/2024 13:40:09 1 739 0.55500 Run_197 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-10-162.ec2.int BIP 0/4/4 1.77 lx-amd64 699 0.55500 Run_113 sethg r 05/31/2024 13:40:09 1 713 0.55500 Run_162 sethg r 05/31/2024 13:40:09 1 727 0.55500 Run_185 sethg r 05/31/2024 13:40:09 1 741 0.55500 Run_199 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-10-199.ec2.int BIP 0/4/4 1.42 lx-amd64 691 0.55500 Run_105 sethg r 05/31/2024 13:40:09 1 702 0.55500 Run_116 sethg r 05/31/2024 13:40:09 1 717 0.55500 Run_169 sethg r 05/31/2024 13:40:09 1 730 0.55500 Run_188 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-23-53.ec2.inte BIP 0/1/4 3.97 lx-amd64 688 0.55500 Run_102 sethg r 05/31/2024 13:39:54 1 --------------------------------------------------------------------------------- all.q@ip-10-128-26-0.ec2.inter BIP 0/4/4 1.63 lx-amd64 697 0.55500 Run_111 sethg r 05/31/2024 13:40:09 1 710 0.55500 Run_156 sethg r 05/31/2024 13:40:09 1 724 0.55500 Run_182 sethg r 05/31/2024 13:40:09 1 738 0.55500 Run_196 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-40-102.ec2.int BIP 0/4/4 1.79 lx-amd64 700 0.55500 Run_114 sethg r 05/31/2024 13:40:09 1 714 0.55500 Run_163 sethg r 05/31/2024 13:40:09 1 728 0.55500 Run_186 sethg r 05/31/2024 13:40:09 1 742 0.55500 Run_200 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-42-172.ec2.int BIP 0/4/4 1.46 lx-amd64 692 0.55500 Run_106 sethg r 05/31/2024 13:40:09 1 706 0.55500 Run_134 sethg r 05/31/2024 13:40:09 1 719 0.55500 Run_174 sethg r 05/31/2024 13:40:09 1 734 0.55500 Run_192 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-43-190.ec2.int BIP 0/4/4 1.50 lx-amd64 694 0.55500 Run_108 sethg r 05/31/2024 13:40:09 1 707 0.55500 Run_151 sethg r 05/31/2024 13:40:09 1 721 0.55500 Run_176 sethg r 05/31/2024 13:40:09 1 735 0.55500 Run_194 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-45-165.ec2.int BIP 0/4/4 1.53 lx-amd64 695 0.55500 Run_109 sethg r 05/31/2024 13:40:09 1 708 0.55500 Run_152 sethg r 05/31/2024 13:40:09 1 722 0.55500 Run_180 sethg r 05/31/2024 13:40:09 1 736 0.55500 Run_193 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-59-142.ec2.int BIP 0/4/4 1.58 lx-amd64 696 0.55500 Run_110 sethg r 05/31/2024 13:40:09 1 709 0.55500 Run_155 sethg r 05/31/2024 13:40:09 1 723 0.55500 Run_178 sethg r 05/31/2024 13:40:09 1 737 0.55500 Run_195 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-61-40.ec2.inte BIP 0/4/4 1.41 lx-amd64 690 0.55500 Run_104 sethg r 05/31/2024 13:40:09 1 701 0.55500 Run_115 sethg r 05/31/2024 13:40:09 1 715 0.55500 Run_164 sethg r 05/31/2024 13:40:09 1 729 0.55500 Run_187 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-61-64.ec2.inte BIP 0/4/4 1.46 lx-amd64 693 0.55500 Run_107 sethg r 05/31/2024 13:40:09 1 705 0.55500 Run_121 sethg r 05/31/2024 13:40:09 1 720 0.55500 Run_177 sethg r 05/31/2024 13:40:09 1 733 0.55500 Run_191 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-63-188.ec2.int BIP 0/3/4 1.95 lx-amd64 704 0.55500 Run_118 sethg r 05/31/2024 13:40:09 1 718 0.55500 Run_173 sethg r 05/31/2024 13:40:09 1 732 0.55500 Run_190 sethg r 05/31/2024 13:40:09 1 --------------------------------------------------------------------------------- all.q@ip-10-128-7-197.ec2.inte BIP 0/4/4 1.73 lx-amd64 698 0.55500 Run_112 sethg r 05/31/2024 13:40:09 1 712 0.55500 Run_160 sethg r 05/31/2024 13:40:09 1 726 0.55500 Run_184 sethg r 05/31/2024 13:40:09 1 740 0.55500 Run_198 sethg r 05/31/2024 13:40:09 1 $ /data/apps/bbi nonmem run sge models/pk/106-boot/032.ctl models/pk/106-boot/036.ctl models/pk/106-boot/058.ctl models/pk/106-boot/067.ctl models/pk/106-boot/071.ctl models/pk/106-boot/086.ctl models/pk/106-boot/087.ctl models/pk/106-boot/088.ctl models/pk/106-boot/090.ctl models/pk/106-boot/091.ctl models/pk/106-boot/092.ctl models/pk/106-boot/093.ctl models/pk/106-boot/098.ctl models/pk/106-boot/099.ctl models/pk/106-boot/100.ctl models/pk/106-boot/101.ctl models/pk/106-boot/102.ctl models/pk/106-boot/103.ctl models/pk/106-boot/104.ctl models/pk/106-boot/105.ctl models/pk/106-boot/106.ctl models/pk/106-boot/107.ctl models/pk/106-boot/108.ctl models/pk/106-boot/109.ctl models/pk/106-boot/110.ctl models/pk/106-boot/111.ctl models/pk/106-boot/112.ctl models/pk/106-boot/113.ctl models/pk/106-boot/114.ctl models/pk/106-boot/115.ctl models/pk/106-boot/116.ctl models/pk/106-boot/117.ctl models/pk/106-boot/118.ctl models/pk/106-boot/121.ctl models/pk/106-boot/134.ctl models/pk/106-boot/151.ctl models/pk/106-boot/152.ctl models/pk/106-boot/155.ctl models/pk/106-boot/156.ctl models/pk/106-boot/157.ctl models/pk/106-boot/160.ctl models/pk/106-boot/162.ctl models/pk/106-boot/163.ctl models/pk/106-boot/164.ctl models/pk/106-boot/167.ctl models/pk/106-boot/169.ctl models/pk/106-boot/173.ctl models/pk/106-boot/174.ctl models/pk/106-boot/176.ctl models/pk/106-boot/177.ctl models/pk/106-boot/178.ctl models/pk/106-boot/180.ctl models/pk/106-boot/182.ctl models/pk/106-boot/183.ctl models/pk/106-boot/184.ctl models/pk/106-boot/185.ctl models/pk/106-boot/186.ctl models/pk/106-boot/187.ctl models/pk/106-boot/188.ctl models/pk/106-boot/189.ctl models/pk/106-boot/190.ctl models/pk/106-boot/191.ctl models/pk/106-boot/192.ctl models/pk/106-boot/193.ctl models/pk/106-boot/194.ctl models/pk/106-boot/195.ctl models/pk/106-boot/196.ctl models/pk/106-boot/197.ctl models/pk/106-boot/198.ctl models/pk/106-boot/199.ctl models/pk/106-boot/200.ctl --overwrite --config=models/pk/bbi.yaml INFO[0000] Successfully loaded specified configuration from models/pk/bbi.yaml INFO[0000] Beginning Local Path INFO[0000] A total of 71 models have been located for work INFO[0000] [032] Beginning SGE work phase INFO[0000] [036] Beginning SGE work phase INFO[0000] [058] Beginning SGE work phase INFO[0000] [067] Beginning SGE work phase INFO[0000] [071] Beginning SGE work phase INFO[0000] [086] Beginning SGE work phase INFO[0000] [087] Beginning SGE work phase INFO[0000] [088] Beginning SGE work phase INFO[0000] [090] Beginning SGE work phase INFO[0000] [091] Beginning SGE work phase INFO[0000] [092] Beginning SGE work phase INFO[0000] [093] Beginning SGE work phase INFO[0000] [098] Beginning SGE work phase INFO[0000] [099] Beginning SGE work phase INFO[0000] [100] Beginning SGE work phase INFO[0000] [101] Beginning SGE work phase INFO[0000] [103] Beginning SGE work phase INFO[0000] [104] Beginning SGE work phase INFO[0000] [102] Beginning SGE work phase INFO[0000] [105] Beginning SGE work phase INFO[0000] [106] Beginning SGE work phase INFO[0000] [108] Beginning SGE work phase INFO[0000] [107] Beginning SGE work phase INFO[0000] [110] Beginning SGE work phase INFO[0000] [109] Beginning SGE work phase INFO[0000] [114] Beginning SGE work phase INFO[0000] [115] Beginning SGE work phase INFO[0000] [116] Beginning SGE work phase INFO[0001] [117] Beginning SGE work phase INFO[0001] [121] Beginning SGE work phase INFO[0001] [134] Beginning SGE work phase INFO[0001] [118] Beginning SGE work phase INFO[0001] [152] Beginning SGE work phase INFO[0002] [164] Beginning SGE work phase INFO[0002] [162] Beginning SGE work phase INFO[0002] [169] Beginning SGE work phase INFO[0002] [167] Beginning SGE work phase INFO[0002] [174] Beginning SGE work phase INFO[0002] [173] Beginning SGE work phase INFO[0002] [180] Beginning SGE work phase INFO[0002] [177] Beginning SGE work phase INFO[0002] [182] Beginning SGE work phase INFO[0003] [183] Beginning SGE work phase INFO[0003] [184] Beginning SGE work phase INFO[0003] [186] Beginning SGE work phase INFO[0003] [187] Beginning SGE work phase INFO[0003] [188] Beginning SGE work phase INFO[0003] [189] Beginning SGE work phase INFO[0003] [192] Beginning SGE work phase INFO[0003] [191] Beginning SGE work phase INFO[0003] [193] Beginning SGE work phase INFO[0003] [195] Beginning SGE work phase INFO[0003] [194] Beginning SGE work phase INFO[0003] [196] Beginning SGE work phase INFO[0003] [197] Beginning SGE work phase INFO[0004] [198] Beginning SGE work phase INFO[0004] [200] Beginning SGE work phase INFO[0004] [199] Beginning SGE work phase INFO[0004] Waiting for any post execution hooks to finish ERRO[0004] 13 errors were experienced during the run ERRO[0004] Errors were experienced while running model 111. Details are unlinkat models/pk/106-boot/111/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 112. Details are unlinkat models/pk/106-boot/112/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 113. Details are unlinkat models/pk/106-boot/113/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 151. Details are unlinkat models/pk/106-boot/151/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 155. Details are unlinkat models/pk/106-boot/155/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 156. Details are unlinkat models/pk/106-boot/156/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 157. Details are unlinkat models/pk/106-boot/157/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 160. Details are unlinkat models/pk/106-boot/160/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 163. Details are unlinkat models/pk/106-boot/163/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 176. Details are unlinkat models/pk/106-boot/176/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 178. Details are unlinkat models/pk/106-boot/178/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 185. Details are unlinkat models/pk/106-boot/185/temp_dir: directory not empty ERRO[0004] Errors were experienced while running model 190. Details are unlinkat models/pk/106-boot/190/temp_dir: directory not empty 71 models completed in 4.211071473s ```
trying (and failing) to reproduce reliably with a single model ``` $ qstat -f queuename qtype resv/used/tot. load_avg arch states --------------------------------------------------------------------------------- all.q@ip-10-128-58-134.ec2.int BIP 0/1/4 1.23 lx-amd64 801 0.55500 Run_032 sethg r 05/31/2024 13:43:39 1 --------------------------------------------------------------------------------- sethg@ip-10-128-40-150:$ bbi nonmem run sge /data/home/sethg/models/pk/106-boot/032.ctl --overwrite INFO[0000] Successfully loaded specified configuration from /data/home/sethg/models/pk/bbi.yaml INFO[0000] Beginning Local Path INFO[0000] A total of 1 models have been located for work INFO[0000] [032] Beginning SGE work phase INFO[0000] Waiting for any post execution hooks to finish 1 models completed in 10.411396ms ```
kyleam commented 1 month ago

Details are unlinkat models/pk/106-boot/151/temp_dir: directory not empty

I think that's consistent with something like this:

In this case, process 2 is the run sge call that's removing the output directory in the Prepare step.


It's probably tricky to reliably time things to trigger any given error, but here's one example of an error that can be triggered by the script below (that uses local, but same core problem):

INFO[0000] [1] Beginning local work phase
ERRO[0004] [1] Exit code was 110, details were exit status 110
ERRO[0004] [1] output details were: Starting NMTRAN

 WARNINGS AND ERRORS (IF ANY) FOR PROBLEM    1

 (WARNING  2) NM-TRAN INFERS THAT THE DATA ARE POPULATION.

 LIM VALUES MAXLIM ASSESSED BY NMTRAN: 1,2,3,4,5,6,7,8,10,11,13,15,16

Note: Analytical 2nd Derivatives are constructed in FSUBS but are never used.
      You may insert $ABBR DERIV2=NO after the first $PROB to save FSUBS construction and compilation time

cat: trash.tmp: No such file or directory
Recompiling certain components
cp: cannot create regular file 'temp_dir/': Not a directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/PRED.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/PREDI.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/CHECK.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/SADVAN.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/ADVAN2.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/SSS0.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/TRANS2.o': No such file or directory
cp: cannot stat '/opt/NONMEM/nm75/pr/temp_dir/INFN.o': No such file or directory
Building NONMEM Executable
gfortran: error: temp_dir/SSS0.o: No such file or directory
gfortran: error: temp_dir/INFN.o: No such file or directory
No nonmem execution.
INFO[0004] Waiting for any post execution hooks to finish
ERRO[0004] 1 errors were experienced during the run
ERRO[0004] Errors were experienced while running model 1.mod. Details are Running the programmatic shell script caused an error
script ```sh #!/bin/sh set -eu tdir=$(mktemp -d "${TMPDIR:-/tmp}"/bbi-322-XXXXXXX) cd "$tdir" url_base=https://raw.githubusercontent.com/metrumresearchgroup/bbi/v3.3.0/integration/testdata/acop curl -fSsL "$url_base/acop.mod" >1.mod curl -fSsLO "$url_base/acop.csv" bbi init --dir /opt/NONMEM bbi nonmem run local --nm_version nm75 1.mod >run1.out 2>run1.err & bbi nonmem run local --nm_version nm75 --overwrite 1.mod ```

I'm not sure if there's anything we can really do here, at least on the bbi side, but I wanted to document so we can look into a little deeper at some point.

If you have two bbi processes running and racing against each other, I don't think there's anything to do to avoid all the possible errors.

I'd say the main thing to do is to prevent that in the first place by having bbi write some sort of lock file to the output directory that indicates that it is running and then have it remove the lock after it completes (related to last point here). Any subsequent bbi invocation would refuse to work on a directory with a lock file, giving a clear error. I don't think you're hoping for with --overwrite, but in my opinion aborting if there's another bbi process already working with an output directory is reasonable behavior.

One risk is when bbi dies unexpectedly without a chance to clean that up. Then you have a stale lock file around that would cause a future bbi to refuse to work. bbi could mention in the error message that, if you know another bbi process isn't running, you can manually remove the lock file. (This is similar to what Git does with its index lock.)