Closed by kellijohnson-NOAA 2 years ago
I agree, and this is good to revisit later. I wonder if we couldn't just have a list of parameter or derived-quantity names that then populates either keyposteriors or nuisanceposteriors or similar. Good idea. This will be particularly beneficial when moving to all (or most) MCMC runs using adnuts.
You know, it might be as simple as adding these parameters to the extra reporting section of the control file, but I am not sure.
This is big on the wish list. The code for running and maintaining the extra mcmc is large and awkward. It would be nice to do away with it.
@Cole-Monnahan-NOAA is also interested in this workflow as it relates to applied use of adnuts, so we should keep him in the loop also
@aaronmberger-nwfsc Thanks for looping me in here. Let's have a more formal discussion later. One quick note. As I understand it, you take a row from the .psv file and run it in "MLE" mode to get a Report.sso file. You're accounting for the bias correction difference between MLE/MCMC modes, right? Not doing so would lead to the wrong predictions.
I had a conversation with Teresa A'mar (@amart) about this years ago and we discussed two alternative approaches to building in the calculation of additional quantities from MCMC into SS.
1. Adding all the needed output to files that get written automatically in the -mceval phase. The problem here is that it's easy to think of additional things that you might want to get from the MCMC output. Right now, I think the list includes:
Which adds up to a few thousand quantities. I don't think it's good to include these in the extra reporting section of the control file (I think index fits and numbers-at-age for an individual year are already there) because those quantities become sdreport variables and slow down calculations after inverting the Hessian in MLE mode. Nevertheless, it seems feasible to add an option for the list above to be reported in posteriors_vectors.sso or a new output file created during -mceval.
2. Adding a switch to automatically generate the report file for each MCMC sample. This probably would not speed things up, but it would save the extra steps in the hake workflow of mapping the posteriors.sso rows into ss.par and then iteratively re-running the model and renaming the output. This was the path we preferred at the time, and for a while the starter file listed these options:
0 # MCMC output detail (0=default; 1=obj func components; 2=expanded; 3=make output subdir for each MCMC vector)
But only options 0 and 1 ever got implemented, so the list has since been reduced to reflect that.
@Cole-Monnahan-NOAA regarding bias adjustment, it looks like that is indeed sorted out here: https://github.com/pacific-hake/hake-assessment/blob/master/R/extra-mcmc.R#L57-L69
Another option would be a command-line switch to include Q in the output during the -mceval phase. This should be fairly straightforward to implement in the TPL (no dealing with control-file parsing, etc.).
@iantaylor-NOAA - I am not sure the bias adjustment is actually fully dealt with, because when you look at the resulting likelihood values in the MLE context from each run, there are non-zero values for the difference between the Recruitment and NoBias_corr_Recruitment(info_only) likelihood components.
This topic is currently being discussed on the SS developers VLab website under #75754 (referencing this for personal traceability in the future, but the forum is private).
@kellijohnson-NOAA, regarding bias adjustment, I just compared the likelihood equation from Methot and Taylor (2011) to the code in the SS TPL file that calculates NoBias_corr_Recruitment(info_only):

```
noBias_recr_like = recr_like - sd_offset_rec*log(sigmaR) + (recdev_end - recdev_first + 1.)*log(sigmaR);
```

It looks to me like NoBias_corr_Recruitment removes the ln(sigmaR) term from the likelihood entirely, whereas in MCMC that term is included and the b_y values are all equal to 1.0.
I should note that I've never actually used the NoBias_corr_Recruitment(info_only) value, but I believe that it is for Jim Thorson's Laplace approximation calculations as discussed in https://github.com/r4ss/r4ss/issues/44.
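For traceability, the TPL line above can be rearranged algebraically. This is just algebra on the line as shown; the reading of sd_offset_rec as the sum of the b_y fractions is my assumption, not something confirmed by the source:

```latex
\text{noBias\_recr\_like}
  = \text{recr\_like} + \left(N - \text{sd\_offset\_rec}\right)\ln\sigma_R,
\qquad N = \text{recdev\_end} - \text{recdev\_first} + 1 .
```

So if sd_offset_rec does equal \(\sum_y b_y\), the "no-bias" value corresponds to counting a full \(\ln\sigma_R\) term for every estimated recdev year, i.e. the \(b_y = 1\) case described above.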
@cgrandin calculated 35 minutes for 2000 samples. Now we are trying system('xcopy ...') to move files, to see if that is faster than file.copy().
xcopy was slower, so now I am trying to split the posteriors up based on the number of cores and then use the start command in batch files to run the MLEs in parallel processes. Within R on Windows you cannot run outside processes in parallel, so this is kind of a hack, but if it works it should speed things up quite a lot.
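A minimal sketch of that batch-file idea, for anyone curious. The chunking helper, the chunk_ folder layout, and the ss.exe path are all illustrative assumptions, not the actual hake code:

```r
# Split the row indices of the posterior sample into one chunk per core,
# then write a Windows batch file that launches one SS run per chunk
# with `start`, so the processes run concurrently.
chunk_rows <- function(n_rows, n_cores) {
  split(seq_len(n_rows), cut(seq_len(n_rows), n_cores, labels = FALSE))
}

write_parallel_bat <- function(chunks, exe = "ss.exe",
                               bat_file = "run_chunks.bat") {
  # One `start` line per chunk directory (each directory is assumed to
  # already hold the model files and that chunk's posterior rows).
  lines <- vapply(seq_along(chunks), function(i) {
    sprintf('start /B cmd /C "cd chunk_%02d && %s -mceval"', i, exe)
  }, character(1))
  writeLines(lines, bat_file)
  bat_file
}

chunks <- chunk_rows(2000, 7)
bat <- write_parallel_bat(chunks,
                          bat_file = file.path(tempdir(), "run_chunks.bat"))
```

The batch file itself then does the parallelism, sidestepping the question of whether R can.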
Within ss3sim we run SS simultaneously using multiple cores.
Yeah, you can do it from inside a program (i.e. parallelism compiled into the executable); you just can't run already-compiled executables in parallel in Windows. There is no fork() command in the Windows OS. It's seen as a huge downfall of Windows. We should all be using Linux/Mac.
I don't think this is strictly true. adnuts accomplishes this with snowfall by opening new R terminals, copying the ss.exe into temporary folders, and running them in parallel (here and here for the curious). There's nothing special about it. Rstan does the same thing.
It should be easy to recreate this on the processing end, I think: just chunk up the .psv, split it off into separate SS calls, read in the files in each process, then merge them. It's that last merging step that could be complicated, depending on the output file structure, but it's trivial for flat files.
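The chunk-and-merge shape could look something like this. The per-chunk "run SS" step is a stand-in stub, and the file layout and column names are invented for illustration, not the actual hake file structure:

```r
# Chunk the posterior draws, pretend each chunk was processed by a
# separate SS call that wrote a flat (one row per draw) output file,
# then merge the flat files back together with rbind.
process_chunks <- function(posteriors, n_chunks, out_dir = tempdir()) {
  idx <- split(seq_len(nrow(posteriors)),
               cut(seq_len(nrow(posteriors)), n_chunks, labels = FALSE))
  files <- character(n_chunks)
  for (i in seq_len(n_chunks)) {
    # Stand-in for "run SS on this chunk": just write the rows out.
    files[i] <- file.path(out_dir, sprintf("chunk_%02d.csv", i))
    write.csv(posteriors[idx[[i]], , drop = FALSE], files[i],
              row.names = FALSE)
  }
  # Merging flat files really is trivial: read and row-bind in order.
  do.call(rbind, lapply(files, read.csv))
}

post <- data.frame(iter = 1:10, SSB_2020 = rnorm(10))
merged <- process_chunks(post, n_chunks = 3)
```

As noted above, it's only non-flat output (like the nested Report.sso sections) where the merge gets harder.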
On Sean's advice I used the furrr::future_map() function with future::plan("multisession") on Windows to get this to work. The extra-mcmc runs now take only 5 minutes for 2,000 samples as opposed to 35 minutes (with 7 cores).
I'll also parallelize the forecasting, catch-levels, and retrospective routines. I need to clean up the code before pushing but I wanted you all to know I think we can keep this stuff within the hake assessment project for now.
Following up on my earlier statements about fork(): it is true that you cannot spawn child processes in Windows. This method using the future package spawns new R sessions on the machine, which are not children of the original R session, so it is allowed. It makes it more difficult to exchange data between them, but it works well for something like this, where you are running an executable that generates output files.
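For anyone landing here later, the shape of the future-based approach is roughly this. It requires the furrr and future packages (which the workflow above really uses); the trivial run_chunk() function is a stand-in for the real SS call:

```r
library(future)
library(furrr)

# plan(multisession) launches separate background R sessions, which
# works on Windows because no fork() is involved; future_map() then
# farms the chunks out to those sessions.
plan(multisession, workers = 2)

# Stand-in for "copy the model files to a temp dir and run ss -mceval";
# a real version would call system2() and read the output files back.
run_chunk <- function(i) {
  i^2
}

results <- future_map(1:4, run_chunk)

plan(sequential)  # shut the worker sessions down when done
```

Because the workers are independent R sessions, anything run_chunk() needs must be passed in or re-created inside it, which is exactly why the executable-plus-output-files pattern fits so well.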
Update for 2022: a new version of Stock Synthesis will be available that stores all of the Report.sso files from MCMC sampling in a directory. This will be available for testing in December 2022.
Check with Rick to see if the derived quantity q can be reported to Report.sso.
@kellijohnson-NOAA The ss_output.par files are not used, so no need for those. The CompReport.sso files contain the Composition Database, which we need for the MCMC at-age tables. If that could appear in the Report file then we wouldn't need the CompReport files.
Catchability is coming from the report file (INDEX_2), not from the par file.
The executable to test is located on the drive here. Please test as soon as possible so I can get feedback to Rick.
I have compiled and run SS at this commit:
https://github.com/nmfs-stock-synthesis/stock-synthesis/commit/c050381406d6d588d1cb5fa4ae09bc9d37cabe79
and all the posteriors.sso, derived_posteriors.sso, etc. are now populated, and all of the "extra-mcmc" files are there.
The posteriors.sso, derived_posteriors.sso, etc. files are still in the sso folder, though, when they should be outside it in its parent directory for reproducibility. I have no doubt that will break a lot of people's code.
I still have to build the doc with the new output before saying it's fine but this is a step forward.
@cgrandin you can also opt to put all of the Report_ files in the main folder by not creating an sso folder. The thought behind the sso folder is that people can delete every output file with one easy delete command. So, I think it might be easier for us to expect all output files to be in the original folder rather than creating sso?
OK, that's fine. I was just thinking anyone who uses SS (if this is merged into main) may not expect that to happen, and it could break their reading-in code. We can totally work with it for hake, though. I did make a comment on the SS GitHub issue about it; hope that was OK to do.
The MCMC results from last year's assessment model, which used SS version (#V3.30.16.03;_2020_10_19), compared to the same model using SS version (#V3.30.xx.yy;_safe;_compile_date:_Dec 23 2021), produce very similar (nearly exact) estimates and derived quantities.
The only large difference was in the total likelihood. The contribution from Age_comps is considerably different (see inserted snippet), with the latest SS version (not released yet) being considerably greater (right side of the insert). Is this related to a recent change in the Dirichlet-multinomial likelihood?
Working well enough for 2022; potentially refine in 2023 with smaller .rds files and better file management via a reduced Report.sso file using options in the starter.ss file.
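On the smaller-.rds front, two simple levers are dropping unused elements before saving and using a stronger compression setting. A sketch with a toy report object (the element names here are made up; the real savings depend on what downstream code actually reads):

```r
# Keep only the elements of a (toy) report list that downstream code
# uses, then save with xz compression, which usually beats the default
# gzip for large numeric output at the cost of slower writes.
slim_and_save <- function(report, keep, file) {
  slim <- report[intersect(keep, names(report))]
  saveRDS(slim, file = file, compress = "xz")
  slim
}

report <- list(INDEX_2          = data.frame(q = runif(5)),
               natage           = matrix(rnorm(20), nrow = 4),
               unused_big_block = rnorm(1e4))

f <- file.path(tempdir(), "report_slim.rds")
slim <- slim_and_save(report, keep = c("INDEX_2", "natage"), file = f)
```

A reduced Report.sso via the starter.ss options would shrink what gets read in the first place; this only shrinks what gets kept.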
Some of the parameters are not reported in the posteriors.sso file, so lots of "extra" work is done for the assessment that takes quite a bit of time. Some of this extra work could be moved into the SS framework to save post-processing time. @iantaylor-NOAA, it was suggested I contact you first to see if you had made any previous progress on this and what can actually be moved into SS.