pacific-hake / hake-assessment

:zap: :fish: Build the assessment document using latex and knitr

extra-mcmc -- move some burden to SS3 #656

Closed · kellijohnson-NOAA closed this 2 years ago

kellijohnson-NOAA commented 4 years ago

Some of the parameters are not reported in the posteriors.sso file, so a lot of "extra" work is done for the assessment that takes quite a bit of time. Some of this extra work could be moved into the SS framework to save post-processing time. @iantaylor-NOAA, it was suggested that I contact you first to see whether you had made any previous progress on this and what can actually be moved to SS.

aaronmberger-nwfsc commented 4 years ago

I agree, and this is good to revisit later. I wonder if we couldn't just have a list of parameter or derived-quantity names that then populates either keyposteriors or nuisanceposteriors or similar. Good idea. This will be particularly beneficial when moving to all (or most) MCMC runs when using adnuts.

kellijohnson-NOAA commented 4 years ago

You know, it might be as simple as adding these parameters to the extra reporting section of the control file, but I am not sure.

cgrandin commented 4 years ago

This is big on the wish list. The code for running and maintaining the extra-mcmc is large and awkward; it would be nice to do away with it.

aaronmberger-nwfsc commented 4 years ago

@Cole-Monnahan-NOAA is also interested in this workflow as it relates to applied use of adnuts, so we should keep him in the loop also

Cole-Monnahan-NOAA commented 4 years ago

@aaronmberger-nwfsc Thanks for looping me in here. Let's have a more formal discussion later. One quick note: as I understand it, you take a row from the .psv file and run it in "MLE" mode to get a Report.sso file. You're accounting for the bias-correction difference between MLE/MCMC modes, right? If not, that would lead to the wrong predictions.
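For anyone following along, a minimal sketch of reading the .psv file in R is below. The binary layout (a 4-byte integer giving the number of parameters, then the saved draws as 8-byte doubles, row-major) is my understanding of the ADMB format, so verify against your own files.

```r
# Sketch: read all saved posterior draws from an ADMB .psv file
con <- file("ss.psv", "rb")
npar <- readBin(con, what = "integer", n = 1, size = 4)
n_doubles <- (file.size("ss.psv") - 4) / 8   # remaining bytes are doubles
vals <- readBin(con, what = "numeric", n = n_doubles)
close(con)
samples <- matrix(vals, ncol = npar, byrow = TRUE)
draw <- samples[1, ]   # e.g., the first saved posterior draw
```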

iantaylor-NOAA commented 4 years ago

I had a conversation with Teresa A'mar (@amart) about this years ago and we discussed two alternative approaches to building in the calculation of additional quantities from MCMC into SS.

1. Adding all the needed output to files that get written automatically in the -mceval phase. The problem here is that it's easy to think of additional things that you might want to get from the MCMC output. Right now, I think the list includes

Which adds up to a few thousand quantities. I don't think it's good to include these in the extra reporting section of the control file (I think index fits and numbers-at-age for an individual year are already there), because those quantities become sdreport variables and slow down calculations after inverting the Hessian in MLE mode. Nevertheless, it seems feasible to add an option for the list above to be reported in posteriors_vectors.sso or in a new output file created during -mceval.

2. Adding a switch to automatically generate the report file for each MCMC sample. This probably would not speed things up, but it would save the extra steps in the hake workflow of mapping the posteriors.sso rows into ss.par and then iteratively re-running the model and renaming the output. This was the path we preferred at the time, and for a while the starter file listed these options:

0 # MCMC output detail (0=default; 1=obj func components; 2=expanded; 3=make output subdir for each MCMC vector)

But only options 0 and 1 ever got implemented, so the list has since been reduced to reflect that.
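For context, the per-sample re-run workflow described above looks roughly like the sketch below. This is not the actual hake code: write_row_to_par() is a hypothetical helper that maps one posteriors.sso row into ss.par, and the ADMB flags shown (-maxfn 0 to skip optimization, -phase 50 to jump past the estimation phases) are illustrative of how a single evaluation is forced.

```r
# Sketch of the current per-sample workflow (option 2 would replace this)
posts <- read.table("posteriors.sso", header = TRUE)
for (i in seq_len(nrow(posts))) {
  write_row_to_par(posts[i, ], "ss.par")   # hypothetical helper
  # evaluate once at the supplied parameter vector, no re-optimization
  system("ss -maxfn 0 -phase 50 -nohess")
  file.rename("Report.sso", sprintf("Report_%04d.sso", i))
}
```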

iantaylor-NOAA commented 4 years ago

@Cole-Monnahan-NOAA regarding bias adjustment, it looks like that is indeed sorted out here: https://github.com/pacific-hake/hake-assessment/blob/master/R/extra-mcmc.R#L57-L69

cgrandin commented 4 years ago

Another option would be a command-line switch to include Q in the output during the -mceval phase. This should be fairly straightforward to implement in the TPL (no dealing with control-file parsing, etc.).

kellijohnson-NOAA commented 4 years ago

@iantaylor-NOAA - I am not sure the bias adjustment is actually fully dealt with, because when you look at the resulting likelihood values in the MLE context from each run, there are non-zero values for the difference in likelihoods between Recruitment and NoBias_corr_Recruitment(info_only).

kellijohnson-NOAA commented 4 years ago

This topic is currently being discussed on the SS developers vlab website under #75754 (referencing this for personal traceability in the future; the forum is private).

iantaylor-NOAA commented 4 years ago

@kellijohnson-NOAA, regarding bias adjustment, I just compared the likelihood equation from Methot and Taylor (2011) [equation image: the recruitment likelihood with bias-adjustment terms b_y ln(sigmaR)] to the code in the SS TPL file that calculates NoBias_corr_Recruitment(info_only):

      // sd_offset_rec is (I believe) the sum of the bias-adjustment
      // fractions b_y; this swaps that offset for the full count of
      // estimated recruitment deviations
      noBias_recr_like=recr_like-sd_offset_rec*log(sigmaR)+(recdev_end-recdev_first+1.)*log(sigmaR);

It looks to me like NoBias_corr_Recruitment removes the ln(sigmaR) term from the likelihood entirely, whereas in MCMC that term is included and the b_y values are all equal to 1.0.
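Rearranging that line of code (and assuming, as the variable name suggests, that sd_offset_rec is the sum of the bias-adjustment fractions $b_y$ and that $N$ is the number of estimated recruitment deviations) gives

$$
\texttt{noBias\_recr\_like} = \texttt{recr\_like} + \Big(N - \sum_y b_y\Big)\,\ln \sigma_R ,
$$

so the two quantities coincide only when every $b_y = 1$, which would be consistent with the non-zero differences @kellijohnson-NOAA observed whenever the MLE-mode bias ramp is active.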

I should note that I've never actually used the NoBias_corr_Recruitment(info_only) value, but I believe that it is for Jim Thorson's Laplace approximation calculations as discussed in https://github.com/r4ss/r4ss/issues/44.

kellijohnson-NOAA commented 4 years ago

@cgrandin clocked 35 minutes for 2,000 samples; now we are trying system('xcopy ...') to move files, to see if that is faster than file.copy().
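For reference, the comparison is as simple as timing the two approaches (Windows-only; the paths and xcopy switches are illustrative):

```r
# Compare base-R copying against shelling out to xcopy on Windows
files <- list.files("mcmc", full.names = TRUE)
system.time(file.copy(files, "dest", overwrite = TRUE))
system.time(system('xcopy mcmc dest /E /Q /Y'))
```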

cgrandin commented 4 years ago

xcopy was slower, so now I am trying to split the posteriors up based on the number of cores and then use the start command in batch files to run the MLEs in parallel processes. Within R on Windows you cannot run outside processes in parallel, so this is kind of a hack, but if it works it should speed things up quite a lot.

kellijohnson-NOAA commented 4 years ago

Within ss3sim we run SS simultaneously using multiple cores.


cgrandin commented 4 years ago

Yeah, you can do it from inside a program (i.e., when the parallelism is compiled into the executable); you just can't run already-compiled executables in parallel from R on Windows. There is no fork() in the Windows OS, which is seen as a big shortcoming. We should all be using Linux/Mac.

Cole-Monnahan-NOAA commented 4 years ago

I don't think this is strictly true. adnuts accomplishes this with snowfall by opening new R terminals, copying the ss.exe into temporary folders and running them in parallel (here and here for the curious). There's nothing special about it. Rstan does the same thing.

It should be easy to recreate this on the processing end, I think: just chunk up the .psv, split it off into separate SS calls, read in the files in each process, then merge them. It's that last merging step that could be complicated, depending on the output file structure, but it's trivial for flat files.

cgrandin commented 4 years ago

On Sean's advice, I used furrr::future_map() with future::plan("multisession") on Windows to get this to work.
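For those curious, the pattern is roughly as follows (a sketch, not the actual hake code; the directory layout and SS flags are assumptions):

```r
library(furrr)
future::plan("multisession", workers = 7)

# Each chunk directory holds a copy of the model inputs plus one
# slice of the posteriors; workers evaluate SS independently in each
chunk_dirs <- file.path("extra-mcmc", sprintf("chunk_%02d", 1:7))

future_map(chunk_dirs, function(d) {
  owd <- setwd(d)          # each worker runs in its own directory
  on.exit(setwd(owd))
  system("ss -maxfn 0 -phase 50 -nohess", ignore.stdout = TRUE)
})
```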

The extra-mcmc runs now take only 5 minutes for 2,000 samples, as opposed to 35 (with 7 cores).

I'll also parallelize the forecasting, catch-levels, and retrospective routines. I need to clean up the code before pushing, but I wanted you all to know that I think we can keep this stuff within the hake assessment project for now.

Following up on my earlier statements about fork(): it is true that you cannot fork child processes in Windows. This method using the future package spawns new R sessions on the machine, which are not children of the original R session, so it is allowed. That makes it more difficult to exchange data between them, but it works well for something like this, where you are running an executable that generates output files.

kellijohnson-NOAA commented 2 years ago

Update for 2022: a new version of Stock Synthesis will be available that stores all of the Report.sso files from MCMC sampling in a directory. This will be available for testing in December 2022.

kellijohnson-NOAA commented 2 years ago

Check with Rick to see if the derived quantity of q can be reported to Report.sso.

cgrandin commented 2 years ago

@kellijohnson-NOAA The ss_output.par files are not used, so no need for those. The CompReport.sso files contain the Composition Database, which we need for the MCMC at-age tables. If that could appear in the Report file, then we wouldn't need the CompReport files.

Catchability is coming from the report file (INDEX_2) and not from the par file.
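A sketch of pulling the INDEX_2 table out of a single Report.sso; the layout assumptions (a section header line, with sections separated by blank lines) are mine, so check them against an actual report file:

```r
# Locate the INDEX_2 section in Report.sso and parse it as a table
rep_lines <- readLines("Report.sso")
start <- grep("^INDEX_2", rep_lines)[1]
blanks <- which(rep_lines == "")            # assumed section delimiters
end <- min(blanks[blanks > start]) - 1
index2 <- read.table(text = rep_lines[(start + 1):end],
                     header = TRUE, fill = TRUE)
```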

kellijohnson-NOAA commented 2 years ago

The executable to test is located on the drive here. Please test as soon as possible so I can get feedback to Rick.

cgrandin commented 2 years ago

I have compiled and run SS at this commit: https://github.com/nmfs-stock-synthesis/stock-synthesis/commit/c050381406d6d588d1cb5fa4ae09bc9d37cabe79, and the posteriors.sso, derived_posteriors.sso, etc. are now populated, and all the "extra-mcmc" files are there.

The posteriors.sso, derived_posteriors.sso, etc. files are still in the sso folder, though, when they should be in its parent directory for reproducibility. I have no doubt that will break a lot of people's code.

I still have to build the doc with the new output before saying it's fine, but this is a step forward.

kellijohnson-NOAA commented 2 years ago

@cgrandin you can also opt to put all of the Report_ files in the main folder by not creating an sso folder. The idea behind the sso folder is that people can delete every output file with one easy delete command. So I think it might be easier for us to expect all output files to be in the original folder rather than creating sso?

cgrandin commented 2 years ago

OK, that's fine. I was just thinking that anyone who uses SS (if this is merged into main) may not expect that to happen, and it could break their reading-in code. We can totally work with it for hake, though. I did make a comment on the SS GitHub issue about it; hope that was OK to do.

aaronmberger-nwfsc commented 2 years ago

The MCMC results from last year's assessment model, which used SS version (#V3.30.16.03;_2020_10_19), compared to the same model using SS version (#V3.30.xx.yy;_safe;_compile_date:_Dec 23 2021), produce very similar (near-exact) estimates and derived quantities.

The only large difference was in the total likelihood. The contribution from Age_comps is considerably different (see the inserted snippet), with the latest SS version (not yet released) being considerably greater (right side of the insert). Is this related to a recent change in the Dirichlet-multinomial likelihood?

[image: side-by-side likelihood components from the two SS versions; the Age_comps rows differ]

kellijohnson-NOAA commented 2 years ago

Working well enough for 2022; we could potentially refine this in 2023 with smaller .rds files and better file management via a reduced Report.sso file, using options in the starter.ss file.
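One way the "smaller .rds files" idea could look (a sketch only; report_list and the table names kept are hypothetical):

```r
# Keep only the tables the document actually uses before serializing
keep <- c("INDEX_2", "natage", "likelihoods_used")    # hypothetical names
slim <- lapply(report_list, function(r) r[keep])      # report_list assumed
saveRDS(slim, "extra-mcmc.rds", compress = "xz")
```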