hmsc-r / HMSC

GNU General Public License v3.0

speeding up cross-validation #132

Open stephanJG opened 2 years ago

stephanJG commented 2 years ago

Hi, I am wondering whether there is an option to speed up cross-validation. Although I have access to an HPC, it still takes very long, which (as I understand it) is because each chain is bound to one core.

Is it possible to split the cross-validation? For a 4-fold cross-validation I tried replacing folds 2, 3, and 4 in the partition returned by createPartition with NA, hoping that computePredictedValues would then estimate only fold 1. But this was not accepted:

Error in matrix(NA, sum(train), hM$nr) : 
  invalid 'nrow' value (too large or NA)

If this worked, I could do it for each fold separately (with separate jobs on the HPC) and combine the cross-validated measures of fit afterwards. I guess one thing that would work is to replace the NA with another number, so that I have 2 uneven folds, do that for each fold in turn, and then use the 4 smaller folds to assemble a 4-fold cross-validation. Best, Jörg
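The workaround described above (recode the partition into two uneven folds per job, then keep only the held-out rows from each job) can be sketched in base R. This is only an illustration under stated assumptions: the helper names `two_fold_partition` and `combine_fold_predictions` are hypothetical, the prediction arrays are assumed to have the layout sampling units × species × posterior samples, and stand-in arrays are used in place of real Hmsc output.

```r
# Base-R sketch; no Hmsc functions are called here.

# Recode a k-fold partition so that only `fold` is held out (as fold 1);
# all other sampling units are placed in fold 2.
two_fold_partition <- function(partition, fold) {
  ifelse(partition == fold, 1L, 2L)
}

# Assemble one cross-validated prediction array from per-fold job results.
# pred_list[[k]] is assumed to be the array (units x species x samples)
# produced by the job that held out fold k; only its rows for fold k are kept.
combine_fold_predictions <- function(pred_list, partition) {
  combined <- array(NA_real_, dim = dim(pred_list[[1]]))
  for (k in seq_along(pred_list)) {
    rows <- partition == k
    combined[rows, , ] <- pred_list[[k]][rows, , , drop = FALSE]
  }
  combined
}

# Toy demonstration with stand-in arrays instead of real predictions.
partition <- rep(1:4, times = 3)   # 12 sampling units in 4 folds
pred_list <- lapply(1:4, function(k) array(k, dim = c(12, 2, 5)))
combined  <- combine_fold_predictions(pred_list, partition)
```

In an actual run, each HPC job would presumably pass its recoded partition to computePredictedValues(hM, partition = ...) and save the result, and the combined array would then go to evaluateModelFit. Note that, as far as I understand, each job would also fit the reverse split (training on the small fold), so part of the work per job is discarded.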

jarioksa commented 2 years ago

There is work in progress on more aggressive parallelization: k chains and n folds can be run in k × n parallel processes (if your hardware allows). This is not yet implemented for species cross-validation, which becomes slow with mcmcStep. Extending it to species cross-validation is an easy-ish task, but it needs some thought to choose between two alternative ways of doing it.
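To make the k × n arithmetic concrete, here is a small base-R sketch that enumerates the independent (chain, fold) tasks. The sizes are made up for illustration, and this says nothing about how the experimental branch actually schedules its processes.

```r
# Hypothetical sizes, chosen only for illustration.
n_chains <- 2
n_folds  <- 4

# One row per independent (chain, fold) task.
tasks   <- expand.grid(chain = seq_len(n_chains), fold = seq_len(n_folds))
n_tasks <- nrow(tasks)   # k * n tasks in total

# Rounds needed if only n_cores processes can run at the same time.
n_cores  <- 3
n_rounds <- ceiling(n_tasks / n_cores)
```

With enough cores for all tasks, the wall-clock time would be roughly that of a single chain on a single fold; with fewer cores, the tasks run in several rounds.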

The experimental version is in a separate branch, parallel-CV-2 (number one was never public). It implements parallel processing in an alternative function, pcomputePredictedValues; the old, untouched version is still available and will not vanish if you try the new one (which also allows comparison of results). Install it with devtools::install_github("hmsc-r/HMSC", ref="parallel-CV-2").

There are caveats:

There are other tricks that we can try, but these need more experimentation. One problem is that we really do re-sample the original model at full scale for each fold, and each fold takes about the same time as the original sampleMcmc. There may be shortcuts to make this quicker, but this is something we need to discuss among developers.