marjoleinF / pre

an R package for deriving Prediction Rule Ensembles

subsample outside of loop #15

Open boennecd opened 6 years ago

boennecd commented 6 years ago

This https://github.com/marjoleinF/pre/blob/e156fbe9969df778781081fc69c4da23bf368dab/R/pre.R#L949-L963

is a bad idea as it scales very poorly. E.g., suppose there are 200000 rows, we want to sub-sample to half the size, and we want 1000 trees. Then drawing all the sub-samples up front requires 10^5 × 10^3 × 4 bytes = 400 megabytes of RAM (10^5 indices per tree, 10^3 trees, 4 bytes per integer index). I do see that the reason is the foreach call later https://github.com/marjoleinF/pre/blob/e156fbe9969df778781081fc69c4da23bf368dab/R/pre.R#L998
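The estimate above can be checked with a few lines of R. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it assumes the sub-samples are stored as integer row indices (4 bytes each), which is what the arithmetic above implies.

```r
# Illustrative numbers from the example above (names are hypothetical)
n_rows        <- 200000L  # rows in the training data
frac          <- 0.5      # sub-sample fraction
n_trees       <- 1000L    # number of trees
bytes_per_int <- 4        # R stores an integer in 4 bytes

n_sub <- as.integer(n_rows * frac)  # indices kept per tree

# Memory needed when the indices for all trees are drawn up front
total_mb <- as.numeric(n_sub) * n_trees * bytes_per_int / 1e6
total_mb

# Memory needed when a single sub-sample is drawn inside the loop
one_sample <- sample.int(n_rows, n_sub)
single_mb  <- as.numeric(object.size(one_sample)) / 1e6
single_mb
```

Drawing the indices inside the loop keeps only one sub-sample per worker alive at a time, so the footprint no longer grows with the number of trees.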

However, one may still be able to get reproducible results if clusterSetRNGStream is used with foreach. I have not used the foreach package much, though, and this requires that the loop iterations are split across the threads in the same way on every run. An alternative is to replace the foreach with parSapply, which I know will work.
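A minimal sketch of the parSapply alternative, assuming a socket cluster from the parallel package. `fit_one` and `run_once` are hypothetical stand-ins and not functions from pre; the real loop body would fit a tree on the sub-sample instead of returning a placeholder value.

```r
library(parallel)

# Hypothetical stand-in for fitting one tree. The sub-sample is drawn
# inside the worker, so indices for all trees are never held at once.
fit_one <- function(i, n_rows, n_sub) {
  idx <- sample.int(n_rows, n_sub)
  sum(idx)  # placeholder for "fit a tree on rows idx"
}

# Hypothetical helper: run the whole loop on a fresh cluster
run_once <- function(seed, n_trees = 8L, n_rows = 1000L, n_sub = 500L) {
  cl <- makeCluster(2L)
  on.exit(stopCluster(cl))
  # Give each worker its own L'Ecuyer-CMRG random number stream
  clusterSetRNGStream(cl, iseed = seed)
  parSapply(cl, seq_len(n_trees), fit_one, n_rows = n_rows, n_sub = n_sub)
}

res1 <- run_once(42L)
res2 <- run_once(42L)
identical(res1, res2)  # same seed and same number of workers
```

parSapply splits the iterations into fixed chunks, one per worker, so with the same seed and the same number of workers each iteration should see the same random numbers on every run. Changing the number of workers changes the split and therefore the results.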