Open hideaki opened 4 years ago
That's a side effect of having both a pure C++ version and an R version. We don't use the R RNG; instead, we seed the mt19937_64 with a random number generated in R. That number is the same on Mac and Linux/Windows, but the mt19937_64 doesn't behave the same.
A possible solution is to encapsulate the random number generator and use the R RNG via Rcpp in the R version. Other solutions are very welcome!
Thank you for the quick answer! It would be great if the R version of ranger called the R RNG via Rcpp, making the results reproducible across platforms!
A different RNG did not solve the issue (see #688). From what I've read, the problem might be with std::shuffle and not the RNG. What I think we could do:
- boost::random (another dependency, but probably solves the issue and seems not too hard to do)
- std::sample (not sure whether that's enough)

Thank you for providing this package! To piggyback off this issue, we believe we are running into a similar issue in our project. In our case, we are tuning the complexity of the random forest by choosing the max.depth and mtry hyperparameters via 10-fold cross-validation, minimizing MSE loss. Here is a reproducible example:
Below are some plots of MSE vs. max.depth for various values of mtry. As noted in the previous comments, using different operating system/seed combinations seems to give different results:
We are curious in particular why it would be that (i) the seed/system affects the shape of the plot in a way that seems very fundamental (not just jittering it) and (ii) in some cases, the results do not quite conform to the usual tendency for fit to initially improve with complexity and then eventually worsen (e.g. the orange series for bagsize=3).
Any additional insight would be appreciated!
Not sure these are systematic differences. You are setting too many seeds and that, I think, leads to these smooth-looking line plots. If I run your code without any seeds, none of the patterns in the plot above is visible.
In summary, I think those are simple random differences that look systematic because you set seeds in a loop (rarely a good idea).
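The point about seeds set inside a loop can be illustrated with a toy model in C++. This is only a sketch under strong assumptions: `trend`, `noisy_error`, and `error_curve` are hypothetical names, the "error" is a made-up trend plus one Gaussian draw, and nothing here calls ranger. When the engine is reseeded before every parameter value, every point on the curve receives the same noise draw, so the curve looks smooth and simply shifts when the seed changes.

```cpp
#include <random>
#include <vector>

// Hypothetical noise-free trend of the error as a function of tree depth.
double trend(double depth) {
  return (depth - 5.0) * (depth - 5.0) / 25.0;
}

// Toy stand-in for one cross-validated error estimate: trend plus one noise draw.
double noisy_error(double depth, std::mt19937_64& engine) {
  std::normal_distribution<double> noise(0.0, 1.0);  // fresh distribution, no cached state
  return trend(depth) + noise(engine);
}

// Evaluate the toy error for depths 1..10.
// If reseed_each_step is true, the engine is reset to the same seed before
// every evaluation, which mimics calling set.seed() inside the tuning loop.
std::vector<double> error_curve(unsigned seed, bool reseed_each_step) {
  std::mt19937_64 engine(seed);
  std::vector<double> curve;
  for (int depth = 1; depth <= 10; ++depth) {
    if (reseed_each_step) {
      engine.seed(seed);
    }
    curve.push_back(noisy_error(static_cast<double>(depth), engine));
  }
  return curve;
}
```

In this toy model the noise is shared exactly across parameter values. In a real random forest the parameters change how the random stream is consumed, so the coupling would presumably be only partial, but the qualitative effect (smooth-looking curves whose shape depends on the seed) is the same kind of artifact.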
@mnwright thanks for your reply!
If I'm parsing the example code correctly, it's using a single set of folds to calculate MSE with various parameters (b,d).
For each parameter, it resets the seed so that only the parameters, not the seed, are varying across elements of the loop.
Can you say a little more about the sense in which this is "setting too many seeds"? Thanks!
What I mean is: if you run the same simulation with different seeds on the same platform, you'll get a picture similar to the one above:
Running with the same seed on Mac and Windows is the same as running with two different seeds on the same platform. And that is (unfortunately) expected behavior.
@mnwright understood, thanks. We wanted to confirm that there is no platform difference other than seed behavior. Sounds like that is your understanding. We appreciate the clarification.
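For readers wondering how the same seed can behave like a different seed on another platform: the C++ standard fixes the output sequence of std::mt19937_64 for a given seed, but it does not fix the algorithm used by std::shuffle or by distributions such as std::uniform_int_distribution, so standard library implementations may consume the same engine output differently. This matches the earlier suspicion in the thread that std::shuffle, not the RNG, is the cause. Below is a minimal sketch of a shuffle that depends only on the raw engine output. The names `bounded_draw`, `portable_shuffle`, and `shuffled_indices` are hypothetical, and this is not a description of how ranger works or was changed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: draw an integer in [0, n) from raw 64-bit engine output.
// Rejection sampling avoids modulo bias; no std distribution is involved.
std::uint64_t bounded_draw(std::mt19937_64& engine, std::uint64_t n) {
  assert(n > 0);
  const std::uint64_t max = std::mt19937_64::max();  // 2^64 - 1
  const std::uint64_t limit = max - (max % n);       // a multiple of n
  std::uint64_t x;
  do {
    x = engine();
  } while (x >= limit);  // reject draws outside [0, limit)
  return x % n;
}

// Hand-written Fisher-Yates shuffle that uses only bounded_draw.
template <typename T>
void portable_shuffle(std::vector<T>& v, std::mt19937_64& engine) {
  for (std::size_t i = v.size(); i > 1; --i) {
    const std::size_t j = static_cast<std::size_t>(bounded_draw(engine, i));
    std::swap(v[i - 1], v[j]);
  }
}

// Convenience wrapper: a shuffled index vector 0..n-1 for a given seed.
std::vector<std::size_t> shuffled_indices(std::size_t n, std::uint64_t seed) {
  std::vector<std::size_t> idx(n);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::mt19937_64 engine(seed);
  portable_shuffle(idx, engine);
  return idx;
}
```

Because every step is ordinary integer arithmetic on the engine's specified output, a call such as `shuffled_indices(10, 42)` should give the same permutation with any conforming standard library. Whether replacing std::shuffle alone would be enough to make ranger's results match across platforms is not established in this thread.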
Thank you for creating this wonderful package. I tried the following example on Mac and Windows, and it seems that the results are different. I was expecting the same results since the same seed was set, but is this difference expected?
Script to reproduce:
Mac result:
Windows result: