imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Results are different between Mac and Windows #533

Open hideaki opened 4 years ago

hideaki commented 4 years ago

Thank you for creating this wonderful package. I tried the following example on Mac and Windows, and it seems that the results are different. I was expecting the same results since the same seed was set, but is this difference expected?

Script to reproduce:

library(ranger)
set.seed(1)
c <- ranger(Species ~ ., data = iris, seed=NULL, num.trees = 1, importance = "impurity", num.threads = 1)
print(c)
c$variable.importance

Mac result:

> print(c)
Ranger result

Call:
 ranger(Species ~ ., data = iris, seed = NULL, num.trees = 1,      importance = "impurity", num.threads = 1) 

Type:                             Classification 
Number of trees:                  1 
Sample size:                      150 
Number of independent variables:  4 
Mtry:                             2 
Target node size:                 1 
Variable importance mode:         impurity 
Splitrule:                        gini 
OOB prediction error:             10.77 % 
> c$variable.importance
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   20.405419     0.000000    75.928635     1.945946 
> 

Windows result:

> print(c)
Ranger result

Call:
 ranger(Species ~ ., data = iris, seed = NULL, num.trees = 1,      importance = "impurity", num.threads = 1) 

Type:                             Classification 
Number of trees:                  1 
Sample size:                      150 
Number of independent variables:  4 
Mtry:                             2 
Target node size:                 1 
Variable importance mode:         impurity 
Splitrule:                        gini 
OOB prediction error:             7.69 % 
> c$variable.importance
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    2.641367     0.000000     5.873265    91.365368 
mnwright commented 4 years ago

That's a side effect of having both a pure C++ version and an R version. We don't use the R RNG; instead we seed the mt19937_64 with a random number generated in R. That number is the same on Mac and Linux/Windows, but the mt19937_64 doesn't behave the same.
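As a quick illustration of why the R-side number cannot be the culprit, here is a minimal check (not ranger's internal code, just the general idea) that a seed drawn from the R RNG is itself portable:

```R
# With the same set.seed(), R draws the same integer on Mac, Linux, and Windows,
# because the R RNG stream is platform-independent. ranger seeds its internal
# std::mt19937_64 with a number generated this way, so the divergence has to
# come from the C++ code that consumes that generator.
set.seed(1)
sample.int(.Machine$integer.max, 1)
```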

A possible solution is to encapsulate the random number generator and use the R RNG via Rcpp in the R version. Other solutions are very welcome!
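Purely as a sketch of that direction (this is not ranger code, and r_rng_draws is a made-up example function), Rcpp can pull numbers from the R RNG inside C++, so that set.seed() on the R side controls the stream portably:

```R
library(Rcpp)

# Hypothetical illustration: draw from the R RNG (R::runif) inside C++ instead
# of from a separately seeded std::mt19937_64. The wrapper generated by Rcpp
# manages the R RNG state, so set.seed() makes the output reproducible, and the
# R RNG stream is the same on every platform.
cppFunction("
NumericVector r_rng_draws(int n) {
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    out[i] = R::runif(0.0, 1.0);
  }
  return out;
}
")

set.seed(1)
r_rng_draws(3)
```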

hideaki commented 4 years ago

Thank you for the quick answer! It would be great if the R version of ranger called the R RNG via Rcpp, making the results reproducible across platforms!

mnwright commented 1 year ago

A different RNG did not solve the issue (see #688). From what I've read, the problem might be with std::shuffle and not the RNG. What I think we could do:

ew487 commented 8 months ago

Thank you for providing this package!

To piggyback on this issue, we believe we are running into a similar problem in our project. In our case, we are tuning the complexity of the random forest by choosing the max.depth and mtry hyperparameters via 10-fold cross-validation, minimizing MSE loss. Here is a reproducible example:

Code for reproducible example:

```R
library(ranger)
library(tidyverse)

main <- function() {
  max_depth <- 40
  num_folds <- 10
  df <- generate_data(num_folds)
  yvar <- 'y'
  xvars <- colnames(df)[grepl("x", colnames(df))]
  set.seed(2)
  seed <- 66
  cv_df <- mse_cv_rf(df, xvars, yvar, max_depth, num_folds, seed)
  cv_df <- cv_df[order(cv_df$loss),] %>% as.data.frame()
  opt_rfparams <- unlist(cv_df[1,])
  print(opt_rfparams)
  system <- get_os()
  write.csv(cv_df, file = sprintf("cv_df_%s_seed%s.csv", system, seed), row.names = FALSE)
  if (system == "windows") {
    seed <- 67
    cv_df <- mse_cv_rf(df, xvars, yvar, max_depth, num_folds, seed)
    cv_df <- cv_df[order(cv_df$loss),] %>% as.data.frame()
    opt_rfparams <- unlist(cv_df[1,])
    print(opt_rfparams)
    system <- get_os()
    write.csv(cv_df, file = sprintf("cv_df_%s_seed%s.csv", system, seed), row.names = FALSE)
  }
}

generate_data <- function(num_folds) {
  n <- 100
  set.seed(1)
  df <- data.frame(rownum = seq(1, length.out = n)) %>%
    mutate(x = diffinv(rnorm(n - 1)),
           y = ceiling(runif(n, 0, 3)) / 3,
           wt = 1,
           d_x = x - dplyr::lag(x, n = 1))
  xvars <- c('x', 'd_x')
  for (l in seq(1, 5)) {
    lagvar <- 'd_x'
    new_var <- paste(lagvar, '_lag', l, sep = '')
    df[[new_var]] <- dplyr::lag(df[[lagvar]], n = l)
    xvars <- c(xvars, new_var)
    new_var <- paste(lagvar, '_slag', l, sep = '')
    df[[new_var]] <- dplyr::lag(df[[lagvar]], n = l * 12)
    xvars <- c(xvars, new_var)
  }
  df <- df %>%
    filter(complete.cases(.)) %>%
    mutate(fold = floor(seq(1, n()) / n() * num_folds) + 1)
  return(df)
}

mse_cv_rf <- function(df, xvars, yvar, max_depth, num_folds, seed) {
  bag_grid <- 1:length(xvars)
  depth_grid <- 1:max_depth
  cv_df <- merge(data.frame(depth = depth_grid), data.frame(bagsize = bag_grid), on = NULL)
  rfform <- sprintf('%s ~ %s', yvar, paste(xvars, collapse = '+'))
  for (idx in 1:nrow(cv_df)) {
    b <- cv_df[idx, 'bagsize']
    d <- cv_df[idx, 'depth']
    mse_list <- c()
    for (k in seq(1, num_folds)) {
      traindf <- df[df$fold != k,]
      valdf <- df[df$fold == k,]
      W_train <- traindf[['wt']]
      W_test <- valdf[['wt']]
      rf_obj <- ranger(rfform, data = traindf, case.weights = W_train,
                       max.depth = d, mtry = b, min.node.size = 1,
                       num.trees = 100, seed = seed)
      Yhat <- predict(rf_obj, data = valdf)$predictions
      mse_k <- mean(W_test * (valdf[[yvar]] - Yhat)^2)
      mse_list <- c(mse_list, mse_k)
    }
    mse_rf <- mean(mse_list)
    cv_df[idx, 'loss'] <- mse_rf
  }
  return(cv_df)
}

get_os <- function() {
  if (.Platform$OS.type == "windows") {
    system <- "windows"
  } else if (Sys.info()["sysname"] == "Darwin") {
    system <- "mac"
  } else if (.Platform$OS.type == "unix") {
    system <- "unix"
  } else {
    system <- "unknown"
  }
  return(system)
}

main()
```

Below are some plots of MSE vs. max.depth for various values of mtry. As noted in the previous comments, using different operating system/seed combinations seems to give different results:

[Figure: cross-validated MSE vs. max.depth, faceted by bagsize (mtry), for WindowsSeed66, MacSeed66, and WindowsSeed67]

Code for plotting (after running the example code above on both Windows and Mac):

```R
library(ggplot2)
library(tidyverse)

df <- rbind(read.csv("cv_df_windows_seed66.csv") %>% mutate(label = "WindowsSeed66"),
            read.csv("cv_df_mac_seed66.csv") %>% mutate(label = "MacSeed66"),
            read.csv("cv_df_windows_seed67.csv") %>% mutate(label = "WindowsSeed67"))

plot <- ggplot(data = df, mapping = aes(x = depth, y = loss, color = label)) +
  geom_point() +
  scale_color_manual(values = c(WindowsSeed66 = "orange", WindowsSeed67 = "grey", MacSeed66 = "purple")) +
  facet_wrap(~ bagsize, ncol = 4, labeller = label_both)
plot
```

We are curious in particular why (i) the seed/system combination affects the shape of the plot in a way that seems quite fundamental (not just jittering it), and (ii) in some cases the results do not conform to the usual tendency for fit to first improve with complexity and then eventually worsen (e.g., the orange series for bagsize = 3).

Any additional insight would be appreciated!

mnwright commented 8 months ago

Not sure these are systematic differences. You are setting too many seeds and that, I think, leads to these smooth-looking line plots. If I run your code without any seeds, none of the patterns in the plot above is visible.

In summary, I think those are simple random differences that look systematic because you set seeds in a loop (rarely a good idea).
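For reference, here is one way to run that check, reusing generate_data() and mse_cv_rf() from the example above; a smaller depth grid is used just to keep the run short, and seed = NULL (ranger's default) makes ranger draw its seed from the R RNG on each call:

```R
# Re-run the cross-validation grid without fixing any ranger seeds; repeating
# this call a few times shows how much the loss surface moves from run to run
# due to Monte Carlo noise alone.
df <- generate_data(num_folds = 10)
xvars <- colnames(df)[grepl("x", colnames(df))]
cv_unseeded <- mse_cv_rf(df, xvars, yvar = "y", max_depth = 10,
                         num_folds = 10, seed = NULL)
head(cv_unseeded[order(cv_unseeded$loss), ])
```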

jmshapir commented 8 months ago

@mnwright thanks for your reply!

If I'm parsing the example code correctly, it's using a single set of folds to calculate MSE with various parameters (b,d).

For each parameter combination, it resets the seed so that only the parameters, not the seed, vary across iterations of the loop.

Can you say a little more about the sense in which this is "setting too many seeds"? Thanks!

mnwright commented 8 months ago

What I mean is: if you run the same simulation with different seeds on the same platform, you'll get a picture similar to the one above:

[Figure: the same CV loss plots, produced with different seeds on a single platform]

Running with the same seed on Mac and Windows is the same as running with two different seeds on the same platform. And that is (unfortunately) expected behavior.
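For instance, here is a sketch of that comparison on a single machine, again reusing generate_data() and mse_cv_rf() from the reproducible example above (with a smaller depth grid to keep the run short):

```R
library(ggplot2)
library(dplyr)

# Run the same CV grid with a few different explicit seeds on one platform and
# overlay the loss curves, to compare the between-seed spread with the
# Mac/Windows differences shown earlier.
df <- generate_data(num_folds = 10)
xvars <- colnames(df)[grepl("x", colnames(df))]

runs <- bind_rows(lapply(c(66, 67, 68), function(s) {
  cv <- mse_cv_rf(df, xvars, yvar = "y", max_depth = 10,
                  num_folds = 10, seed = s)
  cv$label <- paste0("Seed", s)
  cv
}))

ggplot(runs, aes(x = depth, y = loss, color = label)) +
  geom_point() +
  facet_wrap(~ bagsize, ncol = 4, labeller = label_both)
```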

jmshapir commented 8 months ago

@mnwright understood, thanks. We wanted to confirm that there is no platform difference other than seed behavior. Sounds like that is your understanding. We appreciate the clarification.