mayer79 / outForest

Outlier detection based on random forest models
https://mayer79.github.io/outForest/
GNU General Public License v2.0

Use of random seeds for reproducibility #3

Closed fanavarro closed 2 years ago

fanavarro commented 3 years ago

Hi, first of all, thanks for developing this package; it is a very interesting tool for outlier detection.

I am writing because I am experiencing some problems regarding the reproducibility of my results. I am not sure if I am using the package correctly or not.

Concretely, I'm using the outForest function as follows:

outliers = outForest(x, replace = "NA", seed = 12345)

Where x is my dataframe with my individuals and their variables, I replace the outliers by "NA", and I set the integer 12345 as a random seed for reproducibility.

Since I'm using the "seed" parameter, I expect to obtain the same results every time I call the function with those parameters. I am using RStudio and, when I click on "source" to execute the whole R script, I obtain the same results across executions. Nonetheless, if I execute the R script line by line, the results differ from what I obtained by clicking the "source" button.

So, I am not very sure if I have to do something else in addition to setting the "seed" parameter when I use the outForest function. Would it be necessary to use set.seed(12345)? Or could this be a bug?

Thanks beforehand.

mayer79 commented 3 years ago

Thanks for digging into outForest. If you set seed, the results should be identical.

I tested with the following example:

library(outForest)

out = outForest(iris, replace = "NA", seed = 12345, verbose = 0)
print(outliers(out))

I keep getting the same output, no matter whether clicking on "source" or simply running the code over and over again.

Maybe something happens before you call the function? If you post an example code where you see the problem, I can test again.
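A quick way to run this check without comparing printed output by eye is to call the function twice and compare the results with identical(). The sketch below only uses the calls already shown in this thread (outForest() and outliers()); the guard around it is an addition so the script still runs when the package is not installed.

```r
# Sketch: run the same seeded call twice and compare the detected outliers.
# same_result stays NA if outForest is not installed.
same_result <- NA

if (requireNamespace("outForest", quietly = TRUE)) {
  run1 <- outForest::outForest(iris, replace = "NA", seed = 12345, verbose = 0)
  run2 <- outForest::outForest(iris, replace = "NA", seed = 12345, verbose = 0)

  # Compare the outlier tables of both runs
  same_result <- identical(outForest::outliers(run1), outForest::outliers(run2))
}

print(same_result)
```

If this prints FALSE on a given machine, the two outlier tables can be inspected side by side to see where they diverge.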

fanavarro commented 3 years ago

Hi @mayer79, thanks for your quick answer! I've been testing my script and I'm getting the same results for the last executions. Thus, I guess something changed in my data and the seed parameter is working as expected. My apologies for opening this issue, and thanks again for this library.

Kind regards, Francisco Abad.

mayer79 commented 3 years ago

This is very comforting to hear :-).

fanavarro commented 2 years ago

Hi again... I am reopening this issue because I've experienced it again... I've a complex R script but, in summary, I preprocess the data by doing the following:

  1. Remove incomplete rows. If a row contains NA in any of its columns, I remove the row.
  2. Remove near zero variance columns. This is detected through the nearZeroVar function from caret package, which is deterministic.
  3. Min-max data normalization.
  4. Outliers removal by using outForest. The concrete line is outliers = outForest(x, replace = "NA", seed = 12345), where x is the dataframe that is being processed. I also export a CSV file with the outliers detected with write.csv(outliers(outliers), file = outliersFile, sep = ",").
  5. Finally, I remove again those variables with a near zero variance.

So, I am using outForest in step 4. I thought that the problem was due to a change in my data, so I included a CSV export of the data at every step described above so that we can check the inputs and outputs of everything. I was obtaining the same results but then they changed again. Fortunately, I could check and compare the input and the output of the line outliers = outForest(x, replace = "NA", seed = 12345) for the different executions.

I compared the files from both executions by using Beyond Compare software, which is able to compare files at binary level. The following picture shows this comparison, where data0.csv is the original data, and datan.csv is the output of the step n described above. So the input for the outForest (data3.csv) is the same for both executions, but the output (data4.csv) is different.

[Image: Beyond Compare comparison of the CSV files from both executions]

I am attaching a zip file with the data3.csv (input of outForest), the data4.csv (output of outForest), and the outliers.csv (outliers detected) for both executions. In theory the seed and the input are the same for both executions.

I think this is a very strange issue. My script is long and I usually test some parts separately in an interactive RStudio session. I observed that the results changed after I cleared the variables in RStudio, and when I modified the R script to include new things that do not affect the preprocessing described above; as you can see, the input and the seed are the same for both executions. Moreover, both executions were done by clicking on "source" in RStudio. Maybe it is some issue related to RStudio, or some state variable regarding random numbers... I have no idea. Please contact me if you need further information, and thanks again.

Francisco Abad.

outForestExecutions.zip

mayer79 commented 2 years ago

Hello @fanavarro : Maybe your preprocessing depends on a random seed? You can test by putting on top of your script a set.seed(33420) and see if this solves the problem.

fanavarro commented 2 years ago

Hi @mayer79, outForest is the only procedure that uses random numbers in my preprocessing step. I'm not aware of how R deals with random seeds. Should I expect different results for the line outliers = outForest(x, replace = "NA", seed = 12345) if I previously set different seeds with set.seed()? In the following code, is it possible that outliers1 is different from outliers2?

set.seed(11111)
outliers1 = outForest(x, replace = "NA", seed = 12345)

set.seed(22222)
outliers2 = outForest(x, replace = "NA", seed = 12345)

I thought that set.seed did not affect the results if I was using the 'seed' parameter in the outForest function.

mayer79 commented 2 years ago

The output of outForest depends on the input x. So if the input changes, the output also changes, no matter the seed. If you can prove that the immediate input to outForest is the same and the immediate output differs despite a seed, then this could point to a problem in outForest.
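One caveat when proving that the input is identical via exported CSV files: write.csv() writes numbers with about 15 significant digits, so two data frames that differ only in the last bits of a double can produce byte-identical CSV files. The sketch below illustrates this with toy data (x1 and x2 are made-up stand-ins, not the data from this issue) and shows saveRDS() as one way to store the exact in-memory object for comparison.

```r
# Toy data: the two values differ only in the last bits of the double
x1 <- data.frame(value = 0.3)
x2 <- data.frame(value = 0.1 + 0.2)

csv1 <- tempfile(fileext = ".csv")
csv2 <- tempfile(fileext = ".csv")
write.csv(x1, csv1, row.names = FALSE)
write.csv(x2, csv2, row.names = FALSE)

# Byte-level comparison of the exported files via checksums
csv_match <- unname(tools::md5sum(csv1) == tools::md5sum(csv2))

# Exact comparison of the in-memory objects
objects_match <- identical(x1, x2)

# saveRDS()/readRDS() round-trips the object without rounding
rds <- tempfile(fileext = ".rds")
saveRDS(x2, rds)
rds_match <- identical(readRDS(rds), x2)

print(c(csv_match = csv_match, objects_match = objects_match, rds_match = rds_match))
```

Saving the input with saveRDS() right before the outForest() call in each run, and comparing the loaded objects with identical(), would therefore be a stricter check than comparing CSV exports.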

fanavarro commented 2 years ago

Thanks for your quick answer, @mayer79. Sorry for the misunderstanding in my previous comment; the code I included was meant to show independent executions of outForest with the same input 'x' and the same 'seed' parameter, each preceded by a set.seed call with a different seed. I did not explain it very well.

Coming back to my issue, as I commented before, I included some code in my script in order to save csv files of my data at each step in the preprocessing procedure. This is detailed in one of my previous comments but, in summary, I have 2 executions of outForest with the same input and the same seed (by using the 'seed' parameter of the outForest function) that generate different results. In my previous comment I attached the information I have for both executions: the input (which is the same for both), and the output (which is different), as well as the outliers detected (which are also different in both executions).

Normally I can execute my script many times obtaining the same results but, at some point, the results change (sometimes after making minor changes in the script that affect neither the data nor the preprocessing, just things like formatting a plot).

I am not aware of how outForest uses the 'seed' parameter or whether the function set.seed has an impact on the outForest execution despite using the 'seed' parameter.

I have a colleague, @neobernad, who developed an R library that also uses random seeds. I know that he also experienced this kind of issue when he tried to publish the library in Bioconductor: the unit tests failed on the Bioconductor server because the library was obtaining different results from a bootstrapping process. I am mentioning him because maybe he can help us.

Thank you again, Francisco Abad.

neobernad commented 2 years ago

Hello everyone,

I have experienced something similar when developing our package: for the same input, the output was different depending on the execution. I had to granularly check the packages that I was importing to see how they were using the seeds, and I also had to put the following code after the definition of the function that provided the core functionality:

old.seed <- .Random.seed # Back up the current seed in the system
on.exit( { .Random.seed <<- old.seed } ) # Recover whatever seed was stored in the back up upon exit
if (!is.null(seed)) set.seed(seed) # Set user's seed

Best, José Antonio
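A slightly more defensive variant of the same save-and-restore idea is sketched below. It is only an illustration, not code from either package: with_local_seed is a hypothetical helper name. It covers the case where .Random.seed does not exist yet (a fresh session in which the RNG has not been used) and always writes back to the global environment.

```r
# Hypothetical helper: evaluate expr under a fixed seed, then put the
# global RNG state back to what it was before the call.
with_local_seed <- function(seed, expr) {
  # .Random.seed only exists once the RNG has been used or seeded
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) {
    old_seed <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)

  if (!is.null(seed)) set.seed(seed)
  expr  # lazily evaluated here, i.e. after the seed has been set
}

# Usage: the caller's random stream continues as if the helper never ran
set.seed(42)
inner <- with_local_seed(12345, runif(3))
after <- runif(1)

set.seed(42)
expected <- runif(1)
print(identical(after, expected))
```

The design choice is that the caller's random stream is left untouched, so adding or removing a seeded call somewhere in a long script does not shift the random numbers drawn later.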

mayer79 commented 2 years ago

@neobernad Thanks for your input. outForest() does not reset the seed after execution. This is not necessarily the recommended way, but an easy one.

Like this:

  if (!is.null(seed)) {
    set.seed(seed)
  }

If the same seed is passed to multiple calls of outForest() and the input itself is 100% identical, then I would expect the same output. Except maybe if an internal C seed within ranger() interferes?

@fanavarro: If you call set.seed before the snippet above, this would not have an impact on outForest(), so I don't understand what is going on.
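This can be illustrated with base R alone. In the sketch below, seeded_draw is a hypothetical stand-in (not part of outForest) that seeds itself the same way as the snippet above; an earlier set.seed() call with a different value does not change what it returns.

```r
# Hypothetical stand-in for a function that handles its seed argument
# like the snippet above
seeded_draw <- function(n, seed = NULL) {
  if (!is.null(seed)) {
    set.seed(seed)
  }
  runif(n)
}

set.seed(11111)
draw1 <- seeded_draw(5, seed = 12345)

set.seed(22222)
draw2 <- seeded_draw(5, seed = 12345)

# The internal set.seed(12345) overrides whatever was set before
print(identical(draw1, draw2))
```

Note that this holds as long as the RNG settings themselves (see RNGkind()) are the same for both calls.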

fanavarro commented 2 years ago

@mayer79 and @neobernad, thanks for your messages. I've checked the code of outForest and I realized that the management of the seed parameter is very simple, as it consists of calling set.seed(seed) if seed is specified, so I think the problem is outside the library.

I've been reading some documentation on this, and it seems that the random number generator changed from R 3.5 to R 3.6 [1, 2]. Moreover, there are some global settings that can modify the behavior of the random generator through RNGkind [3].

My thought here is that some libraries may be modifying these global settings for generating random numbers so that they obtain the same results in R 3.5 and R 3.6. I have to research this further to identify the cause of my issue but this definitely has nothing to do with outForest, so I am closing this issue again.

Thanks again, @mayer79 and @neobernad for your time and your responses.

[1] https://community.rstudio.com/t/getting-different-results-with-set-seed/31624
[2] https://stackoverflow.com/questions/53755955/different-results-even-with-set-seed-in-r
[3] https://stackoverflow.com/questions/47199415/is-set-seed-consistent-over-different-versions-of-r-and-ubuntu
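To check whether the global RNG settings differ between two runs, RNGkind() can be printed right before the seeded call. The sketch below assumes R 3.6.0 or later, where the default sample.kind changed from "Rounding" to "Rejection". It records the current settings, temporarily switches the sampling method, and compares draws made under the same seed.

```r
# Record the current settings: RNG kind, normal kind, sample kind
kinds_before <- RNGkind()
print(kinds_before)

set.seed(12345)
a <- sample(100, 5)

# Temporarily switch to the pre-3.6 sampling method
# (R warns about the non-uniform "Rounding" sampler, hence suppressWarnings)
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(12345)
b <- sample(100, 5)

# Restore the original sampling method
RNGkind(sample.kind = kinds_before[3])

# Same seed, different sample.kind: compare the two draws
print(identical(a, b))
```

Logging the output of RNGkind() next to each exported result would show whether another package or script changed these settings between executions.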

mayer79 commented 2 years ago

@fanavarro: thanks for digging into this. I will still keep my eyes open for possible explanations and check ranger().