tuanle618 / AEDA

AEDA - Automated Data Exploratory Analysis in R
GNU General Public License v3.0

INFO: Weird random.seed for MDS and PCA Reports #54

Open tuanle618 opened 6 years ago

tuanle618 commented 6 years ago

When trying to run fastReport I noticed that the IDs of the PCA and MDS reports are the same. I believe the underlying functions (prcomp() for PCA; cmdscale(), isoMDS and possibly the other methods in makeMDSTask() for MDS) set the seed after execution to the same value, because makeReport(pca.result) and makeReport(mds.result) return identical report.ids. I investigated further: when another report, e.g. numsum, is created between MDS and PCA, and another one, e.g. catsum, is created after PCA, the IDs of numsum and catsum are also the same. I think this confirms my belief that makeMDS and makePCA, which run right before the makeReport step, somehow set the seed to the same number.

Reproducible error:

#start with clean R-session CTRL+SHIFT+F10

devtools::load_all()

set.seed(1)

my.mds.task = makeMDSTask(id = "swiss", data = swiss)
mds.analysis = makeMDSAnalysis(my.mds.task)
mds.report = makeReport(mds.analysis)

cluster.task = makeClusterTask(id = "iris", data = iris,
  method = "cluster.kmeans")
cluster.analysis = makeClusterAnalysis(cluster.task)
cluster.report = makeReport(cluster.analysis)

pca.task = makePCATask(id = "iris.test", data = iris, center = TRUE, target = "Species")
pca.result = makePCA(pca.task)
pca.report = makeReport(pca.result)

#compare IDs
cluster.report$report.id
#[1] "T6cG3IC7CQJg3pcu"

mds.report$report.id
#[1] "oWz26cG3IC7CQJg3"

pca.report$report.id
#[1] "oWz26cG3IC7CQJg3"

#remove workspace
rm(list = ls())
#start with new session, CTRL+SHIFT+F10

devtools::load_all()

set.seed(1)

my.mds.task = makeMDSTask(id = "swiss", data = swiss)
mds.analysis = makeMDSAnalysis(my.mds.task)
mds.report = makeMDSAnalysisReport(mds.analysis)

pca.task = makePCATask(id = "iris.test", data = iris, center = TRUE, target = "Species")
pca.result = makePCA(pca.task)
pca.report = makePCAReport(pca.result)

cluster.task = makeClusterTask(id = "iris", data = iris,
  method = "cluster.kmeans")
cluster.analysis = makeClusterAnalysis(cluster.task)
cluster.report = makeClusterAnalysisReport(cluster.analysis)

mds.report$report.id
#[1] "oWz26cG3IC7CQJg3"

pca.report$report.id
#[1] "oWz26cG3IC7CQJg3"

cluster.report$report.id
#[1] "T6cG3IC7CQJg3pcu"

###try even more reports:
rm(list=ls())

#clean r session

devtools::load_all()

#try different seed
set.seed(10)

#for MDS try even another method
my.mds.task = makeMDSTask(id = "swiss", data = swiss, method = "isoMDS")
mds.analysis = makeMDSAnalysis(my.mds.task)
mds.report = makeReport(mds.analysis)

num.sum.task = makeNumSumTask("iris.test", iris, target = "Species")
num.sum = makeNumSum(num.sum.task)
num.sum.report = makeReport(num.sum)

pca.task = makePCATask(id = "iris.test", data = iris, center = TRUE, target = "Species")
pca.result = makePCA(pca.task)
pca.report = makeReport(pca.result)

cat.sum.task = makeCatSumTask("iris.test", iris, target = "Species")
cat.sum = makeCatSum(cat.sum.task)
cat.sum.report = makeReport(cat.sum)

cluster.task = makeClusterTask(id = "iris", data = iris,
  method = "cluster.kmeans")
cluster.analysis = makeClusterAnalysis(cluster.task)
cluster.report = makeReport(cluster.analysis)

mds.report$report.id
#[1] "oWz26cG3IC7CQJg3"

num.sum.report$report.id
#[1] "mcu73QN9ORHKrj73"

pca.report$report.id
#[1] "oWz26cG3IC7CQJg3" ---> SAME

cat.sum.report$report.id
#[1] "mcu73QN9ORHKrj73" ---> now catsum has the same report ID as numsum, which was called right after mds

cluster.report$report.id
#[1] "T6cG3IC7CQJg3pcu"

As a workaround for now, I manually set a different seed (89) in makeReport.PCAObj and makePCAReport, which fixes the issue.

MiGraber commented 6 years ago

I don't think the problem is which seed is set, but that a seed is set at all. This makes everything deterministic. I will check whether adding set.seed(Sys.time()) before the random ID generation solves this.
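
A minimal sketch of the effect described here, assuming the report ID is drawn with the global random number generator. `toyId()` and `toyPlot()` are hypothetical stand-ins, not functions from AEDA: once any call fixes the seed internally, the next ID is the same regardless of the seed the user chose.

```r
# Hypothetical stand-in for the package's random ID generator
toyId = function(n = 16L) {
  paste(sample(c(letters, LETTERS, 0:9), n, replace = TRUE), collapse = "")
}

# Hypothetical stand-in for any call that fixes the seed internally
toyPlot = function() set.seed(123)

set.seed(1)
toyPlot()
id.a = toyId()

set.seed(10)  # a different user seed makes no difference
toyPlot()
id.b = toyId()

identical(id.a, id.b)
#[1] TRUE

# Reseeding from the clock right before drawing the ID breaks the tie
toyPlot()
set.seed(as.integer(Sys.time()))
id.c = toyId()
```

Note that set.seed() interprets its argument as an integer, so a seed taken from Sys.time() only has a resolution of one second: two IDs generated this way within the same second would still collide.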

MiGraber commented 6 years ago

Why is there a seed in makePCAReport?

tuanle618 commented 6 years ago

@MiGraber Just in case the user calls this function instead of the S3 method. If you don't set a seed, it still happens. You can try it.

MiGraber commented 6 years ago

OK, I think I got it. It seems to be the ggscatter function. I removed all set.seed calls in our functions. -> As you said, the issue is still there. Then I tracked the .Random.seed variable and checked when it changed. -> I don't know why, but ggscatter seems to set a seed. Adding set.seed(Sys.time()) after ggscatter fixes the bug.
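
The .Random.seed tracking can be automated roughly as follows. This is only a sketch, and `resetsRngState()` is a hypothetical helper, not part of AEDA: it runs a function from two different starting seeds and checks whether the RNG ends up in the same state both times, which suggests the function fixes the seed internally. It overwrites the global seed, so it is meant for diagnostics only.

```r
# Hypothetical diagnostic helper: TRUE if `fun` leaves the RNG in the same
# state no matter which seed was set before calling it.
resetsRngState = function(fun) {
  set.seed(1)
  fun()
  state.a = get(".Random.seed", envir = globalenv())
  set.seed(2)
  fun()
  state.b = get(".Random.seed", envir = globalenv())
  identical(state.a, state.b)
}

fixes.seed = resetsRngState(function() set.seed(123))  # fixes the seed
draws.only = resetsRngState(function() runif(1))       # only consumes random numbers
no.rng = resetsRngState(function() mean(1:10))         # does not touch the RNG

c(fixes.seed, draws.only, no.rng)
#[1]  TRUE FALSE FALSE

# Possible use on a suspect plotting call (requires ggpubr, not run here):
# resetsRngState(function() ggpubr::ggscatter(mtcars, x = "wt", y = "mpg",
#   label = rownames(mtcars)))
```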

MiGraber commented 6 years ago

And regarding the seed in makePCAReport: why is there a seed in the S3 method?

Here is an example with ggscatter:

library(ggpubr)
# Load data
data("mtcars")
df <- mtcars
df$cyl <- as.factor(df$cyl)
head(df[, c("wt", "mpg", "cyl")], 3)

for (i in 1:2) {
  mds.plot = ggscatter(df, x = "wt", y = "mpg",
    label = rownames(df),
    size = 1,
    repel = FALSE) + theme_classic(base_size = 10) + ggtitle("title")
  print(runif(1))
}

returns

[1] 0.2875775
[1] 0.2875775
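
The repeated value itself is a hint. Assuming R's default generator (Mersenne-Twister), 0.2875775 is the first runif(1) draw after set.seed(123), which is consistent with (though not proof of) the plotting call fixing the seed to 123 internally:

```r
# First uniform draw after seeding with 123, default RNG kind assumed
set.seed(123)
first.draw = runif(1)
print(round(first.draw, 7))
#[1] 0.2875775
```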

tuanle618 commented 6 years ago

Thank you @MiGraber. I will push a small change to a test so that Travis passes; could you then please add set.seed(Sys.time()) to the relevant functions to fix the bug? Please also remove the set.seed(89) in the two makePCAReport functions in the tle_vignette branch.

I added a seed to both functions in case the user calls either makeReport(pca.result) or makePCAReport(pca.result).

MiGraber commented 6 years ago

There seem to be more functions that do this, for example fviz_nbclust. I think I will take another approach to fix this.

tuanle618 commented 6 years ago

Alright, great. If you need support, let me know. Besides that, there are additional plot functions that might set a seed:

1) getMDSAnalysis calls ggscatter as well: https://github.com/ptl93/AEDA/blob/3f1685af61dcc620d98933672d2f4c91f9ef7bde/R/getMDSAnalysis.R#L55

2) getClusterAnalysis with all its plotting methods, mostly fviz_nbclust [for the analysis to get the optimal number of clusters], fviz_cluster [for plotting the cluster result], fviz_dist [for plotting the distances in the case of hierarchical clustering], fviz_dend [for plotting the result of hierarchical clustering], fviz_silhouette [for the silhouette plot] ... there might be more; just scan the code, anything with fviz_ might be problematic.

https://github.com/ptl93/AEDA/blob/tle_vignette/R/getClusterAnalysis.R

3) makePCA calls fviz_eigen() , autoplot() and fviz_pca_ind() https://github.com/ptl93/AEDA/blob/3f1685af61dcc620d98933672d2f4c91f9ef7bde/R/makePCA.R#L40-L49

You could just set a seed based on Sys.time() right before each of the three corresponding get-functions returns.
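
As an alternative to reseeding from the clock, the RNG state could be saved before a plotting call and restored afterwards, so whatever the call does to the seed has no lasting effect. The following base R sketch assumes this would be acceptable for the get-functions; `withPreservedSeed()` is a hypothetical helper, not part of AEDA.

```r
# Hypothetical helper: evaluate `expr`, then restore the RNG state that was
# active before the call.
withPreservedSeed = function(expr) {
  # Make sure .Random.seed exists (it is created on first RNG use)
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) runif(1)
  old.seed = get(".Random.seed", envir = globalenv(), inherits = FALSE)
  on.exit(assign(".Random.seed", old.seed, envir = globalenv()))
  expr  # lazily evaluated here, before on.exit() runs
}

set.seed(42)
expected = runif(1)

set.seed(42)
# set.seed(123) stands in for a plotting call that fixes the seed
invisible(withPreservedSeed(set.seed(123)))
actual = runif(1)

identical(expected, actual)
#[1] TRUE
```

With this approach the user's own set.seed() keeps working as expected, whereas reseeding from Sys.time() makes everything after the plot non-reproducible.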

MiGraber commented 6 years ago

After I found out that more functions set seeds, I concentrated on reportID(). The solution is not very nice, though. If you want to test it: I don't get this bug anymore.
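
The actual change to reportID() is not shown in this thread. One possible shape of such a fix, given purely as an assumption, is an ID generator that seeds itself from the clock, the process ID and a session counter, and then restores the caller's RNG state. `randomId()` and `id.counter` below are hypothetical names.

```r
# Hypothetical session counter so that two calls never share a seed
id.counter = new.env()
id.counter$n = 0

# Hypothetical ID generator that does not depend on the global seed
randomId = function(n = 16L) {
  had.seed = exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had.seed) old.seed = get(".Random.seed", envir = globalenv(), inherits = FALSE)
  # Put the caller's RNG state back when the function exits
  on.exit({
    if (had.seed) {
      assign(".Random.seed", old.seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  id.counter$n = id.counter$n + 1
  # Microsecond clock + process ID + counter, folded into the integer range
  seed = (as.numeric(Sys.time()) * 1e6 + Sys.getpid() + id.counter$n) %%
    .Machine$integer.max
  set.seed(as.integer(seed))
  paste(sample(c(letters, LETTERS, 0:9), n, replace = TRUE), collapse = "")
}

set.seed(1)
id.first = randomId()
set.seed(1)
id.second = randomId()  # same user seed, but the ID should still differ
```

Because the previous state is restored on exit, generating an ID does not disturb the random numbers the user draws afterwards.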