transbioZI / RMTL

Regularized Multi-task Learning in R
https://CRAN.R-project.org/package=RMTL

bugs in parallel #1

Closed armgong closed 5 years ago

armgong commented 5 years ago

Thank you for your work. There may be bugs in the parallel functions:
1. stopImplicitCluster() needs to be called to stop the cluster when the computation finishes; otherwise the background R processes do not close.
2. Sometimes parallel cvMTL and single-threaded cvMTL give different results for Lam1.min:

set.seed(202000)
library(RMTL)
data <- Create_simulated_data(t=1,p=50,n=200000,
                              Regularization="Lasso",
                              type="Classification")

cvfit<-cvMTL(data$X, data$Y, type="Classification", 
                         Regularization="Lasso",nfolds = 10)

cvfit1<-cvMTL(data$X, data$Y, type="Classification", 
                         Regularization="Lasso",parallel = T,
                         ncores = 2,nfolds = 10)
cvfit$Lam1.min==cvfit1$Lam1.min

#run it 
> set.seed(202000)
> library(RMTL)
> data <- Create_simulated_data(t=1,p=50,n=200000,
+                               Regularization="Lasso",
+                               type="Classification")
> 
> cvfit<-cvMTL(data$X, data$Y, type="Classification", 
+                          Regularization="Lasso",nfolds = 10)
> 
> cvfit1<-cvMTL(data$X, data$Y, type="Classification", 
+                          Regularization="Lasso",parallel = T,
+                          ncores = 2,nfolds = 10)
> cvfit$Lam1.min==cvfit1$Lam1.min
[1] FALSE
> cvfit$Lam1.min
[1] 1e-04
> cvfit1$Lam1.min
[1] 0.001
> cvfit
$Lam1_seq
[1] 1e+01 1e+00 1e-01 1e-02 1e-03 1e-04

$Lam1.min
[1] 1e-04

$Lam2
[1] 0

$cvm
[1] 0.499590 0.499590 0.221030 0.041510 0.035485 0.035435

attr(,"class")
[1] "cvMTL"
> cvfit1
$Lam1_seq
[1] 1e+01 1e+00 1e-01 1e-02 1e-03 1e-04

$Lam1.min
[1] 0.001

$Lam2
[1] 0

$cvm
[1] 0.500890 0.500890 0.220945 0.041620 0.035445 0.035445

attr(,"class")
[1] "cvMTL"
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 
[2] LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RMTL_0.9

loaded via a namespace (and not attached):
[1] compiler_3.5.3    parallel_3.5.3    tools_3.5.3       codetools_0.2-16 
[5] doParallel_1.0.14 iterators_1.0.10  foreach_1.4.4    
> 
hank9cao commented 5 years ago

Hi,

1. "Need to add stopImplicitCluster() to stop the cluster when the computation finishes; otherwise the background R processes do not close."

Reply: Thanks, I will add it later.
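
For illustration, the cleanup requested in point 1 can be sketched with base R's parallel package, which uses an explicit cluster. This is a minimal sketch and not RMTL's actual implementation; the helper name run_folds_parallel and its arguments are hypothetical. With doParallel's implicit cluster, the analogous cleanup call would be stopImplicitCluster().

```r
library(parallel)

# Hypothetical helper (not part of RMTL): apply fold_fun to each element of
# folds on a temporary cluster and always shut the workers down afterwards.
run_folds_parallel <- function(folds, fold_fun, ncores = 2) {
  cl <- makeCluster(ncores)
  # on.exit() runs even if fold_fun raises an error, so no background
  # R processes are left behind.
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, folds, fold_fun)
}

res <- run_folds_parallel(1:4, function(i) i^2)
unlist(res)  # 1 4 9 16
```

Registering the cleanup with on.exit() right after the cluster is created keeps the shutdown next to the startup, so it is hard to forget on any return path.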

2. "Sometimes parallel cvMTL and single-threaded cvMTL give different results for Lam1.min."

Reply: The difference you observed is not caused by the parallel computation but by the sampling variability of cross-validation. In cross-validation, subjects are assigned to folds at random, which can lead to slightly different results between runs. The results should nevertheless be similar.

Check cvfit$cvm, which contains the average cross-validation prediction error for each candidate Lam1:

cvfit$cvm[c(5,6)]  => 0.035485 0.035435
cvfit1$cvm[c(5,6)] => 0.035445 0.035445 (<= same error)

Compare the prediction errors for 0.001 and 1e-4 shown above: the held-out-fold prediction performance is nearly identical. Because of the sampling variability of cross-validation, either parameter could therefore be selected.

In your specific case, cvfit1 selected 0.001 because the prediction error for lam1=0.001 is the same as for lam1=1e-4, so the algorithm tends to select the sparser solution.
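
The selection behaviour can be reproduced from the numbers posted in this thread. The sketch below assumes the minimum is picked with a first-match rule such as which.min(); that is an assumption made for illustration and not a statement about RMTL's internals. Because Lam1_seq is ordered from large to small, a first-match rule returns the larger (sparser) value when two errors tie.

```r
# Values copied from the cvfit and cvfit1 output in this thread.
lam_seq      <- c(1e+01, 1e+00, 1e-01, 1e-02, 1e-03, 1e-04)
cvm_serial   <- c(0.499590, 0.499590, 0.221030, 0.041510, 0.035485, 0.035435)
cvm_parallel <- c(0.500890, 0.500890, 0.220945, 0.041620, 0.035445, 0.035445)

# which.min() returns the index of the FIRST minimum, so on a tie the
# earlier (larger) lambda in the sequence wins.
lam_serial   <- lam_seq[which.min(cvm_serial)]    # 1e-04
lam_parallel <- lam_seq[which.min(cvm_parallel)]  # 0.001
```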

Regards, Hank

armgong commented 5 years ago

Reply: The difference you observed is not caused by the parallel computation but by the sampling variability of cross-validation. In cross-validation, subjects are assigned to folds at random, which can lead to slightly different results between runs. The results should nevertheless be similar.

I am still a little confused. I already use set.seed(202000) to avoid different sampling. I also read your cross-validation code, and I think that if we set the seed, the CV partition should be the same:

getCVPartition <- function(Y, cv_fold, stratify){
task_num = length(Y);

randIdx <- lapply(Y, function(x) sample(1:length(x),
           length(x), replace = FALSE))        
cvPar = {};
for (cv_idx in 1: cv_fold){
    # build cross-validation data splits for each task.
    cvTrain = {};
    cvTest = {};

    #stratified cross validation
    for (t in 1: task_num){
        task_sample_size <- length(Y[[t]]);

        if (stratify){
            ct <- which(Y[[t]][randIdx[[t]]]==-1);
            cs <- which(Y[[t]][randIdx[[t]]]==1);
            ct_idx <- seq(cv_idx, length(ct), cv_fold);
            cs_idx <- seq(cv_idx, length(cs), cv_fold);
            te_idx <- c(ct[ct_idx], cs[cs_idx]);
            tr_idx <- seq(1,task_sample_size)[
                !is.element(1:task_sample_size, te_idx)];

        } else {
            te_idx <- seq(cv_idx, task_sample_size, by=cv_fold)
            tr_idx <- seq(1,task_sample_size)[
                !is.element(1:task_sample_size, te_idx)];
        }

        cvTrain[[t]] = randIdx[[t]][tr_idx]
        cvTest[[t]] = randIdx[[t]][te_idx]
   }

    cvPar[[cv_idx]]=list(cvTrain, cvTest);
}
return(cvPar)
}
hank9cao commented 5 years ago

Hi, can you test this to see whether you still get a different result, please?

set.seed(202000)
cvfit<-cvMTL(data$X, data$Y, type="Classification", Regularization="Lasso",nfolds = 10)
set.seed(202000)
cvfit1 <- cvMTL(data$X, data$Y, type="Classification", Regularization="Lasso",parallel = T,ncores = 2,nfolds = 10)

Regards, Hank

armgong commented 5 years ago

Yes, calling set.seed twice gives the same result. But it is strange: why do we need to call set.seed twice?

> cvfit
$Lam1_seq
[1] 1e+01 1e+00 1e-01 1e-02 1e-03 1e-04

$Lam1.min
[1] 0.001

$Lam2
[1] 0

$cvm
[1] 0.502010 0.502010 0.221005 0.041595 0.035435 0.035450

attr(,"class")
[1] "cvMTL"
> cvfit1
$Lam1_seq
[1] 1e+01 1e+00 1e-01 1e-02 1e-03 1e-04

$Lam1.min
[1] 0.001

$Lam2
[1] 0

$cvm
[1] 0.502010 0.502010 0.221005 0.041595 0.035435 0.035450

attr(,"class")
[1] "cvMTL"
hank9cao commented 5 years ago

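Because set.seed() only resets the random number generator at the point where it is called. Every random draw after that advances the generator state, so the second cvMTL call starts from a different state unless the seed is set again. Check this: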

set.seed(202000)
sample(1:10)
sample(1:10)

and this:

set.seed(202000)
sample(1:10)
set.seed(202000)
sample(1:10)
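
The same effect can be shown on fold assignment directly. The function make_folds below is a simplified, hypothetical stand-in for a cross-validation partition (shuffle the indices, then deal them into folds); it is not RMTL's getCVPartition, only a minimal sketch of why two consecutive calls see different partitions.

```r
# Hypothetical, simplified partition: shuffle 1..n and deal into k folds.
make_folds <- function(n, k) {
  idx <- sample(seq_len(n))            # consumes random numbers
  split(idx, rep_len(seq_len(k), n))   # fold labels 1..k, recycled
}

set.seed(202000)
f1 <- make_folds(20, 5)
f2 <- make_folds(20, 5)  # RNG state has advanced since set.seed()

set.seed(202000)
f3 <- make_folds(20, 5)  # same starting state as f1

identical(f1, f3)  # TRUE: same seed immediately before the call
identical(f1, f2)  # almost certainly FALSE: different RNG state
```

Setting the seed immediately before each call that should be comparable, as in the two-line test above, is the simplest way to get identical partitions for the serial and parallel runs.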