Closed: armgong closed this issue 5 years ago.
Hi, thank you for your work, but there may be a bug in the parallel functions. 1. You need to add stopImplicitCluster() to stop the cluster when the computation finishes; otherwise the R background processes will not close. Reply: Thanks, I will add it later.
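(For reference, a minimal sketch of the cleanup pattern, assuming the parallel path registers an implicit doParallel cluster; the foreach body below is just a placeholder for the per-fold work, not RMTL's actual code:)

~~~R
library(doParallel)   # also attaches foreach and parallel

registerDoParallel(cores = 2)             # starts background R workers

# placeholder for the per-fold work done inside cvMTL
cv_err <- foreach(i = 1:6, .combine = c) %dopar% {
  mean(rnorm(100)^2)
}

stopImplicitCluster()                     # releases the workers; without this
                                          # the background processes linger
~~~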
2. Sometimes parallel cvMTL and single-threaded cvMTL give a different Lam1.min. Reply: Actually, the difference you observed is not linked to the parallel computation; it is due to the sampling variability of cross-validation. In cross-validation, subjects are randomly grouped, which might lead to slightly different results. But the results should be similar.
Check cvfit$cvm, which contains the averaged CV prediction error for each candidate Lam1:

~~~R
cvfit$cvm[c(5, 6)]
#=> 0.035485 0.035435
cvfit1$cvm[c(5, 6)]
#=> 0.035445 0.035445   (<= same error)
~~~
See the prediction errors for lam1 = 0.001 and lam1 = 1e-4 demonstrated above: they have quite similar leave-fold-out prediction performance. Therefore, due to the sampling variability of cross-validation, either parameter could be selected.
In your specific case, cvfit1 selected 0.001 because the prediction error for lam1 = 0.001 is the same as for lam1 = 1e-4, so the algorithm tends to select the sparser solution.
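(As an illustration of the tie-breaking, here is a sketch that assumes selection is simply which.min() over cvm, with Lam1_seq ordered from the strongest penalty to the weakest; the assumption is mine, but it reproduces the behavior you saw:)

~~~R
Lam1_seq <- c(1e+01, 1e+00, 1e-01, 1e-02, 1e-03, 1e-04)
cvm      <- c(0.502010, 0.502010, 0.221005, 0.041595, 0.035445, 0.035445)

# which.min() returns the FIRST index among tied minima; because
# Lam1_seq runs from the strongest penalty to the weakest, a tie
# resolves to the larger penalty, i.e. the sparser model
Lam1_seq[which.min(cvm)]   # 0.001
~~~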
Regards, Hank
> Reply: Actually, the difference you observed is not linked to the parallel computation; it is due to the sampling variability of cross-validation. In cross-validation, subjects are randomly grouped, which might lead to slightly different results. But the results should be similar.
I'm still a little confused. I already use set.seed(20200) to avoid different sampling, and I also read your cross-validation code. I think that if we call set.seed, the CV partition should be the same:
~~~R
getCVPartition <- function(Y, cv_fold, stratify){
    task_num = length(Y);

    randIdx <- lapply(Y, function(x) sample(1:length(x),
        length(x), replace = FALSE))

    cvPar = {};
    for (cv_idx in 1:cv_fold){
        # build cross-validation data splittings for each task
        cvTrain = {};
        cvTest = {};

        # stratified cross-validation
        for (t in 1:task_num){
            task_sample_size <- length(Y[[t]]);

            if (stratify){
                ct <- which(Y[[t]][randIdx[[t]]] == -1);
                cs <- which(Y[[t]][randIdx[[t]]] == 1);
                ct_idx <- seq(cv_idx, length(ct), cv_fold);
                cs_idx <- seq(cv_idx, length(cs), cv_fold);
                te_idx <- c(ct[ct_idx], cs[cs_idx]);
                tr_idx <- seq(1, task_sample_size)[
                    !is.element(1:task_sample_size, te_idx)];
            } else {
                te_idx <- seq(cv_idx, task_sample_size, by = cv_fold)
                tr_idx <- seq(1, task_sample_size)[
                    !is.element(1:task_sample_size, te_idx)];
            }

            cvTrain[[t]] = randIdx[[t]][tr_idx]
            cvTest[[t]] = randIdx[[t]][te_idx]
        }
        cvPar[[cv_idx]] = list(cvTrain, cvTest);
    }
    return(cvPar)
}
~~~
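And indeed the partition by itself looks reproducible under a fixed seed; a quick check with toy data (Y below is a hypothetical two-task label list, not my real data):

~~~R
Y <- list(sample(c(-1, 1), 50, replace = TRUE),
          sample(c(-1, 1), 60, replace = TRUE))   # two toy tasks

set.seed(20200)
p1 <- getCVPartition(Y, cv_fold = 10, stratify = FALSE)
set.seed(20200)
p2 <- getCVPartition(Y, cv_fold = 10, stratify = FALSE)

identical(p1, p2)   # TRUE: same seed, same partition
~~~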
Hi, can you test this to see if you still get a different result, please?
~~~R
set.seed(202000)
cvfit  <- cvMTL(data$X, data$Y, type = "Classification", Regularization = "Lasso", nfolds = 10)
set.seed(202000)
cvfit1 <- cvMTL(data$X, data$Y, type = "Classification", Regularization = "Lasso",
                parallel = TRUE, ncores = 2, nfolds = 10)
~~~
Regards, Hank
Yes, calling set.seed twice gives the same result, but it is strange: why do we need to call set.seed twice?
~~~R
> cvfit
$Lam1_seq
[1] 1e+01 1e+00 1e-01 1e-02 1e-03 1e-04

$Lam1.min
[1] 0.001

$Lam2
[1] 0

$cvm
[1] 0.502010 0.502010 0.221005 0.041595 0.035435 0.035450

attr(,"class")
[1] "cvMTL"

> cvfit1
$Lam1_seq
[1] 1e+01 1e+00 1e-01 1e-02 1e-03 1e-04

$Lam1.min
[1] 0.001

$Lam2
[1] 0

$cvm
[1] 0.502010 0.502010 0.221005 0.041595 0.035435 0.035450

attr(,"class")
[1] "cvMTL"
~~~
Because set.seed() only fixes the RNG state at the point where it is called; every subsequent random draw advances that state. Check this:

~~~R
set.seed(202000)
sample(1:10)
sample(1:10)   # differs from the first draw: the state has moved on
~~~

and this:

~~~R
set.seed(202000)
sample(1:10)
set.seed(202000)
sample(1:10)   # identical to the first draw: the state was reset
~~~
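The same thing happens with your two cvMTL calls: the first call consumes random draws inside getCVPartition(), so without re-seeding the second call starts from a shifted RNG state and draws different folds. A toy sketch of that failure mode, with sample() standing in for cvMTL's internal sampling:

~~~R
set.seed(202000)
folds_run1 <- sample(1:100)          # stands in for the 1st cvMTL's sampling
folds_run2 <- sample(1:100)          # a 2nd cvMTL without re-seeding
identical(folds_run1, folds_run2)    # FALSE: different folds are drawn,
                                     # hence possibly a different Lam1.min
~~~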