Open sandyfloren opened 4 months ago
Hello Sandy,
I have not run into this problem personally because I haven't been able to use the HiCDCPlus_parallel function, but I've run into similar issues using seeds with the parallel package before:
For each instance started by parallel
, seeds must be set separately. Maybe you could try running a modified version of HiCDCPlus_parallel() like the one I included below (added code is bracketed by ###s). If Im not mistaken, this version will call set.seed() for each instance and use the seed_input
value. I cannot test it, however, because last time HiCDCPlus_parallel() crashed the core I ran it on :-)
If this doesn't work, I imagine you could just run your code with the regular HiCDCPlus() function and it should use the seed you set in your main environment.
HiCDCPlus_parallel <- function(gi_list, covariates = NULL, chrs = NULL, distance_type = "spline", model_distribution = "nb", binned = TRUE, df = 6, Dmin = 0, Dmax = 2e+06, ssize = 0.01, splineknotting = "uniform", ncore=NULL,seed_input=1010) {
options(scipen = 9999, digits = 4)
gi_list_validate(gi_list)
if (is.null(chrs)) chrs <- names(gi_list)
# check if D and counts exist on each chromosome
if (!(all(vapply(gi_list, function(x) sum(colnames(S4Vectors::mcols(x)) %in% c("counts", "D")) == 2, TRUE)))) {
stop("Some gi_list elements do not contain pairwise distances D
and counts, the minimum set of features needed for HiC-DC+ modeling.")
}
if (!model_distribution %in% c("nb", "nb_hurdle", "nb_vardisp")) {
stop("Allowable options for model_distribution are 'nb','nb_hurdle', and
nb_vardisp'.")
}
if (is.null(ncore)) ncore<-parallel::detectCores()-1
cl <- parallel::makeCluster(ncore)
parallel::clusterEvalQ(cl,library("dplyr"))
parallel::clusterEvalQ(cl,library("rlang"))
###Maybe this will work?
parallel::clusterEvalQ(cl,set.seed(seed_input))
###
gi_list[chrs] <- parallel::parLapply(cl, gi_list[chrs], HiCDCPlus_chr,covariates=covariates,
distance_type=distance_type,
model_distribution=model_distribution,
binned=binned,
df=df,
Dmin=Dmin,
Dmax=Dmax,
ssize=ssize,
splineknotting = splineknotting,
model_filepath = NULL)
parallel::stopCluster(cl)
return(gi_list)
}
In the Quickstart, the following lines of code imply that setting the seed will control subsampling.
After writing the gi_list to a file using gi_list_write(), I get different results when re-running the code with the exact same seed.
Most of the interactions in the output files have identical values in the "counts" column, but all the statistics are different (pvalue, qvalue, mu, stdev). This has the unfortunate consequence that when I threshold by a q-value of 0.1, I get different numbers of "significant" interactions called from one run to the next.