mervesa / HiCDCPlus

HiCDCPlus
15 stars 2 forks source link

Setting seed does not make results deterministic #28

Open sandyfloren opened 1 month ago

sandyfloren commented 1 month ago

In the Quickstart, the following lines of code imply that setting the seed will control subsampling.

#expand features for modeling
gi_list<-expand_1D_features(gi_list)
#run HiC-DC+ on 2 cores
set.seed(1010) #HiC-DC downsamples rows for modeling
gi_list<-HiCDCPlus_parallel(gi_list,ncore=2)
head(gi_list)

After writing the gi_list to a file using gi_list_write(), I get different results when re-running the code with the exact same seed.

Most of the interactions in the output files have identical values in the "counts" column, but all the statistics are different (pvalue, qvalue, mu, stdev). This has the unfortunate consequence that when I threshold by a q-value of 0.1, I get different numbers of "significant" interactions called from one run to the next.

AndrewC160 commented 1 month ago

Hello Sandy,

I have not run into this problem personally because I haven't been able to use the HiCDCPlus_parallel function, but I've run into similar issues using seeds with the parallel package before:

For each instance started by parallel, seeds must be set separately. Maybe you could try running a modified version of HiCDCPlus_parallel() like the one I included below (added code is bracketed by ###s). If Im not mistaken, this version will call set.seed() for each instance and use the seed_input value. I cannot test it, however, because last time HiCDCPlus_parallel() crashed the core I ran it on :-)

If this doesn't work, I imagine you could just run your code with the regular HiCDCPlus() function and it should use the seed you set in your main environment.

HiCDCPlus_parallel <- function(gi_list, covariates = NULL, chrs = NULL, distance_type = "spline", model_distribution = "nb", binned = TRUE, df = 6, Dmin = 0, Dmax = 2e+06, ssize = 0.01, splineknotting = "uniform", ncore=NULL,seed_input=1010) {
    options(scipen = 9999, digits = 4)
    gi_list_validate(gi_list)
    if (is.null(chrs)) chrs <- names(gi_list)
    # check if D and counts exist on each chromosome
    if (!(all(vapply(gi_list, function(x) sum(colnames(S4Vectors::mcols(x)) %in% c("counts", "D")) == 2, TRUE)))) {
        stop("Some gi_list elements do not contain pairwise distances D
    and counts, the minimum set of features needed for HiC-DC+ modeling.")
    }
    if (!model_distribution %in% c("nb", "nb_hurdle", "nb_vardisp")) {
        stop("Allowable options for model_distribution are 'nb','nb_hurdle', and
    nb_vardisp'.")
    }
    if (is.null(ncore)) ncore<-parallel::detectCores()-1
    cl <- parallel::makeCluster(ncore)
    parallel::clusterEvalQ(cl,library("dplyr"))
    parallel::clusterEvalQ(cl,library("rlang"))
    ###Maybe this will work?
    parallel::clusterEvalQ(cl,set.seed(seed_input))
    ###    
    gi_list[chrs] <- parallel::parLapply(cl, gi_list[chrs], HiCDCPlus_chr,covariates=covariates,
                         distance_type=distance_type,
                         model_distribution=model_distribution,
                         binned=binned,
                         df=df,
                         Dmin=Dmin,
                         Dmax=Dmax,
                         ssize=ssize,
                         splineknotting = splineknotting,
                         model_filepath = NULL)
    parallel::stopCluster(cl)
    return(gi_list)
}