mervesa / HiCDCPlus

HiCDCPlus
15 stars 2 forks source link

Problem with gi_list_write #1

Closed csijcs closed 3 years ago

csijcs commented 3 years ago

Very excited to try your tool! I'm having a little trouble with the gi_list_write command. I get the following error:

Error in gi_list_write(gi_list, fname = output_path) : 
  R_Reprotect: only 3 protected items, can't reprotect index -2

I am able to write on the significant rows by calling: gi_list_write(gi_list,fname=output_path,rows="significant")

However, then when I try to run hicdcdiff on those files I get an error I suppose is related to the fact that there are only the significant regions and not all regions for background: Error in DESeq2::estimateDispersionsFit(dds, fitType = fitType) : all gene-wise dispersion estimates are within 2 orders of magnitude from the minimum value, and so the standard curve fitting techniques will not work. One can instead use the gene-wise estimates as final estimates: dds <- estimateDispersionsGeneEst(dds) dispersions(dds) <- mcols(dds)$dispGeneEst ...then continue with testing using nbinomWaldTest or nbinomLRT

I have tried with multiple different samples and get the same errors. Here is my sessionInfo: `sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.1 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] DESeq2_1.30.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
[4] MatrixGenerics_1.2.0 matrixStats_0.57.0 dplyr_1.0.2
[7] BSgenome.Hsapiens.UCSC.hg38_1.4.3 BSgenome_1.58.0 rtracklayer_1.50.0
[10] Biostrings_2.58.0 XVector_0.30.0 GenomicRanges_1.42.0
[13] GenomeInfoDb_1.26.2 IRanges_2.24.1 S4Vectors_0.28.1
[16] BiocGenerics_0.36.0 HiCDCPlus_0.99.5
`

jaavedm commented 3 years ago

I am also getting some memory related errors

Error in gi_list_write(gi_list, fname = output_path) :
  R_Reprotect: only -7 protected items, can't reprotect index -12

I tried increasing the size of R memory size

cat ~/.Renviron
R_MAX_VSIZE=760Gb

but this does not help.

My sessionInfo() is roughly the same as the original poster.

mervesa commented 3 years ago

Hi @csijcs @jaavedm , I would like to get reproducible data examples that generates the error, maybe some paths to input data files in case these are accessible, or an RDS save of one list item of the gi_list object, etc. that can process gi_list_write(gi_list,fname=output_path,rows="significant") but not gi_list_write(gi_list, fname = output_path). I don't think this is a memory error.

As for the hicdcdiff() error, the likely cause seems to be on the dispersion fit type. I should follow the DESeq2 error and go with gene wise estimates by raising a warning. I would still like to see the input data and the filter_file that generates this error though.

jaavedm commented 3 years ago

Hi @mervesa,

Unfortunately, the error does not happen with just one input file, but rather after processing 2-3 files and then moving onto the next file in the for loop. I've sent you a few files to your email address. Here is my repro example code. I didn't get to the point of calling hicdcdiff() yet.

library("HiCDCPlus")
library("DESeq2")

setwd("/storage/jaavedm/")

#generate features
outdir="/home/jaavedm/";
construct_features(output_path=paste0(outdir,"/hg38_100kb_GATC"),
                   gen="Hsapiens",gen_ver="hg38",
                   sig="GATC",bin_type="Bins-uniform",
                   binsize=100000,
                   chrs=c("chr1"))

#add .hic counts
hicproRoot <- "/storage/jaavedm/myfiles/"
file.bases <- c(
        "H9_H3K27ac_rep1",
        "H9_H3K27ac_rep2",
        "H9_H3K27ac_rep3",
        "H9_H3K27ac_rep4",
        "H9_H3K27ac_rep5"
)

hicfile_paths<-c();
for(i in 1:length(file.bases)) {
        f <- file.bases[i];
        hicfile_paths[i] <- sprintf("%s/%s.allValidPairs", hicproRoot, f);
        if(!file.exists(hicfile_paths[i]))
        {
                stop(sprintf("File does not exist:\n%s\n", hicfile_paths[i]));
        }
}
indexfile<-data.frame()

for(i in 1:length(hicfile_paths)) {
        hicfile_path <- hicfile_paths[i];
        cat(sprintf("\n%d of %d: Starting to process file %s\n\n", i, length(hicfile_paths), hicfile_path));
        output_path<-paste0(outdir,'/', gsub("^(.*[\\/])", "",gsub('.allValidPairs','.txt.gz',hicfile_path)))

        #generate gi_list instance
        gi_list<-generate_bintolen_gi_list(bintolen_path=paste0(outdir,"/hg38_100kb_GATC_bintolen.txt.gz"), gen="Hsapiens",gen_ver="hg38");
        gi_list<-add_hicpro_allvalidpairs_counts(gi_list, allvalidpairs_path = hicfile_path)

        #expand features for modeling
        gi_list<-expand_1D_features(gi_list)

        #run HiC-DC+ on 2 cores
        set.seed(1010) #HiC-DC downsamples rows for modeling
        gi_list<-HiCDCPlus_parallel(gi_list,ssize=0.1, ncore=12)
        for (i in seq(length(gi_list))){
                indexfile<-unique(rbind(indexfile, as.data.frame(gi_list[[i]][gi_list[[i]]$qvalue<=0.05])[c('seqnames1', 'start1','start2')]))
        }

        #write results to a text file
        gi_list_write(gi_list,fname=output_path) # <-- this will fail
        rm(gi_list)
}
csijcs commented 3 years ago

Hello @mervesa (and greetings also to my old friend @jaavedm),

My code is basically the same as @jaavedm, but I also received the error with a single sample and even a single chromosome from the gi_list. I was able to solve the gi_list_write() problem by increasing the buffMB option in the data.table::fwrite call of the source code.

And now the hicdcdiff() function is working properly with the full table of all rows.

mervesa commented 3 years ago

Hi @jaavedm , @csijcs , I have reproduced the warning when I had data.table versions up to 1.13.6, but the warning does not appear when I run on a session with the latest version of data.table (released December 30th, 2020 and involves a data.table::fwrite fix). Could you please try upgrading data.table and see if the warning still appears on your end? I will need to put a version dependency on data.table if it doesn't appear anymore. Thanks a lot!

csijcs commented 3 years ago

The gi_list_write() function is working properly now with data.table_1.13.7 on my end.

jaavedm commented 3 years ago

Unfortunately, the error still occurs with 1.13.7 data.table package. I only found 1.13.6 on CRAN, so I had to update to the latest data.table dev build using data.table::update.dev.pkg()

I tried going back down to an older version (1.13.2) and the problem still occurs.

In the end I managed to get this working by setting data.table threads to 1. By default getDTthreadssaid it was using 56 cores. I set threads using setDTthreads(threads = 1); I'm not sure why this worked, but it fixed the R_Reprotect errors. I suspect there is something about forking and memory handling not behaving properly in multithreaded settings. Also, I don't know how much of a performance impact setting threads = 1 will affect this package.