kdkorthauer / dmrseq

R package for Inference of differentially methylated regions (DMRs) from bisulfite sequencing
MIT License
54 stars 14 forks source link

Running block=TRUE crashes during permutations with "Error in .start_as_unnamed_integer(start)" #64

Open mpiersonsmela opened 5 months ago

mpiersonsmela commented 5 months ago

I am consistently getting the following error while running permutations. It doesn't always happen on the same permutation. Example output:

Beginning permutation 1 ...Chromosome chr1: Smoothed (1.26 min). 38 CpG(s) excluded due to zero coverage. 72 regions scored (0.07 min). ...Chromosome chr2: Smoothed (1.87 min). 10 regions scored (0.04 min). ...Chromosome chr3: Smoothed (1.34 min). 6 regions scored (0.03 min). ...Chromosome chr4: Smoothed (0.68 min). 11 regions scored (0.04 min). ...Chromosome chr5: Smoothed (1.65 min). 2 CpG(s) excluded due to zero coverage. Error in .start_as_unnamed_integer(start) : each range must have a start that is < 2^31 and > - 2^31 Calls: dmrseq ... .new_IRanges_from_start_end -> .start_as_unnamed_integer In addition: Warning messages: 1: In FUN(X[[i]], ...) : no non-missing arguments to min; returning Inf 2: In FUN(X[[i]], ...) : no non-missing arguments to max; returning -Inf Execution halted

I am using the latest version of all packages.

mpiersonsmela commented 5 months ago

The input command was: naive_new_reg <- dmrseq(bs = sort(oog_naive_new), testCovariate=testCovariate, block=TRUE, minInSpan=150, bpSpan=5000, maxGapSmooth=10000)

I previously ran it with default values for minInSpan, bpSpan, and maxGapSmooth but it gave a warning to increase them. (Also, what values are recommended for running with block=TRUE?)

kdkorthauer commented 5 months ago

Hi @mpiersonsmela,

Thanks for reporting this issue. I'm not able to determine the root cause from the output alone. Can you let me know what (if any) filtering you've done on the input object (e.g. removing loci with low coverage across samples)? If you're able to share a small portion of your input data that reproduces the output (e.g. chromosome 5 in the example above - you can use set.seed() to provide a reproducible example), that would greatly help. You may upload the small example input here.

Regarding your question about parameter settings when using block=TRUE, you may refer to the documentation section here which has some guidance on how to interpret the parameters. The bpSpan parameter governs the size of the window over which methylation values are smoothed, as long as they have at least minInSpan covered loci that are spaced apart my a max of maxGapSmooth. Your settings seem like a reasonable starting point, but in general I'd recommend examining the output candidate regions after scaling up with blocks - if you still see many candidate regions that are relatively close together, where you'd otherwise interpret them as one contiguous region, then I'd recommend to scale up all of those parameters further to make sure the smoothing is at a more relevant scale. Hope that helps!

mpiersonsmela commented 5 months ago

OK, I'll upload the example data tomorrow. Regarding the filtering, I'm removing all the loci that have zero reads for all samples in a group. This filtering strategy worked OK when not running block=TRUE.

This is the filtering code I'm using:

#Read in packages
suppressPackageStartipMessages(library(tidyverse))
suppressPackageStartipMessages(library(magrittr))
suppressPackageStartipMessages(library(bsseq))
suppressPackageStartipMessages(library(dmrseq))
suppressPackageStartipMessages(library(BiocParallel))
suppressPackageStartipMessages(library(argparse))
register(MulticoreParam(16)) 

# Make sure metadata has the following structure:
#    Files   | Group
# -----------------------
# <file_name>| <int 1 or 2>

args = commandArgs(trailingOnly=TRUE)

metadata_path <- args[1]

# Read in data
metadata <- read.csv(metadata_path)

# get file names
cov_files <- metadata %>%
  pull(Files)

# make BSseq object
bismark_cov <- read.bismark(files = cov_files, rmZeroCov = TRUE)

# trt = vector of condition labels for each sample
trt <- metadata %>%
  pull(Group)
pData(bismark_cov)$Condition <- trt

# filter for loci that are not 0 in all samples of a condition
## get group 1 rows that are not 0 for the group
sample_1 <- which(pData(bismark_cov)$Condition == 1)
cov_temp_1 <- bismark_cov[, sample_1]
loci_1_keep <- which(DelayedMatrixStats::rowSums2(getCoverage(cov_temp_1, type="Cov")) > 0)

## get group 2 rows that are not 0 for the group
sample_1 <- which(pData(bismark_cov)$Condition == 2)
cov_temp_2 <- bismark_cov[, sample_2]
loci_2_keep <- which(DelayedMatrixStats::rowSums2(getCoverage(cov_temp_2, type="Cov")) > 0)

# only keep loci that have at least 1 read in both sample groups 
loci.idx <- sort(unique(intersect(loci_1_keep, loci_2_keep)))
bismark_cov_filt <- bismark_cov[loci.idx,]

testCovariate <- "Condition"
bismark_cov_dmr <- dmrseq(bs = sort(bismark_cov_filt),
                          testCovariate = testCovariate, minInSpan=150, bpSpan=5000, maxGapSmooth=10000, block=TRUE)
mpiersonsmela commented 5 months ago

OK, I uploaded the files to your Dropbox link. I wasn't able to figure out how to take a small portion that replicates the results, so I just uploaded the whole thing.

kdkorthauer commented 4 months ago

Thanks again for reporting this issue and for sending me your files to debug. It took me a while to track it down, but I found the implicated line that didn't properly filter loci with zero coverage in all samples of one permuted condition in the permutations during the construction of candidate blocks. I suspect I never encountered this before because I typically use more stringent filters for coverage. I pushed a fix, and was able to run your code successfully. You can access this version on the devel branch of GitHub immediately (BiocManager::install_github("kdkorthauer/dmrseq", "devel"), and it will also be available in Bioconductor Devel (version 3.19) in either today or Monday's build. Assuming it passes all checks, I'll port it over to the current Bioc release version (version 3.18) early next week.

As a side note, I noticed that the number of loci in your dataset and the high number of consecutive genomic coordinates look like you may be including loci on both strands separately. You may consider collapsing strands. Happy to provide some more information on that if you're interested.

Thanks again for helping me to improve the software.

mpiersonsmela commented 4 months ago

Thanks! I'll follow up with you by email about collapsing strands.

mpiersonsmela commented 4 months ago

Regarding collapsing strands, does this basically mean treating each CpG site as one genomic coordinate? I am not sure this would be appropriate for me, since I expect a substantial amount of hemimethylated sites in my data. But I am pretty new to methylation analysis so I would appreciate it if you could explain more. Thanks for all your help.

On Sat, Feb 17, 2024 at 1:07 PM Keegan Korthauer @.***> wrote:

Thanks again for reporting this issue and for sending me your files to debug. It took me a while to track it down, but I found the implicated line that didn't properly filter loci with zero coverage in all samples of one permuted condition in the permutations during the construction of candidate blocks. I suspect I never encountered this before because I typically use more stringent filters for coverage. I pushed a fix, and was able to run your code successfully. You can access this version on the devel branch of GitHub immediately (BiocManager::install_github("kdkorthauer/dmrseq", "devel"), and it will also be available in Bioconductor Devel (version 3.19) in either today or Monday's build. Assuming it passes all checks, I'll port it over to the current Bioc release version (version 3.18) early next week.

As a side note, I noticed that the number of loci in your dataset and the high number of consecutive genomic coordinates look like you may be including loci on both strands separately. You may consider collapsing strands. Happy to provide some more information on that if you're interested.

Thanks again for helping me to improve the software.

— Reply to this email directly, view it on GitHub https://github.com/kdkorthauer/dmrseq/issues/64#issuecomment-1950272338, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO3OAGJ5WN4MHZMPPGGWW63YUDWWTAVCNFSM6AAAAABCBIHFAWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJQGI3TEMZTHA . You are receiving this because you were mentioned.Message ID: @.***>

kdkorthauer commented 4 months ago

Yes, exactly - collapsing would treat each CpG as one observation. If you are interested in hemimethylation, then it would not be appropriate to collapse.