kdkorthauer / dmrseq

R package for Inference of differentially methylated regions (DMRs) from bisulfite sequencing
MIT License
54 stars 14 forks source link

A Bug in bumphunt(): cause index problems in chr2, chr10, chrX and chrM in dmrseq() results #37

Closed zyqfrog10 closed 4 years ago

zyqfrog10 commented 4 years ago

Please go down to my 4th post for the origin of the problem and suggested solution. My 2nd post down questioned the usage of meanDiff() in the dmrseq vignettes. ----------------------------------Original post------------ Hi @kdkorthauer,

In my recent tries on dmrseq, the performance of meanDiff() has not been stable. Sometimes it throws out an error (for different data) like, Error in .Primitive("[")(c(1, 1, 1, 1, 1, 0.75, 1, 0.5, 1, 1, 0.666666666666667, : subscript out of bounds. I tried to track down the issue and found the indexRanges recorded in regions can exceed the number of CpG loci in bs.

In my data, it looks like the following. There are indices greater than the total number of CpGs. It might be related to how bumphunt track along CpG indices which I haven't gone to the bottom yet. I would be grateful if you can help me identify the problem. Thanks a lot.

> dim(getCoverage(bs.filt, type="M"))
[1] 7642620      15
> indexRanges <- IRanges(start(regions$index),end(regions$index))
> head(sort(as.data.frame(indexRanges)$end,decreasing = TRUE),n=40)
[1] 7783471 7783439 7783109 7782725 7782601 7781587 7781323 7781290 7780476
[10] 7777322 7777213 7777172 7775810 7775307 7774941 7774906 7774880 7774747
[19] 7774738 7774580 7774511 7774312 7773846 7773812 7772719 7772693 7769340
[28] 7767532 7767111 7764399 7762499 7762417 7762394 7761826 7761108 7760764
[37] 7760082 7760077 7759864 7759409
> head(sort(as.data.frame(indexRanges)$start,decreasing = TRUE),n=40)
 [1] 7783467 7783431 7783105 7782717 7782590 7781581 7781319 7781282 7780463
[10] 7777314 7777207 7777166 7775802 7775298 7774932 7774898 7774876 7774740
[19] 7774726 7774573 7774502 7774306 7773839 7773807 7772711 7772685 7769334
[28] 7767524 7767103 7764395 7762492 7762413 7762389 7761820 7761101 7760755
[37] 7760078 7760072 7759859 7759401

The codes to obtain the bs.filt and regions above.

loci.idx <- which(DelayedMatrixStats::rowSums2(getCoverage(bs, type="Cov")==0) == 0)
bs.filt <- bs[loci.idx,]

regions <- dmrseq(bs.filt, 
                      testCovariate = "condition",
                      bpSpan = 500,
                      maxGap = 500, maxPerms = 10,
                      verbose = TRUE)
zyqfrog10 commented 4 years ago

Hello @kdkorthauer ,

I went through the toy sample BS.chr21 again to get some ideas on how meanDiff() should be used. Here is what I did,

> library(dmrseq)
> data(BS.chr21)
> testCovariate <- "CellType"
> regions <- dmrseq(bs=BS.chr21[240001:260000,],cutoff=0.05,testCovariate=testCovariate)

The above lines follow the dmrseq vignettes

> dim(getCoverage(BS.chr21[240001:260000,], type="M"))
[1] 20000     4
> indexRanges <- IRanges(start(regions$index),end(regions$index))
> head(sort(as.data.frame(indexRanges)$end,decreasing = TRUE),n=40)
 [1] 19756 19658 19538 19415 19387 19359 19339 19288 19167 19081 19061 19029
[13] 19002 18965 18758 18719 18683 18616 18585 18474 18443 18363 18311 18179
[25] 18143 18136 17943 17902 17818 17744 17727 17704 17670 17591 17534 17517
[37] 17448 17385 17366 17347

In this case, indexRanges are smaller than the number of queried CpGs.

> rawDiff <- meanDiff(BS.chr21, dmrs=regions,testCovariate="CellType") ## This is the example showed in dmrseq vignettes
> rawDiff2 <- meanDiff(BS.chr21[240001:260000,], dmrs=regions,testCovariate="CellType") ## This is what I think how the function should be used
> head(rawDiff)
[1] -0.477770804  0.017534906 -0.006700583 -0.017879438 -0.386437073
[6] -0.166225704
> head(rawDiff2)
[1] -0.5078933 -0.4961164 -0.1501904 -0.1429204 -0.1203061 -0.3277826
> sum(rawDiff)
[1] -89.28441
> sum(rawDiff)
[1] -27.86785

Clearly, meanDiff() yields different results for BS.chr21 and BS.chr21[240001:260000,], a subset of BS.chr21, with the same regions. It may indicate that dmrseq() does not carry the original CpG indices in the full set to the subset and the regions may only inherit the CpG indices (i.e. the row number?) from the subset. As a consequence, the above rawDiff2 should be used instead of the rawDiff in the vignettes, right?

Well, however, this exploration could not explain the overflow of CpG indices in my previous question above.

Looking forward to your insights. Thank you. YZ

zyqfrog10 commented 4 years ago

Hi @kdkorthauer ,

I think I figured it out. There is a small issue in the bumphunt() as I suspected yesterday. When chrM is present, the order of chromosome iteration is inconsistent with the order in the chrlengths table. The issue happens in the following chunk where the indices can be messed up.

# In the bumpbunt() function
## convert index from within chromosome to overall, 2nd half
 chrs <- unique(seqnames(bs))
 chrlengths <- cumsum(table(as.character(seqnames(bs))))
 for (j in seq_len(length(chrs)-1)) {
        ch <- chrs[[j+1]]
        tab$indexStart[tab$chr %in% ch] <- tab$indexStart[tab$chr %in% ch] +
          chrlengths[names(chrlengths) == chrs[j]]
        tab$indexEnd[tab$chr %in% ch] <- tab$indexEnd[tab$chr %in% ch] +
          chrlengths[names(chrlengths) == chrs[j]] 
      }
    }

Let's examine the above line by line using my data,

> chrs <- unique(seqnames(bs))
> chrs
[1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12
[13] chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX  chrY  chrM 
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
> chrlengths <- cumsum(table(as.character(seqnames(bs))))
> chrlengths
chr1   chr10   chr11   chr12   chr13   chr14   chr15   chr16   chr17   chr18 
 535664  948444 1387748 1723051 2089779 2413708 2721966 3001896 3311830 3581848 
  chr19    chr2    chr3    chr4    chr5    chr6    chr7    chr8    chr9    chrM 
3802492 4367960 4801613 5266091 5751475 6180585 6586386 6975019 7373219 7373678 
   chrX    chrY 
7642586 7642620 
> for (j in seq_len(length(chrs)-1)) {
+        message("iterate",j)
+         print(chrs2[[j]])
+         print(chrs2[[j+1]])
+         }
iterate1
[1] chr1
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr2
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate2
[1] chr2
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr3
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate3
[1] chr3
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr4
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate4
[1] chr4
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr5
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate5
[1] chr5
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr6
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate6
[1] chr6
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr7
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate7
[1] chr7
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr8
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate8
[1] chr8
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr9
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate9
[1] chr9
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr10
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate10
[1] chr10
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr11
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate11
[1] chr11
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr12
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate12
[1] chr12
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr13
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate13
[1] chr13
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr14
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate14
[1] chr14
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr15
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate15
[1] chr15
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr16
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate16
[1] chr16
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr17
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate17
[1] chr17
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr18
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate18
[1] chr18
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chr19
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate19
[1] chr19
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chrX
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate20
[1] chrX
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chrY
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
iterate21
[1] chrY
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM
[1] chrM
22 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrM

As shown here, indices of CpG in chrM will be the last one in the loop to be converted based on the index cumsum in chrY. But this is inconsistent with the order in chrlengths table where chrY is actually the last chromosome. As a consequence, the indices are not correctly assigned to chrX, chrY, and chrM.

I would suggest apply str_sort() in the bumphunt() to correct this problem. Otherwise, I would suggest exclude chrM in the analysis.

chrs <- str_sort(unique(seqnames(bs)))

Hope this helps. Thanks. YZ

kdkorthauer commented 4 years ago

Hi @zyqfrog10,

Thank you for reporting the bug, and for tracking down the source of the problem!

I was able to confirm that the chromsome reordering in the bumphunt() internal function was indeed not kept intact when chrM was included. I've just pushed a fix to GitHub, which should be available in Bioc Devel in the next couple of days. Please try installing the updated version, and let me know if that solves the problem on your dataset.

I'm going to do some more testing, will push the fix to release once I'm confident it solves the problem you were seeing.

Best, Keegan

zyqfrog10 commented 4 years ago

Hi @kdkorthauer ,

Thanks for your quick response. Just want to add to my previous post. In the original script, the indices were messed up not only to chrX and chrM. As shown in the loop print out in my previous post, chr2 indices depend on chr1 but in the chrlengths table chr2 indices should add on chr19, and similarly in chr10, chrX and chrM if there.

I just tried sort(bs) first as suggested in the updated bumphunt(). But it couldn't solve the issue.

> bs.sort <- sort(bs)
> cumsum(table(as.character(seqnames(bs.sort))))
   chr1   chr10   chr11   chr12   chr13   chr14   chr15   chr16   chr17   chr18 
 535664  948444 1387748 1723051 2089779 2413708 2721966 3001896 3311830 3581848 
  chr19    chr2    chr3    chr4    chr5    chr6    chr7    chr8    chr9    chrX 
3802492 4367960 4801613 5266091 5751475 6180585 6586386 6975019 7373219 7642127 
   chrY 
7642161 
> unique(seqnames(bs.sort))
 [1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12
[13] chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX  chrY 
21 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrY

As shown above (chrM already excluded), the sequence of chrs are still not matched after sort(bs). I think the problem is that sort(bs) will sort by chromosome as character while unique() function will order differently. For example,

> c <- c("chr1","chr2","chr10","chr11","chrX","chrY","chrM")
> sort(c)
[1] "chr1"  "chr10" "chr11" "chr2"  "chrM"  "chrX"  "chrY" 
> unique(c)
[1] "chr1"  "chr2"  "chr10" "chr11" "chrX"  "chrY"  "chrM" 

In the bs object,

> unique(seqnames(bs.sort)) ## this is a factor
 [1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12
[13] chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX  chrY 
21 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrY
> sort(unique(seqnames(bs.sort))) ## sort wouldn't make order same as character
 [1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12
[13] chr13 chr14 chr15 chr16 chr17 chr18 chr19 chrX  chrY 
21 Levels: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 ... chrY

A right way to do this is,

> sort(as.character(unique(seqnames(bs.sort))))
 [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
[10] "chr18" "chr19" "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8" 
[19] "chr9"  "chrX"  "chrY"

Or this if want to keep it as a factor

> as.factor(sort(as.character(unique(seqnames(bs.sort)))))
 [1] chr1  chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 
[13] chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX  chrY 
21 Levels: chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 ... chrY

So in the bumphunt() function, the line should be updated is still the chrs <- unique(seqnames(bs)),

chrs <- as.factor(sort(as.character(unique(seqnames(bs)))))

Now I realize this is not an issue of misorder of chrM but more things, where indices in chr2, chr10, chrX and chrM are affected. Looking forward to your insights. Thanks.

Best, YZ

kdkorthauer commented 4 years ago

Hi @zyqfrog10,

Thank you for following up. I see now that the ordering in the cumulative sum table was still not correctly enforced. The issue is that the table function reorders the chromosomes by factor level. I have accounted for this by explicitly ordering the cumulative sum table to match the order that the chromosomes appear in the bsseq object. This fix is introduced in the most recent commit. Please try it out and let me know if it now behaves as expected.

Best, Keegan

zyqfrog10 commented 4 years ago

Hi @kdkorthauer,

Wonderful! I've tested the updated script. It ran through very well. Thank you very much.

Best, YZ