sartorlab / methylSig

R package for DNA methylation analysis
17 stars 5 forks source link

chrM extends way past chromosome limit on tiling #12

Closed rcavalcante closed 8 years ago

rcavalcante commented 8 years ago
    chr     start       end  name        pval strand diff_meth      mu1
  (chr)     (int)     (int) (chr)       (dbl)  (chr)     (dbl)    (dbl)
1  chrM 191027979 191028078 hyper 0.003092426      * 58.201220 79.99999
2  chrM 191028079 191028178  none 0.590062248      * -7.093701 81.49841
3  chrM 191042379 191042478  none 0.059321331      * 20.840816 92.85577
4  chrM 191042479 191042578  none 0.721632329      *  2.067847 85.09543
5  chrM 191042579 191042678  none 0.085214089      *  4.085556 91.55898
6  chrM 191043179 191043278 hyper 0.001262003      *  8.510410 88.31157

Not sure how this happens, at the moment.

yongparkpitt commented 8 years ago

Did you check whether there are data in that range? Yongseok


Yong Seok Park, Ph.D. Assistant Professor, Department of Biostatistics Graduate School of Public Health University of Pittsburgh 306 Parran Hall 130 DeSoto Street Pittsburgh, PA 15261 Telephone: 412-624-3028tel:412-624-3028 Facsimile: 412-624-2183tel:412-624-2183 Email: yongpark@pitt.edu

From: Raymond Cavalcante [mailto:notifications@github.com] Sent: Thursday, December 17, 2015 9:57 AM To: sartorlab/methylSig Subject: [methylSig] chrM extends way past chromosome limit on tiling (#12)

chr     start       end  name        pval strand diff_meth      mu1

(chr) (int) (int) (chr) (dbl) (chr) (dbl) (dbl)

1 chrM 191027979 191028078 hyper 0.003092426 * 58.201220 79.99999

2 chrM 191028079 191028178 none 0.590062248 * -7.093701 81.49841

3 chrM 191042379 191042478 none 0.059321331 * 20.840816 92.85577

4 chrM 191042479 191042578 none 0.721632329 * 2.067847 85.09543

5 chrM 191042579 191042678 none 0.085214089 * 4.085556 91.55898

6 chrM 191043179 191043278 hyper 0.001262003 * 8.510410 88.31157

Not sure how this happens, at the moment.

— Reply to this email directly or view it on GitHubhttps://github.com/sartorlab/methylSig/issues/12.

rcavalcante commented 8 years ago

The underlying data is exactly the same for the two runs. Below are all 8 chrM sites reported before I made the v0.4.0 changes.

chrM    3583    3682    *   1   1   0   0   1e+06   4   0   0
chrM    3683    3782    *   0.469399486545457   0.689681821494417   -0.173534869343378  0.637307707220316   1e+06   4   0   0.173534869343378
chrM    6883    6982    *   1   1   0   0   1e+06   4   0   0
chrM    10583   10682   *   1   1   0   0   1e+06   4   0   0
chrM    11583   11682   *   1   1   0   0   1e+06   4   0   0
chrM    12083   12182   *   1   1   0   0   1e+06   4   0   0
chrM    13783   13882   *   1   1   0   0   1e+06   4   0   0
chrM    16383   16482   *   0.090155962573807   0.414411068945746   0.258474580502315   4.94858437404037    1e+06   4   0.301976308965407   0.0435017284630926

Here is the change I made to the tiling function:

# Need to deal with the very last end
data.MAXBASE = uniqueStartList%%MAXBASE + win.size - 1
data.MAXBASE[length(data.MAXBASE)] = data.MAXBASE[length(data.MAXBASE)] - win.size + 2

The expression uniqueStartList%%MAXBASE + win.size - 1 was originally used for data.end in methylSig.newData(), so changing the very last entry to make it shorter and using data.MAXBASE for data.end shouldn't be causing this problem.

I made some test data and a test designed to check that the last tiled region was truncated to avoid going over a chromosome boundary and it worked, and didn't exhibit the behavior above.

rcavalcante commented 8 years ago

And, no, looking at the input for methylSig, there is no data in that range. Moreover, among all the 4 input files, there are only 339 unique chrM sites, whereas the result from v0.4.0 gives 13,919.

rcavalcante commented 8 years ago

I think I've pinned down the problem to a portion of R/read.R (not the tiling function) that filters based on minCount and maxCount.

Quick context: To get minCount to behave as expected, I moved count filtering to after all the files are read so that the destranding aggregation would allow sites to be saved when their combination was above minCount.

I added the block below to change, what I thought, were all the relevant columns:

# Filter by minCount and maxCount while also modifying chr, uniqueLoc, strand, numCs, numTs
# This is in the idiom of what's been implemented
countInvalidList = NULL
for(fileIndex in 1:n.files) {
  # Record the countInvalidList
  countInvalidList = unique( c(countInvalidList, which(
    ((coverage[, fileIndex] != 0) & (coverage[, fileIndex] < minCount)) |
    ((coverage[, fileIndex] != 0) & (coverage[, fileIndex] > maxCount)) ) ) )
}

uniqueChr = uniqueChr[-countInvalidList]
uniqueLoc = uniqueLoc[-countInvalidList]
strand = strand[-countInvalidList]
coverage = coverage[-countInvalidList, ]
numCs = numCs[-countInvalidList, ]
numTs = numTs[-countInvalidList, ]
rcavalcante commented 8 years ago

For future reference, in case I need to do this again. I created a new revert branch off the current master and reverted to a commit from before I made all the changes that seem to cause the problem:

cd ~/latte/methylSig
git checkout master
git branch revert
git checkout revert
git reset e2fdfd3
git reset --soft HEAD@{1}
git commit -m "Revert to e2fdfd3"
git reset --hard
rcavalcante commented 8 years ago

It's looking like the problem is my trying to filter out uniqueChr. I realize that I would have seen that problem in the unit test if the first row was in the countInvalidList. Then the only uniqueChr in the test data would have been removed, and an error would have likely been thrown.

rcavalcante commented 8 years ago

Fixed with commit aa0eb22edf33d6ba182860f9b31fd669c8e64dfe.