Closed rcavalcante closed 8 years ago
Did you check whether there are data in that range? Yongseok
Yong Seok Park, Ph.D. Assistant Professor, Department of Biostatistics Graduate School of Public Health University of Pittsburgh 306 Parran Hall 130 DeSoto Street Pittsburgh, PA 15261 Telephone: 412-624-3028tel:412-624-3028 Facsimile: 412-624-2183tel:412-624-2183 Email: yongpark@pitt.edu
From: Raymond Cavalcante [mailto:notifications@github.com] Sent: Thursday, December 17, 2015 9:57 AM To: sartorlab/methylSig Subject: [methylSig] chrM extends way past chromosome limit on tiling (#12)
chr start end name pval strand diff_meth mu1
(chr) (int) (int) (chr) (dbl) (chr) (dbl) (dbl)
1 chrM 191027979 191028078 hyper 0.003092426 * 58.201220 79.99999
2 chrM 191028079 191028178 none 0.590062248 * -7.093701 81.49841
3 chrM 191042379 191042478 none 0.059321331 * 20.840816 92.85577
4 chrM 191042479 191042578 none 0.721632329 * 2.067847 85.09543
5 chrM 191042579 191042678 none 0.085214089 * 4.085556 91.55898
6 chrM 191043179 191043278 hyper 0.001262003 * 8.510410 88.31157
Not sure how this happens, at the moment.
— Reply to this email directly or view it on GitHubhttps://github.com/sartorlab/methylSig/issues/12.
The underlying data is exactly the same for the two runs. Below are all 8 chrM
sites reported before I made the v0.4.0
changes.
chrM 3583 3682 * 1 1 0 0 1e+06 4 0 0
chrM 3683 3782 * 0.469399486545457 0.689681821494417 -0.173534869343378 0.637307707220316 1e+06 4 0 0.173534869343378
chrM 6883 6982 * 1 1 0 0 1e+06 4 0 0
chrM 10583 10682 * 1 1 0 0 1e+06 4 0 0
chrM 11583 11682 * 1 1 0 0 1e+06 4 0 0
chrM 12083 12182 * 1 1 0 0 1e+06 4 0 0
chrM 13783 13882 * 1 1 0 0 1e+06 4 0 0
chrM 16383 16482 * 0.090155962573807 0.414411068945746 0.258474580502315 4.94858437404037 1e+06 4 0.301976308965407 0.0435017284630926
Here is the change I made to the tiling function:
# Need to deal with the very last end
data.MAXBASE = uniqueStartList%%MAXBASE + win.size - 1
data.MAXBASE[length(data.MAXBASE)] = data.MAXBASE[length(data.MAXBASE)] - win.size + 2
The expression uniqueStartList%%MAXBASE + win.size - 1
was originally used for data.end
in methylSig.newData()
, so changing the very last entry to make it shorter and using data.MAXBASE
for data.end
shouldn't be causing this problem.
I made some test data and a test designed to check that the last tiled region was truncated to avoid going over a chromosome boundary and it worked, and didn't exhibit the behavior above.
And, no, looking at the input for methylSig, there is no data in that range. Moreover, among all the 4 input files, there are only 339 unique chrM
sites, whereas the result from v0.4.0
gives 13,919.
I think I've pinned down the problem to a portion of R/read.R
(not the tiling function) that filters based on minCount
and maxCount
.
Quick context: To get minCount
to behave as expected, I moved count filtering to after all the files are read so that the destranding aggregation would allow sites to be saved when their combination was above minCount
.
I added the block below to change, what I thought, were all the relevant columns:
# Filter by minCount and maxCount while also modifying chr, uniqueLoc, strand, numCs, numTs
# This is in the idiom of what's been implemented
countInvalidList = NULL
for(fileIndex in 1:n.files) {
# Record the countInvalidList
countInvalidList = unique( c(countInvalidList, which(
((coverage[, fileIndex] != 0) & (coverage[, fileIndex] < minCount)) |
((coverage[, fileIndex] != 0) & (coverage[, fileIndex] > maxCount)) ) ) )
}
uniqueChr = uniqueChr[-countInvalidList]
uniqueLoc = uniqueLoc[-countInvalidList]
strand = strand[-countInvalidList]
coverage = coverage[-countInvalidList, ]
numCs = numCs[-countInvalidList, ]
numTs = numTs[-countInvalidList, ]
For future reference, in case I need to do this again. I created a new revert
branch off the current master
and reverted to a commit from before I made all the changes that seem to cause the problem:
cd ~/latte/methylSig
git checkout master
git branch revert
git checkout revert
git reset e2fdfd3
git reset --soft HEAD@{1}
git commit -m "Revert to e2fdfd3"
git reset --hard
It's looking like the problem is my trying to filter out uniqueChr
. I realize that I would have seen that problem in the unit test if the first row was in the countInvalidList
. Then the only uniqueChr
in the test data would have been removed, and an error would have likely been thrown.
Fixed with commit aa0eb22edf33d6ba182860f9b31fd669c8e64dfe.
Not sure how this happens, at the moment.