nsheff / LOLA

Locus Overlap Analysis: Enrichment of Genomic Ranges
http://code.databio.org/LOLA
70 stars 19 forks source link

readBed is not consistent with import.bed #14

Closed wcstcyx closed 7 years ago

wcstcyx commented 7 years ago

Hi Nathan,

I installed LOLA 1.4.0 in R 3.3.2 and Bioconductor 3.4 by biocLite("LOLA"). It seems that readBed() and loadRegionDB() would create GRanges without adding 1 to the left sites in the bed files, leading to not so right 1-based GRanges which would latter give out not so right results.

library("LOLA")
bed <- system.file("extdata", "hg19", "ucsc_example", "regions", "cpgIslandExt.bed", package="LOLA")
cat(readLines(bed, n = 5), sep = "\n")
## chr1    28735   29810
## chr1    135124  135563
## chr1    327790  328229
## chr1    437151  438164
## chr1    449273  450544
readBed(bed)
##  GRanges object with 28691 ranges and 0 metadata columns:
##        seqnames               ranges strand
##           <Rle>            <IRanges>  <Rle>
##      1     chr1     [ 28735,  29810]      *
##      2     chr1     [135124, 135563]      *
##      3     chr1     [327790, 328229]      *
##      4     chr1     [437151, 438164]      *
##      5     chr1     [449273, 450544]      *
##    ...      ...                  ...    ...
##  28687     chrY [27610115, 27611088]      *
##  28688     chrY [28555535, 28555932]      *
##  28689     chrY [28773315, 28773544]      *
##  28690     chrY [59213794, 59214183]      *
##  28691     chrY [59349266, 59349574]      *
##  -------
##  seqinfo: 69 sequences from an unspecified genome; no seqlengths
library("rtracklayer")
import(bed)
##  GRanges object with 28691 ranges and 0 metadata columns:
##          seqnames               ranges strand
##             <Rle>            <IRanges>  <Rle>
##      [1]     chr1     [ 28736,  29810]      *
##      [2]     chr1     [135125, 135563]      *
##      [3]     chr1     [327791, 328229]      *
##      [4]     chr1     [437152, 438164]      *
##      [5]     chr1     [449274, 450544]      *
##      ...      ...                  ...    ...
##  [28687]     chrY [27610116, 27611088]      *
##  [28688]     chrY [28555536, 28555932]      *
##  [28689]     chrY [28773316, 28773544]      *
##  [28690]     chrY [59213795, 59214183]      *
##  [28691]     chrY [59349267, 59349574]      *
##  -------
##  seqinfo: 69 sequences from an unspecified genome; no seqlengths

I find that runLOLA() invokes countOverlaps(). However, countOverlaps() is based on 1-based closed interval. So readBed() would in fact add one more base on the left side of all the regions. Even it would have small impact on the result, it maybe better to be fixed. :)

nsheff commented 7 years ago

You are right. BED format is 0-based, but countOverlaps is 1-based, so import.bed is doing it correctly and readBed is not. As you say, for LOLA probably not a big deal, but I will fix this. Thanks!