nsheff / LOLA

Locus Overlap Analysis: Enrichment of Genomic Ranges
http://code.databio.org/LOLA

0-based bed files in database are not properly translated to 1-based regions for LOLA #15

Closed ktrns closed 7 years ago

ktrns commented 7 years ago

Dear Nathan,

I started using LOLA a few days ago and have run into an issue that I hope you can help with. When I load a database of bed files with loadRegionDB, the 0-based bed coordinates are not converted to 1-based regions in LOLA. As a result, I get a few overlaps between my userSets and the database that aren't correct.

For example, my bed file contains the following lines:

scaffold_19 23641645    23642149
scaffold_13 24500587    24501311
scaffold_0  23378037    23378626

After loading the database, this looks like:

         seqnames               ranges strand
    1 scaffold_19 [23641645, 23642149]      *
    2 scaffold_13 [24500587, 24501311]      *
    3  scaffold_0 [23378037, 23378626]      *
    ...

when actually the left coordinate should be +1. Reading my userSet and userUniverse with makeGRangesFromDataFrame(df=userSet, starts.in.df.are.0based=TRUE) works. I am loading the latest LOLA version, which I just downloaded from GitHub, since I saw you were working on this issue 20 days ago. Can you comment? Am I doing anything wrong? Thank you in advance!
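To illustrate why a missing +1 on the start can produce the spurious overlaps described above, here is a minimal base-R sketch. The helper overlapsClosed and all coordinates in it are made up for illustration; they are not part of LOLA and are not taken from the database in this report.

```r
# Hypothetical helper: TRUE when two 1-based, closed intervals on the same
# sequence share at least one position.
overlapsClosed <- function(startA, endA, startB, endB) {
  startA <= endB & startB <= endA
}

# Made-up coordinates for illustration only.
# A BED line "scaffold_0 100 200" covers positions 101..200 in 1-based terms.
bedStart <- 100L
bedEnd   <- 200L

# A 1-based, closed query region that ends exactly at position 100.
queryStart <- 50L
queryEnd   <- 100L

# Start left unshifted: position 100 is wrongly treated as part of the region.
unshifted <- overlapsClosed(bedStart, bedEnd, queryStart, queryEnd)

# Start shifted by +1: the regions are adjacent but do not overlap.
shifted <- overlapsClosed(bedStart + 1L, bedEnd, queryStart, queryEnd)

print(c(unshifted = unshifted, shifted = shifted))
```

Only regions that touch at exactly one boundary position are affected, which would be consistent with seeing just a few incorrect overlaps rather than many.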

nsheff commented 7 years ago

I will look into this. See #14 -- Are you using a custom database? Is it cached?

Maybe the database isn't correctly going through readBed.

ktrns commented 7 years ago

Yes, I am using a custom database. Since I am getting this warning, I don't think the database is cached:

You don't have simpleCache installed, so you won't be able to cache the regionDB after reading it in. Install simpleCache to speed up later database loading.

nsheff commented 7 years ago

I cannot reproduce this error. Make sure you're using the latest version of LOLA (install with devtools::install_github("nsheff/LOLA")).

If you can produce a reproducible example, I can look at it again.

In this example:

dbPath = system.file("extdata", "hg19", package="LOLA")
regionDB = loadRegionDB(dbLocation=dbPath)

The database in R looks like:

      1     chr1     [ 28736,  29810]      *
      2     chr1     [135125, 135563]      *
      3     chr1     [327791, 328229]      *

while the file looks like:

chr1    28735   29810
chr1    135124  135563
chr1    327790  328229

So, it is in fact correctly adding 1 to the left coordinate. If your version of LOLA is not adding 1 to the left side, make sure you're using the latest version.
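The conversion shown in this example can be sketched in base R as follows. This is only an illustration of the coordinate arithmetic, not LOLA's actual readBed code; the function name bedTo1Based and the column names chrom, chromStart and chromEnd are assumptions made for the sketch.

```r
# Hypothetical helper (not part of LOLA): convert 0-based, half-open BED
# intervals to 1-based, closed intervals by shifting only the start.
bedTo1Based <- function(bed) {
  data.frame(
    seqnames = bed$chrom,
    start    = bed$chromStart + 1L,  # 0-based start -> 1-based start
    end      = bed$chromEnd,         # half-open end equals the closed 1-based end
    stringsAsFactors = FALSE
  )
}

# The three regions from the hg19 example file quoted above.
bed <- data.frame(
  chrom      = c("chr1", "chr1", "chr1"),
  chromStart = c(28735L, 135124L, 327790L),
  chromEnd   = c(29810L, 135563L, 328229L),
  stringsAsFactors = FALSE
)

regions <- bedTo1Based(bed)
print(regions)
```

The end coordinate is left alone because the BED end is exclusive, so the region width (end - start in BED, end - start + 1 after conversion) stays the same.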

EDIT: fix example code

ktrns commented 7 years ago

Hi Nathan, could you please try these lines of code for me:

library(LOLA)
dbPath <- "fakeDatabase"
regionDB <- loadRegionDB(dbPath)
regionDB

with the attached fakeDatabase? I think I am using the latest version, so I don't understand it.

The file is:

scaffold_19 23641645    23642149
scaffold_13 24500587    24501311
scaffold_0  23378037    23378626
...

And in R it looks like:

         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
    1 scaffold_19 [23641645, 23642149]      *
    2 scaffold_13 [24500587, 24501311]      *
    3  scaffold_0 [23378037, 23378626]      *

Thanks, Katrin

fakeDatabase.zip

nsheff commented 7 years ago

Hi Katrin, what does your sessionInfo say? You're likely not using the latest version. With your database, I get:

         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
    1 scaffold_19 [23641646, 23642149]      *
    2 scaffold_13 [24500588, 24501311]      *
    3  scaffold_0 [23378038, 23378626]      *

ktrns commented 7 years ago

This is my sessionInfo:

> sessionInfo("LOLA")
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
[1] C

attached base packages:
character(0)

other attached packages:
[1] LOLA_1.5.0

loaded via a namespace (and not attached):
 [1] zlibbioc_1.18.0      IRanges_2.6.1        graphics_3.3.1      
 [4] XVector_0.12.1       parallel_3.3.1       GenomicRanges_1.24.3
 [7] utils_3.3.1          grDevices_3.3.1      stats_3.3.1         
[10] datasets_3.3.1       S4Vectors_0.10.3     data.table_1.9.6    
[13] methods_3.3.1        BiocGenerics_0.18.0  chron_2.3-47        
[16] GenomeInfoDb_1.8.7   stats4_3.3.1         base_3.3.1

nsheff commented 7 years ago

You're using the LOLA version from Bioconductor. The correction is on GitHub and hasn't been put into Bioconductor yet. Please see my previous comment:

I cannot reproduce this error. Make sure you're using the latest version of LOLA (install with devtools::install_github("nsheff/LOLA")).
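One way to check which build of a package is actually installed is sketched below. The helper describeInstall is hypothetical. It also assumes that installs made through devtools or remotes record a RemoteType field in the package DESCRIPTION, while installs from other sources may not, so a missing field is only a hint and not proof of a Bioconductor install.

```r
# Hypothetical helper: report the installed version of a package and, when
# the DESCRIPTION records it, the kind of remote it was installed from.
describeInstall <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(list(installed = FALSE,
                version   = NA_character_,
                source    = NA_character_))
  }
  desc   <- utils::packageDescription(pkg)
  remote <- desc[["RemoteType"]]  # assumption: set by devtools/remotes installs
  list(
    installed = TRUE,
    version   = as.character(utils::packageVersion(pkg)),
    source    = if (is.null(remote)) "no RemoteType recorded" else remote
  )
}

# Works whether or not LOLA is installed.
str(describeInstall("LOLA"))
```

If the reported source is not a GitHub remote, reinstalling with devtools::install_github("nsheff/LOLA") and restarting R before reloading the database is the step suggested in this thread.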

ktrns commented 7 years ago

Thank you, it is working now.