nsheff / LOLA

Locus Overlap Analysis: Enrichment of Genomic Ranges
http://code.databio.org/LOLA

Inclusion of mm10 JASPAR prediction track into LOLA JASPAR #28

Open holgerbrandl opened 5 years ago

holgerbrandl commented 5 years ago

Would it be possible to integrate the recently published JASPAR binding predictions for mm10 into LOLA JASPAR? See http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/mm10/ for the data, http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/ for the method details, and http://jaspar.genereg.net/genome-tracks/ for a general overview of the new track feature of JASPAR.

Since it's already a region dataset (a BED file), converting it into the LOLA db format may be straightforward.

holgerbrandl commented 5 years ago

Any news? Either a positive or a negative answer would be helpful to me.

nsheff commented 5 years ago

I am looking into it now.

nsheff commented 5 years ago

Hi @holgerbrandl, I've looked into it; there are 9 billion regions in that file. I'm not sure LOLA will be able to handle the file in its entirety. Do you have any ideas for reducing it?

If you want, you can give it a try. This code will split the file into individual BED files, one per factor:

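# Download the genome-wide JASPAR 2018 track for mm10
wget http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/mm10/JASPAR2018_mm10_all_chr.bed.gz
mkdir -p mm10/jaspar2018/regions
# Replace ':', '.', '(' and ')' with '_' so factor names are safe file names, then write each line to a per-factor file keyed on column 4
time zcat JASPAR2018_mm10_all_chr.bed.gz | sed 's/[:.()]/_/g' | sed 's/__/_/g' | awk '{print $0 > "mm10/jaspar2018/regions/"$4".bed"}'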

LOLA should be able to load them (see here: http://databio.org/regiondb). I'm trying this now. But I think if you don't have a lot of memory, that's probably going to be problematic...
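
The split step above can be tried on a tiny sample before committing to the full download. The following is a minimal sketch, not the exact command from this thread: the four sample rows, the `demo/` directory, and the assumed 7-column tab-separated layout with the factor name in column 4 are all made up for illustration. It also differs in one deliberate way: it sanitizes only column 4 inside awk, so dots or colons in any other column are left untouched.

```shell
#!/bin/sh
# Sketch only: sample rows and the demo/ directory are hypothetical.
set -eu

outdir="demo/regions"
rm -rf "$outdir"
mkdir -p "$outdir"

# Assumed layout: chrom, start, end, factor name, then three more columns.
printf 'chr1\t100\t110\tCTCF\t850\t450\t+\n'           >  demo/sample.bed
printf 'chr1\t200\t212\tAhr::Arnt\t400\t300\t-\n'      >> demo/sample.bed
printf 'chr2\t300\t310\tCTCF\t900\t500\t+\n'           >> demo/sample.bed
printf 'chr2\t400\t415\tTFAP2A(var.2)\t420\t310\t+\n'  >> demo/sample.bed

awk -v dir="$outdir" 'BEGIN { FS = OFS = "\t" }
{
    name = $4
    gsub(/[:.()]/, "_", name)   # same characters the sed step replaces
    gsub(/__/, "_", name)       # same as the second sed step
    $4 = name                   # keep the sanitized name in the record
    print > (dir "/" name ".bed")
}' demo/sample.bed

ls "$outdir"
```

On the sample this should leave one file per factor in `demo/regions`, with both CTCF rows in the same file. On the real track the number of distinct factor names is much larger, so whether a given awk implementation can keep that many output files open at once is something to check on your own system.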