Open holgerbrandl opened 5 years ago
Any news? Either positive or negative would be helpful to me.
I am looking into it now.
Hi @holgerbrandl, I've looked into it; there are 9 billion regions in that file. I'm not sure LOLA will be able to do that in its entirety. Do you have any ideas for reducing that?
If you want you can give it a try. This code will split the file into individual bed files for each factor:
wget http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/mm10/JASPAR2018_mm10_all_chr.bed.gz
mkdir -p mm10/jaspar2018/regions
time zcat JASPAR2018_mm10_all_chr.bed.gz | sed 's/[:.()]/_/g' | sed 's/__/_/g' | awk '{print $0 > ("mm10/jaspar2018/regions/" $4 ".bed")}'
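Before running this on the full file, it may help to check the split logic on a tiny sample. The following is only a sketch under a few assumptions: the sample coordinates and factor names are made up, the input is assumed to be tab-separated with the factor name in column 4, and, unlike the one-liner above, the name clean-up here is applied to column 4 only so that other columns are left untouched.

```shell
# Scratch directory for the demo (hypothetical; any empty directory works)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/mm10/jaspar2018/regions"

# Three made-up BED lines: two factors, one with "::" in its name
printf 'chr1\t100\t110\tCTCF\t450\t+\nchr1\t200\t212\tCTCF\t500\t-\nchr2\t300\t310\tMAX::MYC\t470\t+\n' \
  > "$demo_dir/sample.bed"

# Sanitize the factor name (column 4) and write each line to <name>.bed
awk -F'\t' -v OFS='\t' -v outdir="$demo_dir/mm10/jaspar2018/regions" '{
  gsub(/[:.()]/, "_", $4)   # replace : . ( ) with underscores
  gsub(/__+/, "_", $4)      # collapse runs of underscores
  print > (outdir "/" $4 ".bed")
}' "$demo_dir/sample.bed"

# List the per-factor files that were created
ls "$demo_dir/mm10/jaspar2018/regions"
```

On the sample this should leave one file per factor (CTCF.bed and MAX_MYC.bed). Note that some awk implementations limit how many output files can be open at once, so the full file with many factors may behave differently from this small test.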
LOLA should be able to load them (see here: http://databio.org/regiondb). I'm trying this now, but if you don't have a lot of memory, that's probably going to be problematic...
Would it be possible to integrate the recently published JASPAR binding predictions for mm10 into the LOLA JASPAR database? See http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/mm10/ for the data, http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/ for the method details, and http://jaspar.genereg.net/genome-tracks/ for a general overview of the new track feature of JASPAR.
Since it's already a region dataset (BED file), a conversion into the LOLA db format may be straightforward.