Open julie-sullivan opened 4 years ago
At least in 98, there is a file with all the Motifs called _Homo_sapiens.GRCh38.motiffeatures.gff.gz:
ftp://ftp.ensembl.org/pub/release-98/regulation/homo_sapiens/MotifFeatures/
the new file is very large, 10 GB.
Plan A: load these data but filter to only get the "good ones" Plan B: use MOODS to generate the file ourselves.
http://www.ensembl.info/2018/10/15/new-ensembl-motif-features/
I tried to run MOODS myself
~/git/MOODS/python/scripts (master) $ python moods-dna.py -m ~/data/jaspar/*.pfm -s ~/data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -p 0.0001
Segmentation fault (core dumped)
Here's a discussion of the score: https://github.com/jhkorhonen/MOODS/wiki/Brief-theoretical-introduction
JASPAR file:
>MA0642.1 EN2
A [ 402 212 10 1648 1648 0 0 1648 350 181 ]
C [ 564 681 1024 297 105 0 420 0 491 708 ]
G [ 456 412 35 164 7 0 203 45 630 476 ]
T [ 225 343 624 53 0 1648 1648 0 177 283 ]
Example file:
0 3 79 40 66 48 65 11 65 0
94 75 4 3 1 2 5 2 3 3
1 0 3 4 1 0 5 3 28 88
2 19 11 50 29 47 22 81 1 6
maybe that's my problem?
If I format the JASPAR file, here is a snippet:
./python/scripts/moods-dna.py -m /home/julie/data/matrices/*.pfm -s /home/julie/data/Homo_sapiens.GRCh38.dna.chromosome.22.fa -p 0.0001
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10511859,-,7.53854401973,AAGCCACAGTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10514271,-,7.39889409503,AAAACACAAAC,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10514592,-,7.5969371834,GAACCACAACG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10519785,+,7.45378788769,AACTGTGGTTG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10521540,+,8.15256600968,CTTTGTGGTTC,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10525467,+,8.89032049779,TTTTGTGGCTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10527203,+,10.0985299135,TTTTGTGGTTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10527462,+,8.28082852686,CTCTGTGGCCT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10532985,-,7.65367162579,CAACCACAACT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10535509,+,8.90592680995,GTGTGTGGTTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10541588,-,8.15853957972,AAACCACATCC,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10546258,-,8.74675343539,AAACCACAGTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10546679,+,8.20930045207,ATTTGTGGTTG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10551713,-,7.55124088359,AAAACACAGAG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10557571,+,8.33000240003,TTCTGTGGCCT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10559486,+,7.58990388363,TGTTGTGGCTA,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10563686,+,10.0493560403,CTTTGTGGTTT,
Stats: for one matrix (MA0002.2.pfm) for one chromosome (22), there were 15705 motifs.
If I put the p-value to 0.0000001
, there are 53 motifs for the same data files.
For reference, here's a snippet of the ensembl motif file:
22 . TF_binding_site 10526203 10526214 5.09674513946 + . binding_matrix_stable_id=ENSPFM0364;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207070093;transcription_factor_complex=MAFK%2CNRL
22 . TF_binding_site 10526207 10526221 6.8114068466 - . binding_matrix_stable_id=ENSPFM0363;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207094316;transcription_factor_complex=MAFF%2CMAFG%2CMAFK
22 . TF_binding_site 10526593 10526604 3.95053406962 + . binding_matrix_stable_id=ENSPFM0364;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207077101;transcription_factor_complex=MAFK%2CNRL
22 . TF_binding_site 10526597 10526611 3.88277292115 - . binding_matrix_stable_id=ENSPFM0363;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207106222;transcription_factor_complex=MAFF%2CMAFG%2CMAFK
I don't see these features on the ensembl browser, so they likely do some filtering.
@imedina The new file generated by Ensembl will solve our problem for human. Do we need to solve this problem for other species?
[main] WARN org.opencb.cellbase.lib.download.DownloadManager - ftp://ftp.ensemblgenomes.org/pub/release-29/metazoa/regulation/drosophila_melanogaster/MotifFeatures.gff.gz cannot be downloaded
Hi Julie,
I just realized that it IS actually possible to filter the existing .gff file.
This relatively simple grep command is all you need in order to filter the file and create a new one that only contains the experimentally verified motif features.
grep 'experimental_evidence' Homo_sapiens.GRCh38.motif_features.gff >Homo_sapiens.GRCh38.motif_features_experimental_evidence.gff
Please let me know if this is good enough for you.
Thanks,
Line count | File name |
---|---|
350,972,489 | Homo_sapiens.GRCh38.motif_features.gff |
11,978,682 | Homo_sapiens.GRCh38.motif_features_experimental_evidence.gff |
This looks promising, I will take a look next days
The MotifFeatures.gff file has been renamed for Ensembl version 89. We should update to Ensembl 90 as this will fix this and another problem. However, they have split this file into chromosomes for the most recent Ensembl version so we'll have to update this parser again anyway.