opencb / cellbase

High-Performance NoSQL database and RESTful web services to access to most relevant biological data
Apache License 2.0
89 stars 53 forks source link

MotifFeatures.gff doesn't exist in Ensembl 89 #466

Open julie-sullivan opened 4 years ago

julie-sullivan commented 4 years ago

The MotifFeatures.gff file has been renamed for Ensembl version 89. We should update to Ensembl 90 as this will fix this and another problem. However, they have split this file into chromosomes for the most recent Ensembl version so we'll have to update this parser again anyway.

Ensembl Version filename url
< 89 MotifFeatures.gff.gz ftp.ensembl.org/pub/release-89/regulation/homo_sapiens
90 homo_sapiens.GRCh38.motiffeatures.20161111.gff.gz ftp.ensembl.org/pub/release-90/regulation/homo_sapiens/
98 Homo_sapiens.GRCh38.motif_features.gff.gz ftp.ensembl.org/pub/release-98/regulation/homo_sapiens/MotifFeatures/
imedina commented 4 years ago

At least in 98, there is a file with all the Motifs called _Homo_sapiens.GRCh38.motiffeatures.gff.gz:

ftp://ftp.ensembl.org/pub/release-98/regulation/homo_sapiens/MotifFeatures/

julie-sullivan commented 4 years ago

the new file is very large, 10 GB.

Plan A: load these data but filter to only get the "good ones" Plan B: use MOODS to generate the file ourselves.

http://www.ensembl.info/2018/10/15/new-ensembl-motif-features/

julie-sullivan commented 4 years ago

I tried to run MOODS myself

~/git/MOODS/python/scripts (master) $ python moods-dna.py -m ~/data/jaspar/*.pfm -s ~/data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa -p 0.0001
Segmentation fault (core dumped)
julie-sullivan commented 4 years ago
julie-sullivan commented 4 years ago

Here's a discussion of the score: https://github.com/jhkorhonen/MOODS/wiki/Brief-theoretical-introduction

julie-sullivan commented 4 years ago

JASPAR file:

>MA0642.1       EN2
A  [   402    212     10   1648   1648      0      0   1648    350    181 ]
C  [   564    681   1024    297    105      0    420      0    491    708 ]
G  [   456    412     35    164      7      0    203     45    630    476 ]
T  [   225    343    624     53      0   1648   1648      0    177    283 ]

Example file:

 0  3 79 40 66 48 65 11 65  0
94 75  4  3  1  2  5  2  3  3
 1  0  3  4  1  0  5  3 28 88
 2 19 11 50 29 47 22 81  1  6

maybe that's my problem?

julie-sullivan commented 4 years ago

If I format the JASPAR file, here is a snippet:

./python/scripts/moods-dna.py -m /home/julie/data/matrices/*.pfm -s /home/julie/data/Homo_sapiens.GRCh38.dna.chromosome.22.fa -p 0.0001 
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10511859,-,7.53854401973,AAGCCACAGTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10514271,-,7.39889409503,AAAACACAAAC,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10514592,-,7.5969371834,GAACCACAACG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10519785,+,7.45378788769,AACTGTGGTTG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10521540,+,8.15256600968,CTTTGTGGTTC,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10525467,+,8.89032049779,TTTTGTGGCTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10527203,+,10.0985299135,TTTTGTGGTTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10527462,+,8.28082852686,CTCTGTGGCCT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10532985,-,7.65367162579,CAACCACAACT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10535509,+,8.90592680995,GTGTGTGGTTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10541588,-,8.15853957972,AAACCACATCC,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10546258,-,8.74675343539,AAACCACAGTT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10546679,+,8.20930045207,ATTTGTGGTTG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10551713,-,7.55124088359,AAAACACAGAG,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10557571,+,8.33000240003,TTCTGTGGCCT,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10559486,+,7.58990388363,TGTTGTGGCTA,
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF,MA0002.2.pfm,10563686,+,10.0493560403,CTTTGTGGTTT,

Stats: for one matrix (MA0002.2.pfm) for one chromosome (22), there were 15705 motifs.

If I put the p-value to 0.0000001, there are 53 motifs for the same data files.

julie-sullivan commented 4 years ago

For reference, here's a snippet of the ensembl motif file:

22      .       TF_binding_site 10526203        10526214        5.09674513946   +       .       binding_matrix_stable_id=ENSPFM0364;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207070093;transcription_factor_complex=MAFK%2CNRL
22      .       TF_binding_site 10526207        10526221        6.8114068466    -       .       binding_matrix_stable_id=ENSPFM0363;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207094316;transcription_factor_complex=MAFF%2CMAFG%2CMAFK
22      .       TF_binding_site 10526593        10526604        3.95053406962   +       .       binding_matrix_stable_id=ENSPFM0364;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207077101;transcription_factor_complex=MAFK%2CNRL
22      .       TF_binding_site 10526597        10526611        3.88277292115   -       .       binding_matrix_stable_id=ENSPFM0363;epigenomes_with_experimental_evidence=HepG2;stable_id=ENSM00207106222;transcription_factor_complex=MAFF%2CMAFG%2CMAFK

I don't see these features on the ensembl browser, so they likely do some filtering.

julie-sullivan commented 4 years ago

@imedina The new file generated by Ensembl will solve our problem for human. Do we need to solve this problem for other species?

[main] WARN org.opencb.cellbase.lib.download.DownloadManager - ftp://ftp.ensemblgenomes.org/pub/release-29/metazoa/regulation/drosophila_melanogaster/MotifFeatures.gff.gz cannot be downloaded
julie-sullivan commented 4 years ago
Hi Julie,
I just realized that it IS actually possible to filter the existing .gff file.
This relatively simple grep command is all you need in order to filter the file and create a new one that only contains the experimentally verified motif features.

grep 'experimental_evidence' Homo_sapiens.GRCh38.motif_features.gff >Homo_sapiens.GRCh38.motif_features_experimental_evidence.gff

Please let me know if this is good enough for you.

Thanks,
julie-sullivan commented 4 years ago
Line count File name
350,972,489 Homo_sapiens.GRCh38.motif_features.gff
11,978,682 Homo_sapiens.GRCh38.motif_features_experimental_evidence.gff
imedina commented 4 years ago

This looks promising, I will take a look next days