pantherdb / db-PAINT

Application for curators to make Phylogenetic-based gene function predictions.

Motif / Domain Cartoon View #3

Closed mugitty closed 5 years ago

mugitty commented 5 years ago

In addition to the MSA View, there would be a new view in which the Pfam domain and signature hits for each sequence are represented graphically as GIFs. Each GIF's length and position would be proportional to the fraction of the sequence covered by the hit and to the hit's position in that sequence. Within a particular Panther family, GIFs for matches to different Pfam models would be colored differently. The colors themselves have no special importance; they only make it easier to see which Pfam hits are present. Hovering the mouse over a GIF would display the Pfam name. How to handle Pfam hits that overlap within a single sequence is not yet known. I (Michael) believe that we offset these vertically in the original Tree Attribute Viewer.

- @dustine32 to load data and let me know
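The layout described above could be sketched roughly as follows. This is only an illustration, not the PAINT implementation: the names `Hit`, `layout` and `assign_colors` are hypothetical, the palette is arbitrary, and the sequence length is assumed to be supplied by the caller. Overlapping hits are pushed into a separate lane, which is one possible reading of "offset vertically".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    name: str   # Pfam model name, e.g. "Sushi"
    start: int  # 1-based inclusive start on the sequence
    end: int    # 1-based inclusive end on the sequence


def layout(hits, seq_length):
    """Return one dict per hit with fractional x, fractional width and a lane."""
    lane_ends = []  # last occupied residue in each lane
    placed = []
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        # Reuse the first lane whose last hit ends before this one starts.
        for lane, last_end in enumerate(lane_ends):
            if hit.start > last_end:
                lane_ends[lane] = hit.end
                break
        else:
            # Every existing lane overlaps this hit, so open a new lane.
            lane = len(lane_ends)
            lane_ends.append(hit.end)
        placed.append({
            "name": hit.name,
            "x": (hit.start - 1) / seq_length,
            "width": (hit.end - hit.start + 1) / seq_length,
            "lane": lane,
        })
    return placed


def assign_colors(names, palette=("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")):
    """Give each distinct model name a palette color, cycling if needed."""
    distinct = sorted(set(names))
    return {name: palette[i % len(palette)] for i, name in enumerate(distinct)}
```

Multiplying `x` and `width` by the pixel width of the drawing area gives the on-screen rectangle, and `lane` can be multiplied by a row height to get the vertical offset. The colors only need to be distinct within one family, so a small cycling palette may be enough.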

dustine32 commented 5 years ago

Looking for data in Pfam FTP, I see the domain info is in the proteome-specific files here: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/proteomes/

Files are named by NCBI taxon ID, so we only need to download the 132 files for the genomes we support. Using UniProtKB:P48740 as an example: [image]

You can get rows from 9606.tsv (for human) containing some of the displayed data:

$ grep P48740 9606.tsv
#<seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <clan>
P48740  301 362 301 362 PF00084 Sushi   PfamLive::Result::SequenceOntology=HASH(0x89b5d08)  1   56  56  41.60   3.5e-07 CL0001
P48740  367 432 367 432 PF00084 Sushi   PfamLive::Result::SequenceOntology=HASH(0x89b5d08)  1   56  56  27.70   0.008   CL0001
P48740  449 691 449 691 PF00089 Trypsin PfamLive::Result::SequenceOntology=HASH(0x89b6ff0)  1   221 221 248.50  1.6e-70 CL0124
P48740  185 294 185 294 PF00431 CUB PfamLive::Result::SequenceOntology=HASH(0x8a04a08)  1   110 110 108.50  5.3e-28 CL0164
P48740  24  135 20  135 PF00431 CUB PfamLive::Result::SequenceOntology=HASH(0x8a04a08)  5   110 110 66.70   5.2e-15 CL0164
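As a rough sketch of how rows like these could be read, the snippet below follows the column order given in the `#<seq id> ...` header line. The names `PfamHit` and `parse_row` are hypothetical. It assumes that no field contains whitespace (true for the rows shown here) and that the clan column may be absent.

```python
from typing import NamedTuple, Optional


class PfamHit(NamedTuple):
    seq_id: str
    ali_start: int
    ali_end: int
    env_start: int
    env_end: int
    hmm_acc: str
    hmm_name: str
    bit_score: float
    e_value: float
    clan: Optional[str]


def parse_row(line: str) -> Optional[PfamHit]:
    """Parse one data row; return None for comment and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    # Assumption: fields are whitespace-separated and contain no spaces.
    cols = line.split()
    if len(cols) < 13:
        raise ValueError(f"expected at least 13 columns, got {len(cols)}")
    return PfamHit(
        seq_id=cols[0],
        ali_start=int(cols[1]),
        ali_end=int(cols[2]),
        env_start=int(cols[3]),
        env_end=int(cols[4]),
        hmm_acc=cols[5],
        hmm_name=cols[6],
        # cols[7] is the <type> column; cols[8:11] are the HMM coordinates.
        bit_score=float(cols[11]),
        e_value=float(cols[12]),
        clan=cols[13] if len(cols) > 13 else None,
    )
```

The `<type>` column is skipped here because in these rows it only holds a `PfamLive::Result::SequenceOntology=HASH(...)` string rather than a readable type.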

@mugitty Would this format work for you to display the domain info? Does this data need to be loaded into the DB or can the 132 files just sit on a server somewhere for the tool to access?

I haven't yet been able to find the raw residue annotation data like this: [image]. There is nothing like it in the above tsv file, though you can click that download link to get the JSON file, which contains these residue annotations gene by gene. I'm wondering if they don't publish these in bulk (e.g. by genome) and/or we have to ask them for it.

That huge uniprot_reference_proteomes.dat.gz I downloaded from Pfam was just the reference proteome data used in construction of this Pfam version. It didn't even contain the domain info.

mugitty commented 5 years ago

@dustine32 we can start off with the 132 files. I will see how long it takes to parse the files for a subset of trees. I only need to parse files based on the organisms in a given tree.
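Restricting the parse to the organisms in one tree might look something like the sketch below. It is only an illustration: `hits_for_tree` and the `leaf_taxa` mapping (sequence ID to NCBI taxon ID) are hypothetical, and the files are assumed to be stored uncompressed as `<taxon ID>.tsv` in a local directory.

```python
from collections import defaultdict
from pathlib import Path


def hits_for_tree(leaf_taxa, data_dir):
    """Collect raw Pfam rows for the leaves of one tree.

    leaf_taxa maps sequence ID -> NCBI taxon ID (hypothetical input shape).
    Only the files for taxa that occur in the tree are opened.
    """
    wanted = defaultdict(set)
    for seq_id, taxon in leaf_taxa.items():
        wanted[str(taxon)].add(seq_id)

    hits = defaultdict(list)
    for taxon, seq_ids in wanted.items():
        path = Path(data_dir) / f"{taxon}.tsv"  # assumed naming scheme
        if not path.exists():
            continue  # no Pfam file available for this organism
        with path.open() as handle:
            for line in handle:
                if line.startswith("#"):
                    continue
                cols = line.split()
                if cols and cols[0] in seq_ids:
                    hits[cols[0]].append(cols)
    return dict(hits)
```

Each file is still scanned once from top to bottom, so the cost per tree grows with the size of the proteome files for the organisms involved rather than with all 132 files.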

dustine32 commented 5 years ago

@mugitty I have these sitting on our FTP server now:

http://data.pantherdb.org/Pfam/

mugitty commented 5 years ago

Thanks @dustine32, I will try to parse these.

mugitty commented 5 years ago

Domain view has been incorporated into the PAINT client.