iprscan results loading

adf-ncgr commented 9 years ago

we would like to have interpro family member domains described on our polypeptide features, ultimately to enable displays similar to what can be seen at phytozome, e.g.:
http://phytozome.jgi.doe.gov/pz/portal.html#!gene?search=1&detail=1&crown&method=0&searchText=transcriptid:30476280
for which it is important to capture the locations of the domains as featurelocs referencing the polypeptides

before we proceed with the plan to use the modified iprscan gff output, we want to take a last look at the tripal iprscan loader and verify that it is not currently suitable for this purpose. My recollection from earlier testing is that the displays at other tripal sites such as this:
http://www.cottongen.org/gossypium/gossypium_raimondii/Cotton_D_gene_10006207

were just based on storing the iprscan XML as a featureprop of the gene.

My preference is to try to treat the iprscan hits as pairwise alignments between the protein and the HMM of the target database, and to handle this in chado similarly to how other pairwise alignments like BLAST are represented, as described here:

http://gmod.org/wiki/Chado_Best_Practices#Results_from_BLAST
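
A minimal sketch of what that representation might look like for a single HMM hit, following the BLAST pattern: one match feature with two featureloc rows, one on the polypeptide and one on the HMM. The field names mirror Chado's featureloc columns, but the helper, the identifiers and the coordinates are made up for illustration, and the rank convention (0 for the query, 1 for the target) is an assumption to be checked against the best-practices page.

```python
from dataclasses import dataclass


@dataclass
class FeatureLoc:
    """A simplified stand-in for one row of Chado's featureloc table."""
    feature: str      # uniquename of the match feature being located
    srcfeature: str   # uniquename of the sequence it is located on
    fmin: int         # interbase start (0-based)
    fmax: int         # interbase end
    rank: int         # distinguishes the two sides of the alignment


def hmm_hit_to_featurelocs(match_name, polypeptide, pep_start, pep_end,
                           hmm_accession, hmm_start, hmm_end):
    """Turn one 1-based, inclusive HMM hit into two featureloc rows.

    Hypothetical helper. Assumed convention: rank 0 is the location on the
    polypeptide (query), rank 1 is the location on the HMM (target).
    """
    return [
        FeatureLoc(match_name, polypeptide, pep_start - 1, pep_end, rank=0),
        FeatureLoc(match_name, hmm_accession, hmm_start - 1, hmm_end, rank=1),
    ]


# Made-up hit: residues 35-120 of a polypeptide aligned to positions 1-88
# of a Pfam model.
locs = hmm_hit_to_featurelocs("match_0001", "example.pep.1", 35, 120,
                              "PF00001", 1, 88)
for loc in locs:
    print(loc)
```

With both locations stored, a domain display only needs the rank 0 rows, while the rank 1 rows record which part of the model was matched.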

[LEGUME-230] created by adf_ncgr

adf-ncgr commented 9 years ago

Got my own lis instance (lis-peu) to work on from Nathan.

Referring to instructions on this page:
http://tripal.info/node/131

by peu

adf-ncgr commented 9 years ago

Last week Andrew installed the tripal_analysis_interpro module on lis-dev and also on lis-peu (my instance).
I tested it on my lis instance, and we can see the iprscan results as an 'Interpro Report' link on the polypeptide feature page in the UI
(similar to the cottongen link Andrew mentioned above), but the way this Drupal module loads data into the database may not be what we are looking for. It populates these chado tables:
analysis (to create the interpro analysis id)
analysisfeature (to record the features involved in this analysis)
analysisfeatureprop (with the XML chunk stored as is in the 'value' column)
The featureloc table is NOT populated with alignment data; the alignment coordinates are only displayed on the UI page as coord.

by peu

adf-ncgr commented 9 years ago

We plan to revisit this task once I am done with loading cajca, cicar.

Next step forward for this task for me will be to take the traditional route of loading GFF3 file of iprscan v5 results.
Before that, I need to parse and edit this GFF3, filling in columns 4 & 5 with the hmm start and end coordinate values taken from the XML format file.

by peu

adf-ncgr commented 9 years ago

Working on this task this week: writing and testing a script that extracts the hmm start & end coordinates from the iprscan XML file using the XML::Simple module.
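
For readers who want to see the idea, here is a rough sketch of this kind of extraction. It is written in Python with the standard library rather than Perl and XML::Simple, so it is not the script itself. The sample XML is made up, and the element and attribute names (`xref`, `signature ac`, `hmm-start`, `hmm-end`) are assumptions about the iprscan v5 XML layout that should be checked against a real output file.

```python
import xml.etree.ElementTree as ET

# Made-up sample; element and attribute names are assumptions about the
# iprscan v5 XML layout.
SAMPLE_XML = """\
<protein-matches>
  <protein>
    <xref id="example.pep.1"/>
    <matches>
      <hmmer3-match>
        <signature ac="PF00001" name="example_domain"/>
        <locations>
          <hmmer3-location start="35" end="120" hmm-start="1" hmm-end="88"/>
        </locations>
      </hmmer3-match>
    </matches>
  </protein>
</protein-matches>
"""


def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_hmm_coords(xml_text):
    """Return (protein id, signature accession, hmm-start, hmm-end) rows.

    Locations that carry no HMM coordinates are skipped.
    """
    rows = []
    root = ET.fromstring(xml_text)
    for protein in root.iter():
        if local_name(protein.tag) != "protein":
            continue
        ids = [e.get("id") for e in protein.iter()
               if local_name(e.tag) == "xref"]
        for match in protein.iter():
            if not local_name(match.tag).endswith("-match"):
                continue
            sig = next((e for e in match.iter()
                        if local_name(e.tag) == "signature"), None)
            if sig is None:
                continue
            for loc in match.iter():
                if not local_name(loc.tag).endswith("-location"):
                    continue
                hmm_start, hmm_end = loc.get("hmm-start"), loc.get("hmm-end")
                if hmm_start is None or hmm_end is None:
                    continue
                for pid in ids:
                    rows.append((pid, sig.get("ac"),
                                 int(hmm_start), int(hmm_end)))
    return rows


rows = extract_hmm_coords(SAMPLE_XML)
for row in rows:
    print("\t".join(str(value) for value in row))
```

Matching on the local tag name keeps the sketch working whether or not the file declares an XML namespace.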

by peu

adf-ncgr commented 9 years ago

Still working on the script.
So far I have managed to extract the hmm-start and hmm-end coordinates of the protein matches from the element tags of the iprscan XML. The script now prints rows of domain signature accessions with their hmm coordinates (which need to replace the start and end in 'Target=' start end [strand] in the GFF's 9th column).
Please note: columns 4 and 5 of the GFF need not be changed, as they are already coordinates relative to the polypeptide landmark in column 1. (After discussing with Andrew, we have decided to leave the signatures that don't have hmm coordinates as they are.)
I also worked on parsing the GFF3 file to extract the 'Name=' value from the 9th column and place it in the 'target_id' field of the 'Target' attribute.
I will have to piece this all together to replace the values in the Target attribute of the GFF so that the hmm matches are represented as alignments, and then test loading with gff_bulk_loader.
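
The Target rewrite described above could look roughly like this. It is a Python sketch, not the Perl script; the function name and the sample GFF line are made up, and it assumes the attributes in column 9 are separated by plain semicolons.

```python
def rewrite_target(gff_line, hmm_start, hmm_end):
    """Set Target= in column 9 to '<Name value> <hmm_start> <hmm_end>'.

    Columns 1-8 (including the polypeptide coordinates in columns 4 and 5)
    are left untouched. Lines that do not have 9 columns or a Name=
    attribute are returned unchanged (minus the trailing newline).
    """
    line = gff_line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 9:
        return line
    attrs = cols[8].split(";")
    name = next((a[len("Name="):] for a in attrs if a.startswith("Name=")),
                None)
    if name is None:
        return line
    # Drop the old Target (if any) and append the HMM-based one.
    kept = [a for a in attrs if a and not a.startswith("Target=")]
    cols[8] = ";".join(kept + [f"Target={name} {hmm_start} {hmm_end}"])
    return "\t".join(cols)


# Made-up example line.
original = ("example.pep.1\tPfam\tprotein_match\t35\t120\t1.2e-10\t+\t.\t"
            "ID=match1;Target=example.pep.1 35 120;Name=PF00001")
rewritten = rewrite_target(original, 1, 88)
print(rewritten)
```

Signatures without hmm coordinates would simply not be passed through this function, which leaves their lines as they are.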

I will continue this task next week when I am back on LIS.

Andrew, current development copy of code is at
~peu/interpro_stuff/iprscan_XMLSimple_test_adf_peu.pl
if you want to check the progress so far or want to check the output.

Thanks~

by peu

adf-ncgr commented 9 years ago

The parsing code is working fine, and the script can now produce output GFF files that incorporate hmm-start and hmm-end for the protein matches. I am also making changes to the local Adapter.pm module so that every protein target feature gets a common organism. Features are created for these protein targets using the Name= and ID= attributes from column 9 of the parsed iprscan GFF files.
We can now see these matches in the 'Alignment' tab of each polypeptide feature on my lis-peu instance UI. (Also, we have set the type for hmm matches to 'protein_hmm_match'.)

Andrew, if we decide to pre-load the chado database with these protein domains/families as features before loading the iprscan GFFs, I found that we would have to load accessions not only from Pfam but from all five of the following databases, since their iprscan targets all seem to come from hmm models:

Gene3D
Pfam
SMART
PIRSF
TIGRFAM

Let me know your thoughts. My thinking is that pre-loading these huge collections as features in chado would require us to load them as analysis feature types (for iprscan purposes), since we are not treating them as db accessions, in which case they would go in the dbxref table.

by peu

adf-ncgr commented 9 years ago

But if we stick with letting the gmod_bulk_load_gff3 loader create the protein domains as features (which it does automatically) as and when they occur in our iprscan GFFs, then we won't have to worry about pre-loading these dbs as I mentioned above. But I may be misreading the situation/expectations.
When you get some free time, please share your opinion.

by peu

adf-ncgr commented 9 years ago

Hi Pooja-
as we just discussed, go ahead without pre-loading. As long as the loader is not
creating duplicates, it is fine (and probably preferable) if we don't create features to
represent HMMs for which we have not yet seen any matches.

looking forward to seeing how this looks when you've loaded some more data...

adf

by adf_ncgr

adf-ncgr commented 9 years ago

Phavu and Glyma iprscan data are loaded on lis-dev and also synced. They can be viewed on the UI by everyone.

Please go to the "Alignment" tab on the polypeptide page of these species to check this new iprscan data on the UI.

by peu

adf-ncgr commented 9 years ago

Done parsing the iprscan GFFs; they are loaded and synced on lis-dev for all species (phavu, medtr, glyma, araip, aradu, cicar, cajca). Next, I am working on creating an MView in Drupal to display the count of polypeptides per species that have the same protein domain/family.
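
To illustrate the aggregation the mview is meant to provide (the mview itself would be defined in SQL against chado; this is only a Python sketch over made-up rows), the count per domain and species could be computed as below. It assumes a polypeptide with several hits to the same domain should be counted once.

```python
from collections import defaultdict

# Made-up rows of (domain accession, species, polypeptide name).
hits = [
    ("PF00001", "phavu", "phavu.pep.1"),
    ("PF00001", "phavu", "phavu.pep.1"),  # second hit on the same polypeptide
    ("PF00001", "phavu", "phavu.pep.2"),
    ("PF00001", "glyma", "glyma.pep.7"),
    ("PF00002", "glyma", "glyma.pep.7"),
]


def count_polypeptides(rows):
    """Count distinct polypeptides per (domain, species) pair."""
    seen = defaultdict(set)
    for domain, species, polypeptide in rows:
        seen[(domain, species)].add(polypeptide)
    return {key: len(peps) for key, peps in seen.items()}


counts = count_polypeptides(hits)
for (domain, species), n in sorted(counts.items()):
    print(domain, species, n)
```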

by peu

adf-ncgr commented 9 years ago

After the mview was created successfully, I built a tripal_domain module around it on lis-peu so that it is easily portable and can be installed by other users (which is not possible with the mview alone). At present I couldn't transfer it to lis-dev for group access, because all the data it needs there had to be replaced by the last rollover snapshot of lis-stage. So we just had Nathan make a dump of my loaded data onto lis-peu so that I can continue the testing work.

Illiana installed this module from my public github (https://github.com/pumale/tripal_domain) on her instance and it works fine. I still have to get it working correctly on my instance (it is not creating the view on lis-peu for some reason).

I will start loading the iprscan data into lis-stage once we decide to have this large number of features in chado and once I get the go-ahead from Andrew/Nathan.

Work is still in progress.

Thanks~

by peu

adf-ncgr commented 9 years ago

Also, the interpro ontology (interpro.xml) was loaded using a loader script. I still have to test the GENE3D dbxref creation issue in it. The dbxrefs for all other databases in this ontology file have been loaded correctly.

by peu

adf-ncgr commented 9 years ago

Okay. tripal_domain is now successfully installed on my instance as well. It was breaking because it was getting confused by the old mview. A "Revert" of the view solved the issue.

by peu

adf-ncgr commented 9 years ago

Hello –

I messed around with the interproscan-derived protein domain annotations, used a few search terms, domain names/ID, interpro term and all worked fine.

While messing around I noticed that some Domain IDs have no Domain name, Interpro term or Description, for example PF15628. I am not sure if this is how it is supposed to be, but I thought I would mention it.

The other thing I was thinking, as a user of the LIS: I would like some sort of way to see what gene families the Domain IDs are associated with.

If you want me to mess around with the interproscan-derived protein domain annotations, just let me know.

by jdjax

adf-ncgr commented 9 years ago

Hello Jacqueline, Thank you for the feedback!

To answer your comments-

1. You are right, you will see some Domain IDs without a Domain name, Interpro term or Description on this page, because the original source of this data (Interpro ontology ver 46.0) does not have entries for those domains. You can verify this in the file the data comes from:
on lis-dev go to: ~peu/interpro_stuff/interpro_46.0.xml

2. For gene family association, we already have that functionality. If you click on one of the count links (e.g. in the Glyma count column) next to the domain, it will redirect you to the gene page, which itself has a column where the gene_family (link) associated with that domain is available.

Hope this answers your query.

Pooja

by peu

adf-ncgr commented 9 years ago

Some observations on protein domain search module

– Overview display: the organism shows as "consensus consensus (consensus)", e.g. PF13251; see if the repeated words are avoidable.
– The IPR# link in the interpro col and the 'more' link in the next col go to the exact same place; may think of keeping only one, preferably the IPR# link.

– Does it make sense to provide a direct Pfam link (e.g., http://pfam.xfam.org/family/PF03106) when we are already giving the IPro link (you can get to Pfam from IPro)? The Pfam page provides a nice protein structure image.

Liked:
– leads to genes that code for these domains and their functional annotations and opportunity to build a gene list. Also leads to the corresponding gene family**.

– link to interpro domain for further and detailed info about the domain.

**Would there be a way to get from a gene family to the domains it contains? I am perhaps just being over-ambitious!

by sdash-legume

adf-ncgr commented 9 years ago

Two issues in domain data have been resolved:
1] Two xref tags inside one protein element in iprscan GFF have been taken care of by separating them.
Example from XML file:

2] A type discrepancy, where some mRNAs (instead of only polypeptides) were linked in the alignment section of domains, has been resolved by updating the 'Adapter.pm' code and also by using the --recreate_cache option while loading data.
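
For item 1], the original XML example is not reproduced here. Purely as an illustration of the idea of separating the xrefs, a Python sketch over a made-up, namespace-free snippet might look like this (the element names are assumptions about the iprscan XML layout, and a real file with an XML namespace would need namespace-aware lookups):

```python
import copy
import xml.etree.ElementTree as ET

# Made-up sample: one protein element carrying two xref tags.
SAMPLE = """\
<protein-matches>
  <protein>
    <xref id="example.pep.1"/>
    <xref id="example.pep.2"/>
    <matches/>
  </protein>
</protein-matches>
"""


def split_multi_xref_proteins(root):
    """Rewrite the tree so that every <protein> carries exactly one <xref>."""
    for protein in list(root.findall("protein")):
        xrefs = protein.findall("xref")
        if len(xrefs) < 2:
            continue
        position = list(root).index(protein)
        root.remove(protein)
        for offset, xref in enumerate(xrefs):
            # Each clone keeps all matches but only one of the xrefs.
            clone = copy.deepcopy(protein)
            for other in clone.findall("xref"):
                if other.get("id") != xref.get("id"):
                    clone.remove(other)
            root.insert(position + offset, clone)
    return root


root = split_multi_xref_proteins(ET.fromstring(SAMPLE))
ids = [[x.get("id") for x in p.findall("xref")]
       for p in root.findall("protein")]
print(ids)
```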

Andrew was able to delete the previous incomplete data from lis-stage fairly quickly by putting an index on feature_id in the phylonode table (this made the deletion process faster). We have re-loaded the corrected data on lis-stage and also loaded domain data for aradu and araip on peanutbase-stage.
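
As a small illustration of why such an index helps: deleting features presumably forces a lookup of the phylonode rows that reference each deleted feature_id, and without an index that lookup is a full table scan. The sketch below uses an in-memory SQLite database with heavily simplified tables as a stand-in for chado on PostgreSQL, so the table definitions and the index name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the chado tables.
conn.executescript("""
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phylonode (
        phylonode_id INTEGER PRIMARY KEY,
        feature_id INTEGER REFERENCES feature (feature_id),
        label TEXT
    );
""")

# The kind of lookup needed for every deleted feature row.
LOOKUP = "SELECT label FROM phylonode WHERE feature_id = ?"


def query_plan(connection, sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, (1,)).fetchall()
    return "; ".join(row[3] for row in rows)


plan_before = query_plan(conn, LOOKUP)
conn.execute("CREATE INDEX phylonode_feature_id_idx ON phylonode (feature_id)")
plan_after = query_plan(conn, LOOKUP)

print("before:", plan_before)  # a full scan of phylonode
print("after: ", plan_after)   # a search using the new index
```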

Feel free to check the work at url /search/protein_domains

If everybody is happy with the page we can close this issue.

by peu

adf-ncgr commented 9 years ago

Data that has been loaded on lis and peanutbase is stored at /legumeinfo/iprscan_data/

by peu

legumeinfo / jira-issues

iprscan results loading #198
