legumeinfo / jira-issues

placeholder repo for issues migrating from JIRA system, to be moved to their appropriate places later

additional import associations of polypeptides from phytozome with gene families #51

Closed adf-ncgr closed 9 years ago

adf-ncgr commented 10 years ago

A fair number of associations of Glyma/Phavu/Medtr proteins with phytozome gene families have been missed due to a subtlety in the way that phytozome distinguishes "founding members" (which are present in the MSAs we downloaded as the foundation of our import) from other membership types, as explained in the email from David Goodstein pasted below. I have downloaded a representation of all angiosperm family-polypeptide associations that can be used to augment what we already have (it will need a little filtering before it is of use). The main use case for having these at this time would be proper coloration in the context views. We may try to reincorporate them into the MSAs/trees sometime (although perhaps we will find that phytozome's exclusion makes sense for us too?).

The number of proteins contained in the MSAs vs. those contained in the families for each of the legume species is as follows:

species  MSA count  family count  gene count
Glyma    47681      56679         56044
Medtr    35341      51574         50894
Phavu    25456      27356         27197

just noticed that the discrepancy between family counts and gene counts implies that some genes are associated with multiple families, even within a single phytozome level (here Angiosperm). This is unexpected, but appears to be the case, e.g.:

phytozome_family_peptides.tsv:54574520 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54580213 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54586455 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54623738 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54642978 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54660202 Pentatricopeptide repeat-containing protein At1g31790 (SwissProt v35) Phvul.005G052800.1
phytozome_family_peptides.tsv:54695893 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54701588 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54749160 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54755713 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54757210 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54759268 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54783184 "" Phvul.005G052800.1
phytozome_family_peptides.tsv:54788557 "" Phvul.005G052800.1

will need to follow up on this a bit more...
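
As a starting point for that follow-up, here is a minimal sketch of how the multi-family peptides could be listed from the downloaded file. It assumes phytozome_family_peptides.tsv is tab-separated with three columns (family ID, description, peptide ID), which is only inferred from the grep output above. The function names are hypothetical and the sample rows are illustrative.

```python
import csv
from collections import defaultdict


def read_rows(path):
    """Yield (family_id, description, peptide_id) from a tab-separated file.

    Assumption: three columns in this order, as suggested by the grep output.
    """
    with open(path, newline="") as handle:
        for record in csv.reader(handle, delimiter="\t"):
            if len(record) >= 3:
                yield record[0], record[1], record[2]


def multi_family_peptides(rows):
    """Return {peptide_id: sorted family IDs} for peptides in more than one family."""
    families = defaultdict(set)
    for family_id, _description, peptide_id in rows:
        families[peptide_id].add(family_id)
    return {pep: sorted(fams) for pep, fams in families.items() if len(fams) > 1}


# Illustrative rows only; real input would come from read_rows(path).
sample = [
    ("54574520", "", "Phvul.005G052800.1"),
    ("54580213", "", "Phvul.005G052800.1"),
    ("54708129", "", "Phvul.011G000600.1"),
]
duplicates = multi_family_peptides(sample)
```

Running `multi_family_peptides(read_rows("phytozome_family_peptides.tsv"))` on the real file would give the full list of peptides to investigate, and its length would show how much of the family-count vs. gene-count gap it accounts for.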

phytozome email follows:
Hi Andrew,

I’m glad you’re finding the families useful; they are a big focus of our efforts now as we try to improve analysis of some of our larger families and refine the orthology calling pipeline (hope to be releasing some of that on the site in October).
Your question is another reminder to me to get some documentation out! That's not a database inconsistency you're seeing; it’s just that in v10.0.3 and later releases, we changed the default family view to only show “Founding” members. For the family in question, if you click the “226 members” button in the size field, it will show All members, not just founders. You’ll then see that the phaseolus gene in question has a membership type of “Pi”, which means “Pledge inconsistent”. That means that according to various scoring criteria of hits of the gene against the HMM family profiles at various nodes, it does not qualify to be a “Founding” member. That’s also why it wasn’t in the precomputed MSA stored in InterMine (another aspect we should document better).
We will be moving to a more InterPro-like handling of non-founder family members: they will simply be reported as having hits to various families at various nodes, but will not be reported as “members, but second class members” of the various families.

best,
-David

On Sat, Sep 20, 2014 at 1:48 PM, Andrew Farmer wrote:

Hi-
we've been using the very helpful phytozome gene family information for some work
on our projects, and recently noticed something which seems a bit peculiar regarding
a gene family association in phytozome10 (it may be an isolated example, I haven't
investigated this deeply yet).

For the P. vulgaris gene Phvul.011G000600, in the gene page:
phytozome.jgi.doe.gov/pz/portal.html#!gene?search=1&detail=1&method=3253&searchText=transcriptid:27150729

the gene ancestry table seems to indicate that the Angiosperm family for this gene is 54708129
(and this seems correct based on the homology and annotation of the gene).

However, in the page for the family:
http://phytozome.jgi.doe.gov/pz/portal.html#!showCluster?search=1&detail=0&method=4835&searchText=clusterid:54708129

the gene Phvul.011G000600 is not listed as a member (and was not present in the MSA I downloaded for
this family through the intermine service).

It seems likely that something is wrong in the database, given the conflict between the two views,
but perhaps I'm simply misunderstanding how the family data is being represented. Any insight
you could give would be most helpful.

thanks in advance

Andrew Farmer

[LEGUME-83] created by adf_ncgr

adf-ncgr commented 9 years ago

at least for now, this has become irrelevant given the current approach for assigning to phytozome models on the basis of best hit (which both introduces the possibility of inconsistent assignments and also obviates the possibility of multiple family assignments).
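
To illustrate why a best-hit approach rules out multiple family assignments, here is a rough sketch. It is not the actual pipeline: the input shape (protein ID, family ID, score), the function name, and the "higher score is better" convention are all assumptions made for the example.

```python
def best_hit_assignments(hits):
    """Keep, for each protein, only the family of its highest-scoring hit.

    hits: iterable of (protein_id, family_id, score) tuples (hypothetical format).
    Assumption: a higher score means a better hit.
    """
    best = {}
    for protein_id, family_id, score in hits:
        if protein_id not in best or score > best[protein_id][1]:
            best[protein_id] = (family_id, score)
    return {protein: family for protein, (family, _score) in best.items()}


# Illustrative hits only.
hits = [
    ("p1", "famA", 10.0),
    ("p1", "famB", 42.5),
    ("p2", "famA", 7.0),
]
assignments = best_hit_assignments(hits)
```

Because the result is a mapping keyed by protein, each protein ends up with exactly one family. The same property means two closely related proteins can land in different families if their top hits differ, which is the inconsistency mentioned above.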

by adf_ncgr