legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Inconsistent gene names in Arachis annotations #144

Open sammyjava opened 1 year ago

sammyjava commented 1 year ago

This may just be a "bugs Sam" issue, but it seems odd that we have inconsistent naming of genes in the Arachis annotations, with and without the 'Aradu' or 'Araip' prefix.

arachismine=> select name from gene where organismid=2000002 and assemblyversion='gnm1' limit 1;
    name     
-------------
 Aradu.RI96J

arachismine=> select name from gene where organismid=3000002 and assemblyversion='gnm1' limit 1;
  name  
--------
 URH0Q0

arachismine=> select name from gene where organismid=3000002 and assemblyversion='gnm2' limit 1;
  name  
--------
 US28JK

arachismine=> select name from gene where organismid=4000002 and assemblyversion='gnm1' limit 1;
    name     
-------------
 Araip.8D2WZ
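
To make the pattern in the four query results above easier to see, here is a minimal Python sketch that classifies each name as prefixed or bare. The regular expression is only an assumption about what a "prefixed" name looks like (letters, a dot, then an identifier), and `has_species_prefix` is a hypothetical helper, not part of any LIS tooling. The rows are copied from the output above.

```python
import re

# Assumption: a "prefixed" gene name has the shape <Prefix>.<identifier>,
# e.g. Aradu.RI96J; a bare name such as URH0Q0 contains no dot.
PREFIXED = re.compile(r"^[A-Za-z]+\.[A-Za-z0-9]+$")


def has_species_prefix(name: str) -> bool:
    """Return True if the name matches the assumed <Prefix>.<identifier> shape."""
    return PREFIXED.match(name) is not None


# (organismid, assemblyversion, name) rows copied from the queries above.
rows = [
    (2000002, "gnm1", "Aradu.RI96J"),
    (3000002, "gnm1", "URH0Q0"),
    (3000002, "gnm2", "US28JK"),
    (4000002, "gnm1", "Araip.8D2WZ"),
]

# Map each (organismid, assemblyversion) pair to whether its sample name is prefixed.
summary = {(org, asm): has_species_prefix(name) for org, asm, name in rows}

for (org, asm), prefixed in summary.items():
    print(org, asm, "prefixed" if prefixed else "bare")
```

With these rows, the two diploid entries come out as prefixed and both organismid=3000002 assemblies come out as bare.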

adf-ncgr commented 1 year ago

Bugs adf and probably others too. I think the history on this is that when we were making the hypogaea annotations, @cann0010 had suggested that there was yuck-redundancy for the diploids, e.g. aradu.V14167.gnm1.ann1.Aradu.RI96J (back then we were also using aradu.Aradu.RI96J for display names in trees instead of embracing full yuck, which we have done since, so it was even more obviously goofy). At that time, I didn't see any absolute necessity that ID = full-yuck+Name, as long as it was full-yuck+something_identifying, so IIRC we had used e.g. ID=arahy.Tifrunner.gnm1.ann1.URH0Q0 and Name=arahy.URH0Q0; but I think someone remunged the gff file at some point.
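
As a rough illustration of the "full-yuck+something_identifying" idea, here is a small Python sketch. Both helper names (`full_yuck_id`, `display_name`) are hypothetical, and dropping a leading species tag from the diploid names is just one way to read the redundancy concern; it is not a description of what the datastore does today.

```python
def full_yuck_id(prefix: str, name: str) -> str:
    """Build an ID from a full-yuck prefix and a gene name.

    Assumption: if the name starts with a tag equal (ignoring case) to the
    first component of the prefix, e.g. 'Aradu.' under 'aradu.…', that tag
    is treated as redundant and dropped.
    """
    gensp = prefix.split(".")[0]
    head, sep, rest = name.partition(".")
    if sep and head.lower() == gensp.lower():
        name = rest
    return f"{prefix}.{name}"


def display_name(prefix: str, name: str) -> str:
    """Build a short Name of the form <gensp>.<identifier> (hypothetical convention)."""
    gensp = prefix.split(".")[0]
    identifier = full_yuck_id(prefix, name)[len(prefix) + 1:]
    return f"{gensp}.{identifier}"


# Hypogaea example from the comment above: the name carries no species tag.
print(full_yuck_id("arahy.Tifrunner.gnm1.ann1", "URH0Q0"))
print(display_name("arahy.Tifrunner.gnm1.ann1", "URH0Q0"))

# Diploid example: the redundant 'Aradu.' tag is dropped under this assumption.
print(full_yuck_id("aradu.V14167.gnm1.ann1", "Aradu.RI96J"))
```

The point of the sketch is only that ID can stay unique without repeating the species tag, whichever form Name ends up taking.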

Probably worth discussing further as a more general topic. This business of Name and ID and how they are used is definitely a weak point of gff3; the AgBioData working group made a proposal about it, but it is not consistent with what we're currently doing (IIRC, the suggestion was basically to not use ID for anything other than internal referencing within the gff file and to introduce a new attribute based on CURIEs). It may be that the new system we're adopting of supplying id_map files for features can play some role here (i.e. I think the way we're using name is usually "what the original group called it", which is usually but not always the same as what we end up with after full-yuckification, especially as more groups start to include full-yuck-like metadata in what they call things).