Closed adf-ncgr closed 2 years ago
I've already loaded this one, but I can re-load it on top later on when it's fixed with a merge priority.
yeah, no worries. not sure how soon I'm going to tackle fixing it, but I guess I should at least get the wheels in motion.
This file needs a-fixin'. Dashes instead of dots on CDS identifiers, etc. I don't know if you did anything, the file date is July 12 which is after this issue was posted. But stuff like this should be fixed:
vigan.Shumari.gnm1.Chr01 Vangularis_v1.a1 CDS 54435 54471 . - 1 ID=vigan.Shumari.gnm1.ann1.Vigan.01G000200.01-CDS-1;Parent=vigan.Shumari.gnm1.ann1.Vigan.01G000200.01
That should be vigan.Shumari.gnm1.ann1.Vigan.01G000200.01.CDS.1 but also since when do we use .01 as a transcript suffix instead of .1? Please make conformant.
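For reference, the CDS rewrite being asked for could be sketched roughly as below. This is not the script anyone in this thread actually used. The class name `CdsIdFixer` and the method `fixCdsId` are made up for illustration, and the sketch assumes the only nonconformance is a trailing `-CDS-<n>` on the ID attribute. It leaves the `.01` transcript suffix alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any existing pipeline.
class CdsIdFixer {
    // An ID attribute whose value ends in "-CDS-<n>", e.g. "...01G000200.01-CDS-1".
    // Assumption: the ID attribute is spelled exactly "ID=" as in GFF3.
    private static final Pattern DASHED_CDS =
        Pattern.compile("((?:^|;)ID=[^;]*?)-CDS-(\\d+)(?=;|$)");

    /** Rewrites "-CDS-n" to ".CDS.n" in the ID of a GFF3 attributes column. */
    static String fixCdsId(String attributes) {
        return DASHED_CDS.matcher(attributes).replaceAll("$1.CDS.$2");
    }

    public static void main(String[] args) {
        String attrs = "ID=vigan.Shumari.gnm1.ann1.Vigan.01G000200.01-CDS-1;"
            + "Parent=vigan.Shumari.gnm1.ann1.Vigan.01G000200.01";
        System.out.println(fixCdsId(attrs));
    }
}
```

Attribute strings with no dashed CDS ID pass through unchanged, so this would be safe to run over every line of the file.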
Regarding part of the original issue having to do with ontology terms, looks like I got close but did not complete; will do so now.
Regarding the part of the issue having to do with general funkiness, the source db assigned the .01 suffixes. I don't see a compelling reason to change them although they are admittedly weird (though not the weirdest we've seen and retained); I don't think it violates any spec that I'm aware of.
Not sure who added the CDS IDs to this (possibly Steven?), but I could redo it using a script I've used on others that might make it more consistent with other cases in which we've added them (and not with cases where we've inherited them from the source); again I don't think they're in strict violation of anything are they?
I'm not sure about the exact history on this, but the original work was done in July 2018, and then some patching was done in 2019. Notes here:
/usr/local/www/data/private/Vigna/angularis/Shumari.gnm1.ann1.8BRS/notes
Probably doesn't matter at this point, apart from noting the original provenance: http://viggs.dna.affrc.go.jp
Thanks @cann0010 - while checking the original gff from the site I noticed they have a decent number of other Vigna assemblies available; although most are scaffold-level there is a chromosome level V. trilobata genome with annotations. To my shame, I hadn't previously heard of V. trilobata but let me know if you think it looks sufficiently interesting to be included.
Another eminently weird thing about the current Shumari file is that the genes don't seem to be Parents of the mRNA. Guessing this is because the original file doesn't have genes and one of us added them but didn't add Parent attributes. I can tackle this.
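One way the missing Parent attributes might be filled in is sketched below. It rests on an assumption that happens to hold for the records quoted in this thread: an mRNA ID is the gene ID plus a dot and a numeric transcript suffix (`.01`). The class `ParentAdder` and its method are hypothetical names, not an existing tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes tab-delimited, 9-column GFF3 lines.
class ParentAdder {
    private static final Pattern ID = Pattern.compile("(?:^|;)ID=([^;]+)");

    /** Adds Parent=<gene id> to an mRNA line that lacks one; other lines are returned unchanged. */
    static String addGeneParent(String gffLine) {
        String[] cols = gffLine.split("\t", -1);
        if (cols.length != 9 || !cols[2].equals("mRNA")) return gffLine;
        String attrs = cols[8];
        if (attrs.startsWith("Parent=") || attrs.contains(";Parent=")) return gffLine;
        Matcher m = ID.matcher(attrs);
        if (!m.find()) return gffLine;
        String id = m.group(1);
        // Assumption: transcript ID = gene ID + "." + digits (e.g. ".01").
        String geneId = id.replaceFirst("\\.\\d+$", "");
        if (geneId.equals(id)) return gffLine; // no suffix found, so do not guess
        // Insert the Parent attribute right after the ID attribute.
        cols[8] = attrs.replaceFirst("(^|;)ID=[^;]*",
            "$0;Parent=" + Matcher.quoteReplacement(geneId));
        return String.join("\t", cols);
    }

    public static void main(String[] args) {
        String line = "vigan.Shumari.gnm1.Chr01\tVangularis_v1.a1\tmRNA\t54386\t57737\t.\t-\t.\t"
            + "ID=vigan.Shumari.gnm1.ann1.Vigan.01G000200.01;Name=Vigan.01G000200.01";
        System.out.println(addGeneParent(line));
    }
}
```

The sketch does not check that a gene record with the derived ID actually exists in the file; a real fix would probably want that check.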
@cann0010 @sammyjava as in #31, doing AHRD on the original file adds Ontology_terms and suchlike to similar (but not identical) data already present in the original mRNA records (and which got copied into gene records in this case when one of us constructed gene records for the mRNAs). Without doing any cleansing of the original info (aside from substituting new for old "Note" attribute in the gene record), here's how things end up (bolding here is for the newly added stuff):
vigan.Shumari.gnm1.Chr01 Vangularis_v1.a1 gene 54386 57737 . - . ID=vigan.Shumari.gnm1.ann1.Vigan.01G000200;Name=Vigan.01G000200;Locus_id=Vigan.01G000200;Note=tetratricopeptide repeat (TPR)-containing protein%3B IPR011990 (Tetratricopeptide-like helical)%2C IPR012336 (Thioredoxin-like fold)%3B GO:0005515 (protein binding);InterPro=Tetratricopeptide-like helical (IPR011990),Thioredoxin-like fold (IPR012336),Tetratricopeptide repeat-containing domain (IPR013026),Tetratricopeptide repeat (IPR019734);GO=Molecular Function: protein binding (GO:0005515);Expression level (FPKM)=Co%2CEm%2CFl%2CLe%2CNo%2CPo%2CRo%2CSt:15.64%2C19.28%2C7.63%2C34.32%2C3.20%2C11.60%2C3.45%2C11.57;Dbxref=Gene3D:G3DSA:1.25.40.10,Gene3D:G3DSA:3.40.30.10,InterPro:IPR011990,InterPro:IPR012336,InterPro:IPR013026,InterPro:IPR019734,PANTHER:PTHR22904,Pfam:PF13181,Pfam:PF13414,Prosite:PS50293,SMART:SM00028,Superfamily:SSF48452,Superfamily:SSF52833;Ontology_term=GO:0005515
vigan.Shumari.gnm1.Chr01 Vangularis_v1.a1 mRNA 54386 57737 . - . ID=vigan.Shumari.gnm1.ann1.Vigan.01G000200.01;Parent=vigan.Shumari.gnm1.ann1.Vigan.01G000200;Name=Vigan.01G000200.01;Locus_id=Vigan.01G000200;Note=Similar to Uncharacterized protein. [I1K580%2C Glycine max];InterPro=Tetratricopeptide-like helical (IPR011990),Thioredoxin-like fold (IPR012336),Tetratricopeptide repeat-containing domain (IPR013026),Tetratricopeptide repeat (IPR019734);GO=Molecular Function: protein binding (GO:0005515);Expression level (FPKM)=Co%2CEm%2CFl%2CLe%2CNo%2CPo%2CRo%2CSt:15.64%2C19.28%2C7.63%2C34.32%2C3.20%2C11.60%2C3.45%2C11.57;Dbxref=Gene3D:G3DSA:1.25.40.10,Gene3D:G3DSA:3.40.30.10,InterPro:IPR011990,InterPro:IPR012336,InterPro:IPR013026,InterPro:IPR019734,PANTHER:PTHR22904,Pfam:PF13181,Pfam:PF13414,Prosite:PS50293,SMART:SM00028,Superfamily:SSF48452,Superfamily:SSF52833;Ontology_term=GO:0005515
Let me know if you think it would be better to clean out/overwrite more of the original stuff in such cases.
Here's the attributes I read from GFFs:
// attributes
String id = featureI.getAttribute("ID");
String name = featureI.getAttribute("Name");
String parent = featureI.getAttribute("Parent");
String note = featureI.getAttribute("Note");
String dbxref = featureI.getAttribute("Dbxref");
String ontology_term = featureI.getAttribute("Ontology_term");
String alleles = featureI.getAttribute("alleles");
(alleles being for genetic_marker.)
If it ain't in one of those attributes, it ain't in the mine. Also, I load the GO terms from the Ontology_term attribute, not from parsing it out of Notes. Notes just gets reproduced in the description attribute of SequenceFeature.
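To make the practical consequence concrete, here is a minimal standalone sketch of the idea: GO terms come only from the Ontology_term attribute, and a GO ID that appears only inside Note is not picked up. This is not the loader's actual code. `GffAttributes`, `parse`, and `ontologyTerms` are hypothetical names, and the sketch assumes a plain `key=value;key=value` attributes column with comma-separated Ontology_term values.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the real loader uses its own GFF reader.
class GffAttributes {
    /** Splits a GFF3 attributes column into key/value pairs (values left URL-encoded). */
    static Map<String, String> parse(String attributes) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : attributes.split(";")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) continue; // skip empty or malformed pairs
            map.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return map;
    }

    /** Returns the terms listed in Ontology_term; never looks inside Note. */
    static List<String> ontologyTerms(String attributes) {
        String value = parse(attributes).get("Ontology_term");
        if (value == null || value.isEmpty()) return Collections.emptyList();
        return Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        String attrs = "ID=x;Note=GO:0005515 (protein binding);Ontology_term=GO:0005515";
        System.out.println(ontologyTerms(attrs));
    }
}
```

Under this reading, the extra InterPro=, GO=, and Expression level attributes carried over from the source file would simply be ignored at load time.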
I'll close this and reopen it down the line if there is a problem. All I care about is the attributes that I listed and reasonably reasonable identifiers and if it loads, it loads.
OK, the updated version is in the datastore.
Looks like this one hasn't been given the AHRD treatment. The original descriptors encoded into Notes in the current file resolve a mystery as to how one can have an mRNA without CDS, because seemingly every case where exons are given as the children of the mRNA record instead of CDS/UTR features, the Note says something like: Note=Non-protein coding gene
We should do something about these, probably just convert to ncRNA and let them be. After all, it's entirely possible that the difference between an azuki bean that makes a good beard and one that doesn't is due to non-coding genes.
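A rough sketch of that conversion, under stated assumptions: the Note text is matched on the literal prefix `Non-protein coding gene` (the actual wording in the file may vary, since the above only says "something like"), and only the type column of the mRNA line is changed. `NcRnaRetyper` is a hypothetical name.

```java
// Hypothetical helper; assumes tab-delimited, 9-column GFF3 lines.
class NcRnaRetyper {
    /** Changes the feature type from mRNA to ncRNA when the Note marks it as non-coding. */
    static String retype(String gffLine) {
        String[] cols = gffLine.split("\t", -1);
        if (cols.length == 9 && cols[2].equals("mRNA") && hasNonCodingNote(cols[8])) {
            cols[2] = "ncRNA";
            return String.join("\t", cols);
        }
        return gffLine;
    }

    // Assumption: the marker is a Note attribute starting with this exact text.
    static boolean hasNonCodingNote(String attributes) {
        for (String pair : attributes.split(";")) {
            if (pair.startsWith("Note=Non-protein coding gene")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String line = "Chr01\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=g9.01;Note=Non-protein coding gene";
        System.out.println(retype(line));
    }
}
```

The sketch leaves the parent gene record and the child exons untouched; whether the gene type also needs changing is a separate decision.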