Gene Name attributes that start with gensp.

We talked about this already, but I'd like to take action and update the Datastore where appropriate.

There are a number of gene_models_main GFFs that have the Name attribute starting with gensp. This seems non-conformant to me, in the sense that Name is meant to be what a gene is called in the source material and the gensp prefixes tend to be an LIS thing.

Here's the list with an example of a GFF line for each case. I'd like @StevenCannon-USDA to confirm that the gene Name attributes should, in fact, contain the gensp prefix in these cases or, when not, to update the GFFs (outsourcing that to me is fine). Also, @adf-ncgr may have some arcane reasons for including the gensp prefix in certain cases. (Name uniqueness does not qualify as a reason, in my opinion, but he may have some JBrowse-related or other reasons for doing so.)

 starts_with | count 
-------------+-------
 cajca.      | 40071
 cicar.      | 58526
 glyma.      | 96036
 glyso.      | 102507
 lupal.      | 38258
 lupan.      | 33072
 medtr.      | 517176
 tripr.      | 39948
 vigun.      | 31948
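
Counts like these can also be approximated straight from the GFF files. The following is a rough sketch, not necessarily how the table above was produced. It assumes a gensp prefix is exactly five lowercase letters followed by a dot, and `count_name_prefixes` is a hypothetical helper name, not an existing Datastore tool.

```shell
# Count gene Name attributes by their leading "gensp." prefix.
# Reads uncompressed GFF3 on stdin; prints "prefix<TAB>count", sorted by prefix.
# Assumption: a gensp prefix is five lowercase letters followed by a dot.
count_name_prefixes() {
  awk -F'\t' '
    $3 == "gene" {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^Name=[a-z][a-z][a-z][a-z][a-z]\./) {
          # "Name=" is 5 characters, so the 6-character prefix starts at position 6
          count[substr(attrs[i], 6, 6)]++
        }
      }
    }
    END { for (p in count) printf "%s\t%d\n", p, count[p] }
  ' | sort
}
```

Typical use would be something like `gzip -dc some.gene_models_main.gff3.gz | count_name_prefixes`. A Name that legitimately begins with five lowercase letters and a dot would be counted too, so the output is a list of candidates to inspect rather than a verdict.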

cajca.ICPL87119.gnm1.ann1.Y27M.gene_models_main.gff3.gz:

cajca.ICPL87119.gnm1.Cc01   GLEAN   gene    13892   14559   0.659822    +   .   ID=cajca.ICPL87119.gnm1.ann1.C.cajan_19181;Name=cajca.C.cajan_19181;evid_id=C.cajan_GLEAN_10029733;Dbxref=Gene3D:G3DSA:1.10.10.60,InterPro:IPR001005,InterPro:IPR006447,InterPro:IPR009057,InterPro:IPR017930,JCVI_TIGRFAMS:TIGR01557,PANTHER:PTHR12802,PANTHER:PTHR12802:SF23,Pfam:PF00249,Prosite:PS51294,SMART:SM00717,Superfamily:SSF46689;Ontology_term=GO:0003677,GO:0003682;Note=MYB transcription factor MYB114 isoform X2 [Glycine max]%3B IPR009057 (Homeodomain-like)%3B GO:0003677 (DNA binding)%2C GO:0003682 (chromatin binding)

cicar.CDCFrontier.gnm1.ann1.nRhs.gene_models_main.gff3.gz

cicar.CDCFrontier.gnm1.C11095950    GLEAN   gene    138 470 0.999968    +   .   ID=cicar.CDCFrontier.gnm1.ann1.Ca_28062;Name=cicar.CDCFrontier.Ca_28062;evid_id=GAR_10000002;Note=SAUR-like auxin-responsive protein family%3B IPR003676 (Auxin-induced protein%2C ARG7);Dbxref=InterPro:IPR003676,PANTHER:PTHR31374,PANTHER:PTHR31374:SF0,Pfam:PF02519;

cicar.ICC4958.gnm2.ann1.LCVX.gene_models_main.gff3.gz

cicar.ICC4958.gnm2.Ca1  cicar.ICC4958.gnm2.ann1 gene    6359    6790    .   +   .   ID=cicar.ICC4958.gnm2.ann1.Ca_00001;Name=cicar.ICC4958.Ca_00001;Dbxref=Gene3D:G3DSA:3.40.50.720,InterPro:IPR009036,InterPro:IPR016040,PANTHER:PTHR10953,PANTHER:PTHR10953:SF29,Superfamily:SSF69572;Note=NEDD8-activating enzyme E1 regulatory subunit-like protein%3B IPR016040 (NAD(P)-binding domain)

Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz

glyma.Lee.gnm1.Gm01 phytozomev13    gene    37775   37993   .   +   .   ID=glyma.Lee.gnm1.ann1.GlymaLee.01G000100;Name=glyma.Lee.gnm1.ann1.GlymaLee.01G000100;Note=Unknown protein

Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz

glyma.Wm82_ISU01.gnm2.Gm01  phytozomev13    gene    78503   103594  .   -   .   ID=glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050;Name=glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050;Dbxref=Gene3D:G3DSA:3.30.390.10,Gene3D:G3DSA:3.40.50.970,Prosite:PS51257,Superfamily:SSF52518;Note=protein PHYLLO%2C chloroplastic-like isoform X5 [Glycine max]

Glycine/soja/annotations/PI483463.gnm1.ann1.3Q3Q/glyso.PI483463.gnm1.ann1.3Q3Q.gene_models_main.gff3.gz

glyso.PI483463.gnm1.Gs01    phytozomev13    gene    42343   43123   .   -   .   ID=glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100;Name=glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100;Dbxref=Gene3D:G3DSA:3.30.390.10;Note=protein PHYLLO%2C chloroplastic-like isoform X4 [Glycine max]

Glycine/soja/annotations/W05.gnm1.ann1.T47J/glyso.W05.gnm1.ann1.T47J.gene_models_main.gff3.gz

glyso.W05.gnm1.Chr01    maker   gene    60339   60901   .   -   .   ID=glyso.W05.gnm1.ann1.Glysoja.01G000001;Name=glyso.W05.gnm1.ann1.Glysoja.01G000001;Dbxref=Gene3D:G3DSA:3.30.390.10;Note=protein PHYLLO%2C chloroplastic-like isoform X1 [Glycine max]

Lupinus/albus/annotations/Amiga.gnm1.ann1.3GKS/lupal.Amiga.gnm1.ann1.3GKS.gene_models_main.gff3.gz

lupal.Amiga.gnm1.Lalb_Chr00c01  EuGene  gene    40143   40433   .   +   .   ID=lupal.Amiga.gnm1.ann1.gene:Lalb_Chr00c01g0403611;Name=lupal.Lalb_Chr00c01g0403611;locus_tag=Lalb_Chr00c01g0403611;Dbxref=PANTHER:PTHR11439;Note=Retrotransposon protein%2C putative%2C unclassified n%3D1 Tax%3DOryza sativa subsp. japonica RepID%3DQ10SZ0_ORYSJ

Lupinus/angustifolius/annotations/Tanjil.gnm1.ann1.nnV9/lupan.Tanjil.gnm1.ann1.nnV9.gene_models_main.gff3.gz

lupan.Tanjil.gnm1.NLL-01    lupan.Tanjil.gnm1.ann1.nnV9 gene    603 4044    0.696   +   .   ID=lupan.Tanjil.gnm1.ann1.Lup027320;Name=lupan.Lup027320;source_id=Lupinus_GLEAN_10030675;identical_support_id=CUFF72.441.1;Dbxref=Gene3D:G3DSA:1.20.1250.20,InterPro:IPR001917,InterPro:IPR003663,InterPro:IPR005828,InterPro:IPR005829,InterPro:IPR016196,InterPro:IPR020846,JCVI_TIGRFAMS:TIGR00879,PANTHER:PTHR24063,PANTHER:PTHR24063:SF171,PRINTS:PR00171,Pfam:PF00083,Prosite:PS00216,Prosite:PS00217,Prosite:PS00599,Prosite:PS50850,Superfamily:SSF103473;Ontology_term=GO:0005215,GO:0006810,GO:0008152,GO:0016020,GO:0016021,GO:0016740,GO:0022857,GO:0022891,GO:0055085;Note=Membrane transporter D1 n%3D3 Tax%3DAndropogoneae RepID%3DB6U4Q3_MAIZE%3B IPR001917 (Aminotransferase%2C class-II%2C pyridoxal-phosphate binding site)%2C IPR005828 (General substrate transporter)%2C IPR016196 (Major facilitator superfamily domain%2C general substrate transporter)%3B GO:0005215 (transporter activity)%2C GO:0006810 (transport)%2C GO:0008152 (metabolic process)%2C GO:0016020 (membrane)%2C GO:0016021 (integral component of membrane)%2C GO:0016740 (transferase activity)%2C GO:0022857 (transmembrane transporter activity)%2C GO:0022891 (substrate-specific transmembrane transporter activity)%2C GO:0055085 (transmembrane transport)

Medicago/truncatula/annotations/HM056.gnm1.ann1.CHP6/medtr.HM056.gnm1.ann1.CHP6.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM058.gnm1.ann1.LXPZ/medtr.HM058.gnm1.ann1.LXPZ.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM060.gnm1.ann1.H41P/medtr.HM060.gnm1.ann1.H41P.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM095.gnm1.ann1.55W4/medtr.HM095.gnm1.ann1.55W4.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM125.gnm1.ann1.KY5W/medtr.HM125.gnm1.ann1.KY5W.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM129.gnm1.ann1.7FTD/medtr.HM129.gnm1.ann1.7FTD.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM185.gnm1.ann1.GB3D/medtr.HM185.gnm1.ann1.GB3D.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM324.gnm1.ann1.SQH2/medtr.HM324.gnm1.ann1.SQH2.gene_models_main.gff3.gz

medtr.HM324.gnm1.scaffold_0 .   gene    5673    6194    .   +   .   ID=medtr.HM324.gnm1.ann1.g1;Name=medtr.HM324.g1;Dbxref=InterPro:IPR010259,Pfam:PF05922;Ontology_term=GO:0004252,GO:0042802,GO:0043086;Note=subtilisin-like protease-like isoform X7 [Glycine max]%3B IPR010259 (Proteinase inhibitor I9)%3B GO:0004252 (serine-type endopeptidase activity)%2C GO:0042802 (identical protein binding)%2C GO:0043086 (negative regulation of catalytic activity)

Trifolium/pratense/annotations/MilvusB.gnm2.ann1.DFgp/tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main.gff3.gz

tripr.MilvusB.gnm2.Tp1  ensembl gene    1135    2485    .   -   .   ID=tripr.MilvusB.gnm2.ann1.gene2499;Name=tripr.gene2499;Note=F1F0-ATPase inhibitor protein%252C putative%253B IPR007648 (ATPase inhibitor%252C IATP%252C mitochondria)%253B GO:0004857 (enzyme inhibitor activity)%252C GO:0005739 (mitochondrion)%252C GO:0045980 (negative regulation of nucleotide metabolic process)%253B*-**%253B AT5G04750.1;Dbxref=Coils:Coil,InterPro:IPR007648,Pfam:PF04568;Ontology_term=GO:0004857,GO:0005739,GO:0045980

Vigna/unguiculata/annotations/IT97K-499-35.gnm1.ann2.FD7K/vigun.IT97K-499-35.gnm1.ann2.FD7K.gene_models_main.gff3.gz

vigun.IT97K-499-35.gnm1.Vu01    phytozomev13    gene    1951    3899    .   +   .   ID=vigun.IT97K-499-35.gnm1.ann2.Vigun01g000100;Name=vigun.IT97K-499-35.Vigun01g000100;ancestorIdentifier=Vigun01g000100.v1.1;Dbxref=InterPro:IPR011108,PANTHER:PTHR11203,PANTHER:PTHR11203:SF8,Pfam:PF07521,Superfamily:SSF56281;Note=cleavage and polyadenylation specificity factor 73 kDa subunit-II%3B IPR011108 (RNA-metabolising metallo-beta-lactamase)

Here are my opinions about those cases:

cajca.C.cajan_19181 ==> C.cajan_19181
cicar.CDCFrontier.Ca_28062 ==> Ca_28062
cicar.ICC4958.Ca_00001 ==> Ca_00001
glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100
glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050 ==> GmISU01.01G000050  *[1]
glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100 ==> GlysoPI483463.01G000100
glyso.W05.gnm1.ann1.Glysoja.01G000001 ==> Glysoja.01G000001
lupal.Lalb_Chr00c01g0403611 ==> Lalb_Chr00c01g0403611
lupan.Lup027320 ==> Lup027320
medtr.HM324.g1 ==> g1     *[2]
tripr.gene2499 ==> gene2499
vigun.IT97K-499-35.Vigun01g000100 ==> Vigun01g000100
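
Each of these rewrites amounts to removing a fixed, literal prefix from the Name attribute, with a different prefix per annotation. Here is a minimal sketch of that operation, assuming tab-separated GFF3 on stdin and assuming that only rows of type `gene` should be edited. `strip_name_prefix` is a hypothetical helper name, and the ID attribute is deliberately left alone.

```shell
# Remove a literal prefix from the Name attribute of gene features.
# Usage: strip_name_prefix PREFIX < in.gff3 > out.gff3
# Assumptions: tab-separated GFF3 on stdin; only rows whose type (column 3)
# is exactly "gene" are edited; the ID attribute is left untouched.
strip_name_prefix() {
  awk -F'\t' -v OFS='\t' -v prefix="$1" '
    $3 == "gene" {
      n = split($9, attrs, ";")
      out = ""
      for (i = 1; i <= n; i++) {
        # index() compares literally, so dots in the prefix are not wildcards
        if (index(attrs[i], "Name=" prefix) == 1) {
          attrs[i] = "Name=" substr(attrs[i], length("Name=" prefix) + 1)
        }
        out = out (i > 1 ? ";" : "") attrs[i]
      }
      $9 = out
    }
    { print }
  '
}
```

For the Lee annotation, for example, the call would be `strip_name_prefix 'glyma.Lee.gnm1.ann1.'`. Whether child features (mRNA and so on) carry the same prefix in their Names would need checking per file; this sketch does not touch them.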

Notes: *[1] Wm82_ISU01 will probably go away soon, to be replaced by an annotation called Wm82.gnm6.ann1, with names looking like glyma.01G00100 (derived from Wm82.gnm4.ann4 names when possible).

*[2] For the Medicago genes, here is a full list of the forms:

# Print the ID and Name of the first gene feature in each gene-models GFF.
for file in */*gene_models_main.gff3.gz; do
  zcat "$file" | awk -F'\t' '$3 == "gene" {print $9; exit}' |
  perl -lne '($id) = /(?:^|;)ID=([^;]+)/; ($name) = /(?:^|;)Name=([^;]+)/; print "$id\t$name"'
done
medtr.A17_HM341.gnm4.ann2.Medtr1g004930 Medtr1g004930
medtr.A17.gnm5.ann1_6.MtrunA17CPg0492171    MtrunA17CPg0492171
medtr.HM004.gnm1.ann1.g1    HM004.g1
medtr.HM010.gnm1.ann1.g1    HM010.g1
medtr.HM022.gnm1.ann1.g1    HM022.g1
medtr.HM023.gnm1.ann1.g1    HM023.g1
medtr.HM034.gnm1.ann1.g1    HM034.g1
medtr.HM050.gnm1.ann1.g1    HM050.g1
medtr.HM056.gnm1.ann1.g1    medtr.HM056.g1
medtr.HM058.gnm1.ann1.g1    medtr.HM058.g1
medtr.HM060.gnm1.ann1.g1    medtr.HM060.g1
medtr.HM095.gnm1.ann1.g1    medtr.HM095.g1
medtr.HM125.gnm1.ann1.h3436.02  medtr.HM125.h3436.02
medtr.HM129.gnm1.ann1.g1    medtr.HM129.g1
medtr.HM185.gnm1.ann1.g1    medtr.HM185.g1
medtr.HM324.gnm1.ann1.g1    medtr.HM324.g1
medtr.R108_HM340.gnm1.ann1.BZG31_000s000010 BZG31_000s000010
medtr.R108.gnmHiC_1.ann1.MtrunR108HiC_000001    MtrunR108HiC000001

For these, I'd like a second opinion from @adf-ncgr and Joann if appropriate. Regularity says the genes in the Zhou ... Young set should have the form "g#". But I feel a little squeamish about this. These genomes were all released and described as a group, and the assemblies and accessions are referred to in the main paper (Zhou, Silverstein et al., 2017) by the five-character HM### string. That said, I don't see particular instances in the paper where particular genes are discussed by name, so I don't feel I can make a strong argument for going beyond "g#".

Thanks for the opinions, @StevenCannon-USDA. I'll put these into a to-do checkbox list here and start updating them after giving @adf-ncgr and @joannmudge a chance to object. As for the Zhou, Silverstein, et al. genomes, I agree that the Name attribute should be just the final piece (g1) for regularity, but we do sacrifice regularity at times for Higher Reasons.

[ ] cajca.C.cajan_19181 ==> C.cajan_19181
[ ] cicar.CDCFrontier.Ca_28062 ==> Ca_28062
[ ] cicar.ICC4958.Ca_00001 ==> Ca_00001
[ ] glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100
[ ] glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050 ==> GmISU01.01G000050 *[1]
[ ] glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100 ==> GlysoPI483463.01G000100
[ ] glyso.W05.gnm1.ann1.Glysoja.01G000001 ==> Glysoja.01G000001
[ ] lupal.Lalb_Chr00c01g0403611 ==> Lalb_Chr00c01g0403611
[ ] lupan.Lup027320 ==> Lup027320
[ ] tripr.gene2499 ==> gene2499
[ ] vigun.IT97K-499-35.Vigun01g000100 ==> Vigun01g000100

OK, I'm in agreement with most of these, but I think the original Names for tripr were actually like "Tp57577_TGAC_v2_gene10066", not just "gene10066". Should we use that instead, as Phytozome and Ensembl seem to do? I'm still a little unclear about what the principle is here (originalism or aesthetics), though we once tried to pin it down here: https://github.com/legumeinfo/datastore-specifications/issues/44

I would personally vote to keep medtr.HMxxx.g1 (or at least HMxxx.g1), which is seemingly no more problematic than having GlymaLee as part of a name; it just happens to also be identical to part of our full yuck system. But if we think g1 is better for any given Medicago accession, I think that implies that strict Name originalism is the principle here, no matter how bad we think the names are, meaning we should be stuck with Tp57577_TGAC_v2_gene10066.

But whatever we decide, let's take it as an opportunity to resolve the open questions in https://github.com/legumeinfo/datastore-specifications/issues/44

I think we're close to convergence, and are down to the point of splitting hairs - which I guess is unavoidable. Here's the spec as it stands: https://github.com/legumeinfo/datastore-specifications/tree/main/Genus/species/annotations

And a key clause:

Where available in the original annotations, the names should come from those annotation files, with the possible exception of stripping type identifiers (e.g. "gene:"), or shortening exceptionally cumbersome auto-generated strings or lengthy prefixes added in the original annotation form if those prefixes do not contribute to the uniqueness of the names within the annotation file. Such exceptions will need to be considered on a case-by-case basis.

I would say that "exceptionally cumbersome strings ... if those prefixes do not contribute to the uniqueness of the names within the annotation file" is a fair description of Tp57577_TGAC_v2_gene10066. I mean: the Trifolium team has encoded Genus (T), species (p), accession (57577 I think), sequencing center (TGAC), and assembly version v2. I think this is a worthy case for an exception (shortening it to "gene10066"). But I won't fight anyone over it. If Sam is implementing, I say: go ahead and do what you think is right, and we'll be prepared to be delighted.
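
One way to check the "do not contribute to the uniqueness" condition in that clause is to strip the candidate prefix and look for collisions among the shortened Names. This is a rough sketch under the same assumptions as before (tab-separated GFF3 on stdin, gene rows only). `duplicate_names_after_strip` is a hypothetical helper name.

```shell
# List gene Names that would collide if a literal prefix were stripped.
# Usage: duplicate_names_after_strip PREFIX < in.gff3
# Prints "name<TAB>count" for every shortened Name seen more than once;
# no output means the shortened Names are still unique within the file.
duplicate_names_after_strip() {
  awk -F'\t' -v prefix="$1" '
    $3 == "gene" {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (index(attrs[i], "Name=") != 1) continue
        name = substr(attrs[i], 6)
        # literal comparison: strip the prefix only when the Name starts with it
        if (index(name, prefix) == 1) name = substr(name, length(prefix) + 1)
        seen[name]++
      }
    }
    END { for (name in seen) if (seen[name] > 1) printf "%s\t%d\n", name, seen[name] }
  ' | sort
}
```

For the Trifolium case the call would be `duplicate_names_after_strip 'Tp57577_TGAC_v2_'` on the original annotation; empty output would support treating the prefix as droppable under the clause.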

Thanks @StevenCannon-USDA, sounds like that clause is indeed the final refuge for the hair-splitters! I am in favor of shortening where there is substantial overlap with what full yuck is accomplishing. I think this would mean that we'd allow: Name=Lcu.2RBY.1g010820 ID=lencu.CDC_Redberry.gnm2.ann1.1g010820 if we need to keep the IDs below some max length limit imposed by certain tools (e.g. BLAST)? Name here is "original". Or would we require that Name be 1g010820 if we invoked the "lengthy prefixes" clause on this one?

@adf-ncgr - yeah, I think I'd leave Lcu.2RBY.1g010820 (which means changing the ID in that case).

... When you're running an Airbnb and some guests insist on bringing all their own furniture.

This issue (see the title of this issue) is about the gensp prefixes, which it appears we all agree should be dropped. A protocol for how we populate the Name attribute otherwise is certainly a Good Thing. I don't see any argument for keeping the gensp prefix here, so I'll yank those from the appropriate places, and we can move the discussion of Names in general back to https://github.com/legumeinfo/datastore-specifications/issues/44 . I'll keep this issue open just so I can hit my checkboxes.

And yes, in the few cases where Name is full-yuck, I'll de-yuckify it down to the non-yuck portion. (example: glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100).

legumeinfo / datastore-issues

Gene Name attributes that start with gensp. #187