Open sammyjava opened 9 months ago
And I'll assign myself to back-editing the existing annotation collections to follow the specification established from this task. The goal will be to have the updated annotations ready for the mine 5.1.0.4 load in the future.
Starting with diagnosis. Here's what we have now:
# For each gene_models_main GFF, print the file basename plus the ID and Name
# of the last gene record found among the first 10000 non-comment lines.
for filepath in /usr/local/www/data/v2/*/*/annotations/*/*gene_models_main.gff3.gz ; do
  base=$(basename "$filepath" .gff3.gz)
  zcat "$filepath" | grep -v "#" | head -10000 |
    awk -v BN="$base" '$3~/gene/ {print BN "\t" $9}' | tail -1 |
    perl -lane '$base=$F[0]; @attrs=split(";", $F[1]);
      @id=grep(/ID=/, @attrs);
      @name=grep(/Name=/, @attrs);
      unless (defined $name[0] ){$name[0]="MISSING"};
      $id[0] =~ s/ID=//;
      $name[0] =~ s/Name=//;
      print join("\t", $base, $id[0], $name[0]);
    '
done
aesev.CIAT22838.gnm1.ann1.ZM3R.gene_models_main aesev.CIAT22838.gnm1.ann1.Ae01g07470 Ae01g07470
aradu.V14167.gnm1.ann1.cxSM.gene_models_main aradu.V14167.gnm1.ann1.Aradu.KGT5H Aradu.KGT5H
arahy.BaileyII.gnm1.ann1.PQM7.gene_models_main arahy.BaileyII.gnm1.ann1.mikado.chr01G571 mikado.chr01G571
arahy.Tifrunner.gnm1.ann1.CCJH.gene_models_main arahy.Tifrunner.gnm1.ann1.HA8THR HA8THR
arahy.Tifrunner.gnm2.ann1.4K0L.gene_models_main arahy.Tifrunner.gnm2.ann1.HA8THR HA8THR
arahy.Tifrunner.gnm2.ann2.PVFB.gene_models_main arahy.Tifrunner.gnm2.ann2.Ah01g088800 Ah01g088800
araip.K30076.gnm1.ann1.J37m.gene_models_main araip.K30076.gnm1.ann1.Araip.L423N Araip.L423N
cajca.ICPL87119.gnm1.ann1.Y27M.gene_models_main cajca.ICPL87119.gnm1.ann1.C.cajan_04851 cajca.C.cajan_04851
cajca.ICPL87119.gnm2.ann1.L3ZH.gene_models_main cajca.ICPL87119.gnm2.ann1.Cc_00501 Cc_00501
cerca.ISC453364.gnm3.ann1.3N1M.gene_models_main cerca.ISC453364.gnm3.ann1.Cecan.1G059700 Cecan.1G059700
cicar.CDCFrontier.gnm1.ann1.nRhs.gene_models_main cicar.CDCFrontier.gnm1.ann1.Ca_02646 cicar.CDCFrontier.Ca_02646
cicar.CDCFrontier.gnm2.ann1.9M1L.gene_models_main cicar.CDCFrontier.gnm2.ann1.Ca_00491 Ca_00491
cicar.CDCFrontier.gnm3.ann1.NPD7.gene_models_main cicar.CDCFrontier.gnm3.ann1.Ca1g082300 Ca1g082300
cicar.ICC4958.gnm2.ann1.LCVX.gene_models_main cicar.ICC4958.gnm2.ann1.Ca_00646 cicar.ICC4958.Ca_00646
cicec.S2Drd065.gnm1.ann1.YZ9H.gene_models_main cicec.S2Drd065.gnm1.ann1.Ce0g133700 Ce0g133700
cicre.Besev079.gnm1.ann1.F01Z.gene_models_main cicre.Besev079.gnm1.ann1.Cr1g085100 Cr1g085100
faial.WAFC.gnm1.ann1.RTP9.gene_models_main faial.WAFC.gnm1.ann1.Faial112S01341 MISSING
glycy.G1267.gnm1.ann1.HRFD.gene_models_main glycy.G1267.gnm1.ann1.Gcy1g000849 Gcy1g000849
glyd3.G1403.gnm1.ann1.XNZQ.gene_models_main glyd3.G1403.gnm1.ann1.Gto1g000840 Gto1g000840
glydo.G1134.gnm1.ann1.4BJM.gene_models_main glydo.G1134.gnm1.ann1.Gtt1g000960 Gtt1g000960
glyfa.G1718.gnm1.ann1.2KSV.gene_models_main glyfa.G1718.gnm1.ann1.Gfa1g000870 Gfa1g000870
glyma.58-161.gnm1.ann1.HJ1K.gene_models_main glyma.58-161.gnm1.ann1.SoyL04_01G042100 MISSING
glyma.Amsoy.gnm1.ann1.6S5P.gene_models_main glyma.Amsoy.gnm1.ann1.SoyC05_01G042200 MISSING
glyma.DongNongNo_50.gnm1.ann1.QSDB.gene_models_main glyma.DongNongNo_50.gnm1.ann1.SoyC12_01G042400 MISSING
glyma.FengDiHuang.gnm1.ann1.P6HL.gene_models_main glyma.FengDiHuang.gnm1.ann1.SoyL07_01G041300 MISSING
glyma.FiskebyIII.gnm1.ann1.SS25.gene_models_main glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.01G052300 GlymaFiskIII.01G052300
glyma.HanDouNo_5.gnm1.ann1.ZS7M.gene_models_main glyma.HanDouNo_5.gnm1.ann1.SoyC09_01G038500 MISSING
glyma.Hefeng25_IGA1002.gnm1.ann1.320V.gene_models_main glyma.Hefeng25_IGA1002.gnm1.ann1.SoyHF25_01R004633 SoyHF25_01R004633
glyma.HeiHeNo_43.gnm1.ann1.PDXG.gene_models_main glyma.HeiHeNo_43.gnm1.ann1.SoyC13_01G041300 MISSING
glyma.Huaxia3_IGA1007.gnm1.ann1.LKC7.gene_models_main glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_01G113000 SoyHX3_01G113000
glyma.Hwangkeum.gnm1.ann1.1G4F.gene_models_main glyma.Hwangkeum.gnm1.ann1.GmHk_01G000541 exosc3_1
glyma.JD17.gnm1.ann1.CLFP.gene_models_main glyma.JD17.gnm1.ann1.JD001G0045100 JD001G0045100
glyma.JiDouNo_17.gnm1.ann1.X5PX.gene_models_main glyma.JiDouNo_17.gnm1.ann1.SoyC11_01G038200 MISSING
glyma.JinDouNo_23.gnm1.ann1.SGJW.gene_models_main glyma.JinDouNo_23.gnm1.ann1.SoyC07_01G039100 MISSING
glyma.Jinyuan_IGA1006.gnm1.ann1.2NNX.gene_models_main glyma.Jinyuan_IGA1006.gnm1.ann1.SoyJY_01G119400 SoyJY_01G119400
glyma.JuXuanNo_23.gnm1.ann1.H8PW.gene_models_main glyma.JuXuanNo_23.gnm1.ann1.SoyC03_01G041000 MISSING
glyma.KeShanNo_1.gnm1.ann1.2YX4.gene_models_main glyma.KeShanNo_1.gnm1.ann1.SoyC14_01G040900 MISSING
glyma.Lee.gnm1.ann1.6NZV.gene_models_main glyma.Lee.gnm1.ann1.GlymaLee.01G069600 glyma.Lee.gnm1.ann1.GlymaLee.01G069600
glyma.Lee.gnm2.ann1.1FNT.gene_models_main glyma.Lee.gnm2.ann1.Gm_00676 Gm_00676
glyma.PI_398296.gnm1.ann1.B0XR.gene_models_main glyma.PI_398296.gnm1.ann1.SoyL05_01G037500 MISSING
glyma.PI_548362.gnm1.ann1.LL84.gene_models_main glyma.PI_548362.gnm1.ann1.SoyC10_01G038700 MISSING
glyma.QiHuangNo_34.gnm1.ann1.WHRV.gene_models_main glyma.QiHuangNo_34.gnm1.ann1.SoyC08_01G039600 MISSING
glyma.ShiShengChangYe.gnm1.ann1.VLGS.gene_models_main glyma.ShiShengChangYe.gnm1.ann1.SoyL09_01G041900 MISSING
glyma.TieFengNo_18.gnm1.ann1.7GR4.gene_models_main glyma.TieFengNo_18.gnm1.ann1.SoyC02_01G036400 MISSING
glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G040100 MISSING
glyma.TongShanTianEDan.gnm1.ann1.56XW.gene_models_main glyma.TongShanTianEDan.gnm1.ann1.SoyL03_01G037700 MISSING
glyma.WanDouNo_28.gnm1.ann1.NLYP.gene_models_main glyma.WanDouNo_28.gnm1.ann1.SoyC04_01G043200 MISSING
glyma.Wenfeng7_IGA1001.gnm1.ann1.ZK5W.gene_models_main glyma.Wenfeng7_IGA1001.gnm1.ann1.SoyWF7_01R004450 SoyWF7_01R004450
glyma.Wm82_IGA1008.gnm1.ann1.FGN6.gene_models_main glyma.Wm82_IGA1008.gnm1.ann1.SoyW82_01G112700 SoyW82_01G112700
glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G058800 glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G058800
glyma.Wm82.gnm1.ann1.DvBy.gene_models_main glyma.Wm82.gnm1.ann1.Glyma01g21510 Glyma01g21510
glyma.Wm82.gnm2.ann1.RVB6.gene_models_main glyma.Wm82.gnm2.ann1.Glyma.01G067400 Glyma.01G067400
glyma.Wm82.gnm4.ann1.T8TQ.gene_models_main glyma.Wm82.gnm4.ann1.Glyma.01G069300 Glyma.01G069300
glyma.XuDouNo_1.gnm1.ann1.G2T7.gene_models_main glyma.XuDouNo_1.gnm1.ann1.SoyC01_01G041100 MISSING
glyma.YuDouNo_22.gnm1.ann1.HCQ1.gene_models_main glyma.YuDouNo_22.gnm1.ann1.SoyC06_01G040300 MISSING
glyma.Zh13_IGA1005.gnm1.ann1.87Z5.gene_models_main glyma.Zh13_IGA1005.gnm1.ann1.SoyZH13_01R004482 SoyZH13_01R004482
glyma.Zh13.gnm1.ann1.8VV3.gene_models_main glyma.Zh13.gnm1.ann1.SoyZH13_01G104400 MISSING
glyma.Zh13.gnm2.ann1.FJ3G.gene_models_main glyma.Zh13.gnm2.ann1.SoyZH13_01G052900 SoyZH13_01G052900
glyma.Zh35_IGA1004.gnm1.ann1.RGN6.gene_models_main glyma.Zh35_IGA1004.gnm1.ann1.SoyZH35_01G113100 SoyZH35_01G113100
glyma.ZhangChunManCangJin.gnm1.ann1.7HPB.gene_models_main glyma.ZhangChunManCangJin.gnm1.ann1.SoyL06_01G039700 MISSING
glyma.Zhutwinning2.gnm1.ann1.ZTTQ.gene_models_main glyma.Zhutwinning2.gnm1.ann1.SoyL01_01G044300 MISSING
glyma.ZiHuaNo_4.gnm1.ann1.FCFQ.gene_models_main glyma.ZiHuaNo_4.gnm1.ann1.SoyL02_01G037100 MISSING
glyso.F_IGA1003.gnm1.ann1.G61B.gene_models_main glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_01R004562 SoyGsojaF_01R004562
glyso.PI_549046.gnm1.ann1.65KD.gene_models_main glyso.PI_549046.gnm1.ann1.SoyW02_01G040000 MISSING
glyso.PI_562565.gnm1.ann1.1JD2.gene_models_main glyso.PI_562565.gnm1.ann1.SoyW01_01G042900 MISSING
glyso.PI_578357.gnm1.ann1.0ZKP.gene_models_main glyso.PI_578357.gnm1.ann1.SoyW03_01G041400 MISSING
glyso.PI483463.gnm1.ann1.3Q3Q.gene_models_main glyso.PI483463.gnm1.ann1.GlysoPI483463.01G083300 glyso.PI483463.gnm1.ann1.GlysoPI483463.01G083300
glyso.W05.gnm1.ann1.T47J.gene_models_main glyso.W05.gnm1.ann1.Glysoja.01G000660 glyso.W05.gnm1.ann1.Glysoja.01G000660
glyst.G1974.gnm1.ann1.F257.gene_models_main glyst.G1974.gnm1.ann1.Gst1g000848 Gst1g000848
glysy.G1300.gnm1.ann1.RRK6.gene_models_main glysy.G1300.gnm1.ann1.Gsy1g000816 Gsy1g000816
labpu.Highworth.gnm1.ann1.HJ3B.gene_models_main labpu.Highworth.gnm1.ann1.Labpu01g010580 Labpu01g010580
lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820 Lcu.2RBY.1g010820
lener.IG_72815.gnm1.ann1.R90F.gene_models_main lener.IG_72815.gnm1.ann1.Ler.1DRT.1g009870 Ler.1DRT.1g009870
lotja.MG20.gnm3.ann1.WF9B.gene_models_main lotja.MG20.gnm3.ann1.Lj0g3v0018699 Lj0g3v0018699
lupal.Amiga.gnm1.ann1.3GKS.gene_models_main lupal.Amiga.gnm1.ann1.gene:Lalb_Chr01g0003511 lupal.Lalb_Chr01g0003511
lupan.Tanjil.gnm1.ann1.nnV9.gene_models_main lupan.Tanjil.gnm1.ann1.Lup031547 lupan.Lup031547
medsa.XinJiangDaYe.gnm1.ann1.RKB9.gene_models_main medsa.XinJiangDaYe.gnm1.ann1.MS.gene57458 MS.gene57458
medtr.A17_HM341.gnm4.ann2.G3ZY.gene_models_main medtr.A17_HM341.gnm4.ann2.Medtr1g017870 MISSING
medtr.A17.gnm5.ann1_6.L2RX.gene_models_main medtr.A17.gnm5.ann1_6.MtrunA17Chr1g0152141 MtrunA17Chr1g0152141
medtr.HM004.gnm1.ann1.2XTB.gene_models_main medtr.HM004.gnm1.ann1.g876 HM004.g876
medtr.HM010.gnm1.ann1.WV9J.gene_models_main medtr.HM010.gnm1.ann1.g889 HM010.g889
medtr.HM022.gnm1.ann1.6C8N.gene_models_main medtr.HM022.gnm1.ann1.g842 HM022.g842
medtr.HM023.gnm1.ann1.WZN8.gene_models_main medtr.HM023.gnm1.ann1.g8172 HM023.g8172
medtr.HM034.gnm1.ann1.YR6S.gene_models_main medtr.HM034.gnm1.ann1.g906 HM034.g906
medtr.HM050.gnm1.ann1.GWRX.gene_models_main medtr.HM050.gnm1.ann1.g943 HM050.g943
medtr.HM056.gnm1.ann1.CHP6.gene_models_main medtr.HM056.gnm1.ann1.g854 medtr.HM056.g854
medtr.HM058.gnm1.ann1.LXPZ.gene_models_main medtr.HM058.gnm1.ann1.g4535 medtr.HM058.g4535
medtr.HM060.gnm1.ann1.H41P.gene_models_main medtr.HM060.gnm1.ann1.g4091 medtr.HM060.g4091
medtr.HM095.gnm1.ann1.55W4.gene_models_main medtr.HM095.gnm1.ann1.g4212 medtr.HM095.g4212
medtr.HM125.gnm1.ann1.KY5W.gene_models_main medtr.HM125.gnm1.ann1.g894 medtr.HM125.g894
medtr.HM129.gnm1.ann1.7FTD.gene_models_main medtr.HM129.gnm1.ann1.g857 medtr.HM129.g857
medtr.HM185.gnm1.ann1.GB3D.gene_models_main medtr.HM185.gnm1.ann1.g843 medtr.HM185.g843
medtr.HM324.gnm1.ann1.SQH2.gene_models_main medtr.HM324.gnm1.ann1.g3873 medtr.HM324.g3873
medtr.R108_HM340.gnm1.ann1.85YW.gene_models_main medtr.R108_HM340.gnm1.ann1.BZG31_000s010470 BZG31_000s010470
medtr.R108.gnmHiC_1.ann1.Y8NH.gene_models_main medtr.R108.gnmHiC_1.ann1.MtrunR108HiC_000814 MtrunR108HiC000814
phaac.Frijol_Bayo.gnm1.ann1.ML22.gene_models_main phaac.Frijol_Bayo.gnm1.ann1.Phacu.CVR.001G056200 Phacu.CVR.001G056200
phaac.W6_15578.gnm2.ann1.LVZ1.gene_models_main phaac.W6_15578.gnm2.ann1.Phacu.WLD.001G078700 Phacu.WLD.001G078700
phalu.G27455.gnm1.ann1.JD7C.gene_models_main phalu.G27455.gnm1.ann1.Pl01G0000102600 Pl01G0000102600
phavu.5-593.gnm1.ann1.3FBJ.gene_models_main phavu.5-593.gnm1.ann1.Pv5-593.01G066100 Pv5-593.01G066100
phavu.G19833.gnm1.ann1.pScz.gene_models_main phavu.G19833.gnm1.ann1.Phvul.001G088500 Phvul.001G088500
phavu.G19833.gnm2.ann1.PB8d.gene_models_main phavu.G19833.gnm2.ann1.Phvul.001G072100 Phvul.001G072100
phavu.LaborOvalle.gnm1.ann1.L1DY.gene_models_main phavu.LaborOvalle.gnm1.ann1.PvLabOv.01G061600 PvLabOv.01G061600
phavu.UI111.gnm1.ann1.8L4N.gene_models_main phavu.UI111.gnm1.ann1.PvUI111.01G074900 PvUI111.01G074900
pissa.Cameor.gnm1.ann1.7SZR.gene_models_main pissa.Cameor.gnm1.ann1.Psat1g022840 MISSING
tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main tripr.MilvusB.gnm2.ann1.gene6109 tripr.gene6109
trisu.Daliak.gnm2.ann1.MFKF.gene_models_main trisu.Daliak.gnm2.ann1.Ts_00473 Ts_00473
vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main vicfa.Hedin2.gnm1.ann1.1g243680 MISSING
vicfa.Tiffany.gnm1.ann1.Y54X.gene_models_main vicfa.Tiffany.gnm1.ann1.1g192680 MISSING
vigan.Gyeongwon.gnm3.ann1.3Nz5.gene_models_main vigan.Gyeongwon.gnm3.ann1.Vang0002ss00040 Vang0002ss00040
vigan.Shumari.gnm1.ann1.8BRS.gene_models_main vigan.Shumari.gnm1.ann1.Vigan.01G113600 Vigan.01G113600
vigra.VC1973A.gnm6.ann1.M1Qs.gene_models_main vigra.VC1973A.gnm6.ann1.Vradi01g06280 Vradi01g06280
vigra.VC1973A.gnm7.ann1.RWBG.gene_models_main vigra.VC1973A.gnm7.ann1.Vradi01g00000765 Vradi01g00000765
vigun.CB5-2.gnm1.ann1.0GKC.gene_models_main vigun.CB5-2.gnm1.ann1.VuCB5-2.01G046600 VuCB5-2.01G046600
vigun.IT97K-499-35.gnm1.ann1.zb5D.gene_models_main vigun.IT97K-499-35.gnm1.ann1.Vigun01g041700 Vigun01g041700
vigun.IT97K-499-35.gnm1.ann2.FD7K.gene_models_main vigun.IT97K-499-35.gnm1.ann2.Vigun01g053800 vigun.IT97K-499-35.Vigun01g053800
vigun.Sanzi.gnm1.ann1.HFH8.gene_models_main vigun.Sanzi.gnm1.ann1.VuSanzi.01G024700 VuSanzi.01G024700
vigun.Suvita2.gnm1.ann1.1PF6.gene_models_main vigun.Suvita2.gnm1.ann1.VuSuvita2.01G040100 VuSuvita2.01G040100
vigun.TZ30.gnm1.ann2.59NL.gene_models_main vigun.TZ30.gnm1.ann2.VuTZ30.01G049900 VuTZ30.01G049900
vigun.UCR779.gnm1.ann1.VF6G.gene_models_main vigun.UCR779.gnm1.ann1.VuUCR779.01G029300 VuUCR779.01G029300
vigun.Xiabao_II.gnm1.ann1.4JFL.gene_models_main vigun.Xiabao_II.gnm1.ann1.evm.TU.LG1.794 evm.TU.LG1.794
vigun.ZN016.gnm1.ann2.C7YV.gene_models_main vigun.ZN016.gnm1.ann2.VuZN016.01G051800 VuZN016.01G051800
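To make the listing above easier to summarize, here is a minimal sketch that sorts each ID/Name pair into a category. It assumes that every ID starts with exactly four dot-separated prefix fields (gensp.Strain.gnm.ann); the function names and category labels are made up for illustration and are not part of any existing tool.

```python
def strip_prefix(gene_id: str) -> str:
    """Drop the four leading gensp.Strain.gnm.ann fields from an ID."""
    parts = gene_id.split(".", 4)
    return parts[4] if len(parts) == 5 else gene_id


def classify(gene_id: str, name: str) -> str:
    """Label one ID/Name pair from the diagnostic listing."""
    if name == "MISSING":
        return "missing"          # no Name attribute on the gene record
    if name == gene_id:
        return "full_id"          # Name repeats the fully prefixed ID
    if name == strip_prefix(gene_id):
        return "unprefixed_id"    # Name is the ID minus the prefix fields
    return "other"                # partial prefix, symbol, or something else


# A few rows copied from the listing above.
rows = [
    ("aesev.CIAT22838.gnm1.ann1.Ae01g07470", "Ae01g07470"),
    ("faial.WAFC.gnm1.ann1.Faial112S01341", "MISSING"),
    ("glyma.Lee.gnm1.ann1.GlymaLee.01G069600", "glyma.Lee.gnm1.ann1.GlymaLee.01G069600"),
    ("cajca.ICPL87119.gnm1.ann1.C.cajan_04851", "cajca.C.cajan_04851"),
    ("glyma.Hwangkeum.gnm1.ann1.GmHk_01G000541", "exosc3_1"),
]
labels = [classify(gene_id, name) for gene_id, name in rows]
for (gene_id, name), label in zip(rows, labels):
    print(f"{label}\t{gene_id}\t{name}")
```

Anything labeled "missing", "full_id" or "other" would be a candidate for back-editing, depending on what the spec ends up requiring.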
I've added a section to the annotation spec. For your consideration ...
So just to clarify, you specify that Name must match the least-significant portion of the LIS ID. I'm pointing this out because it removes a way we can identify a gene with a non-redundant Name (like in the example of Ensembl names). I'm OK with it, but I want us all to recognize that we're losing some potential functionality with that requirement. In fact, that requirement creates a way to generate name from the ID which means having the Name field populated isn't adding any information. (I already strip the yuck off of ID and store that string as Gene.secondaryIdentifier, which will be identical to Gene.name with this spec, which means I'll just stop loading Gene.secondaryIdentifier if we implement this spec.)
"specify that Name must match the least-significant portion of the LIS ID"
No. The name is significant within the annotation. It excludes the prefix fields that we add to produce the ID.
"that requirement creates a way to generate name from the ID which means having the Name field populated isn't adding any information."
Yes.
"I'll just stop loading Gene.secondaryIdentifier"
I don't know how Gene.secondaryIdentifier is specified, but if it is always the same as Name, then sure. I could imagine useful secondaryIdentifiers, such as Refseq IDs or gene symbols; but maybe that's not how the field is used.
It's normally the same as this definition of Name, yes, but differs in that it is generated from ID whereas Name is explicitly listed and therefore could be something entirely different. Gene symbols would be loaded from the Symbol attribute. Refseq IDs probably would have their own attribute like Refseq_ID.
(By "least significant" above I simply meant the yuck-prefix-removed part in the positional sense, like least significant bit.)
I'm fine with all this, just wanted to elucidate the issues. Since browsers simply grab the Name attribute for display, it makes sense to do as you've specified.
If we're all set, I can run a job to parse ID and replace Name (which may contain dots) from gensp.Strain.gnm.ann.Name for those GFFs that don't already have Name as such.
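A rough sketch of what such a job could do per gene record, assuming the prefix is always the first four dot-separated fields of the ID and that column 9 is a plain semicolon-separated list of tag=value pairs. The function names strip_yuck and set_name are hypothetical, chosen only to mirror the wording in this thread.

```python
def strip_yuck(gene_id: str) -> str:
    """Return the ID without its gensp.Strain.gnm.ann prefix fields."""
    parts = gene_id.split(".", 4)
    if len(parts) < 5:
        raise ValueError(f"ID does not carry a four-field prefix: {gene_id}")
    # Everything after the fourth dot is kept intact, dots included.
    return parts[4]


def set_name(attributes: str) -> str:
    """Return a GFF3 column-9 string whose Name is derived from its ID."""
    pairs = [p for p in attributes.strip().split(";") if p]
    fields = dict(p.split("=", 1) for p in pairs)  # assumes tag=value pairs
    fields["Name"] = strip_yuck(fields["ID"])      # overwrite or append Name
    return ";".join(f"{tag}={value}" for tag, value in fields.items())


print(set_name("ID=glyma.Zh13.gnm1.ann1.SoyZH13_01G104400"))
print(set_name("ID=lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820;Name=foo;Note=x"))
```

Because the split is capped at four dots, a Name that itself contains dots (as in the lentil example) survives unchanged.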
It actually wasn't clear to me from the writeup that the proposal was to have Name = strip_yuck(ID), but sounds like that is indeed what @StevenCannon-USDA meant (and is also probably what is now most commonly done in our files).
Anyway, it sounds like we have a few possibilities for how to proceed:
Regarding the latter option, I note that in the case of the first example given in the proposal from @StevenCannon-USDA:
ID | Name | Comment |
---|---|---|
arahy.Tifrunner.gnm1.ann1.HA8THR | HA8THR | OK |
it was actually the case that the Name started life as arahy.HA8THR, and then I believe someone later zealously stripped the prefix in order to make Name match the unprefixed ID. We'd originally intended to have ID=arahy.Tifrunner.gnm1.ann1.HA8THR Name=arahy.HA8THR, under the theory that they didn't need to strictly match and we wanted to avoid the perceived redundancy in the full-yuck that we had gotten with aradu.V14167.gnm1.ann1.Aradu.KGT5H Aradu.KGT5H
And we now have some examples like: lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820 Lcu.2RBY.1g010820 where the full-yuck redundancy is even worse!
So under the "somewhat independent but recognizably related" proviso for arahy we could return to the original Name=arahy.HA8THR ID=arahy.Tifrunner.gnm1.ann1.HA8THR and in the case of lentil above we could allow: Name=Lcu.2RBY.1g010820 ID=lencu.CDC_Redberry.gnm2.ann1.1g010820 and maybe even go all the way to: Name=Glyma.05G026000 ID=glyma.Wm82.gnm4.ann1.05G026000 !! There's obviously some subjectivity to this way of doing things, though I think it gets relegated to how we'd formulate the non-yuck portion of the ID.
Not sure which of these options I'm actually advocating for, but I do want to avoid Name=GLYMA_05G026000 so we don't have that showing up in JBrowse2 (oh the indignity!) I'm sorry to prolong this thread, really I am. But it will be nice to get it firmly nailed down.
(I can also just set Gene.name to Gene.secondaryIdentifier in the production mines, back-updating them to adhere to this as well.)
I don't see the motivation for the examples you provide, @adf-ncgr, other than perhaps aesthetic. Is there an actual problem with long IDs? Does lencu.CDC_Redberry.gnm2.ann1.1g010820 solve a problem that lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820 creates? If so, you're now talking about how we create ID in addition to how we create Name (whereas up to now I thought ID was already well-prescribed but we didn't have a specification for Name).
FWIW, as for GLYMA_05G026000, I now store that as Gene.ensemblName, and it is generated by a method DatastoreUtils.getEnsemblName(name) which takes the GFF Name as an argument. In other words, I've got the secret sauce hardcoded in the lis-bio-sources package.
However, one could imagine adding a new attribute ensembl_name to the GFFs of genes that appear in Ensembl/Plant Reactome so it isn't only the mines that carry that connection.
primaryIdentifier | secondaryIdentifier | name | ensemblName |
---|---|---|---|
glyma.Wm82.gnm2.ann1.Glyma.05G026000 | Glyma.05G026000 | Glyma.05G026000 | GLYMA_05G026000 |
glyma.Wm82.gnm4.ann1.Glyma.05G026000 | Glyma.05G026000 | Glyma.05G026000 | GLYMA_05G026000 |
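For illustration only: the actual logic of DatastoreUtils.getEnsemblName is not shown in this thread. The sketch below merely reproduces the single name/ensemblName pair in the table above (uppercase, dots replaced by underscores) and may well be wrong for other species. The function name is hypothetical.

```python
def guess_ensembl_name(name: str) -> str:
    """Guess an Ensembl-style name from a GFF Name.

    Assumption: uppercase the Name and replace dots with underscores.
    This is inferred from one example only, not from lis-bio-sources.
    """
    return name.upper().replace(".", "_")


print(guess_ensembl_name("Glyma.05G026000"))
```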
@sammyjava we do have some cases in which sequenceserver BLAST libraries fail to build because the full yuck ID exceeds their character limit for sequence IDs (<=50 if I recall correctly). So that's a problem...
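A quick way to find the offenders would be a length check along these lines. The default limit of 50 is only the figure recalled above, not a confirmed sequenceserver setting, and overlong_ids is a hypothetical helper name.

```python
from typing import Iterable, List


def overlong_ids(ids: Iterable[str], limit: int = 50) -> List[str]:
    """Return the IDs whose length exceeds the given character limit."""
    return [seq_id for seq_id in ids if len(seq_id) > limit]


# Two gene IDs taken from this thread; transcript or protein IDs in the
# BLAST libraries would typically be longer than the gene IDs.
ids = [
    "lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820",
    "arahy.Tifrunner.gnm1.ann1.HA8THR",
]
for seq_id in ids:
    print(len(seq_id), seq_id)
print(overlong_ids(ids, limit=40))
```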
I had forgotten the history regarding arahy.Tifrunner.gnm1.ann1. (never mind that I was one of the responsible parties).
Following the principle from the spec that "Where available in the original annotations, the names should come from those annotation files, with the possible exception of stripping type identifiers (e.g. "gene:"), or shortening exceptionally cumbersome auto-generated strings or lengthy prefixes added in the original annotation form if those prefixes do not contribute to the uniqueness of the names within the annotation file. Such exceptions will need to be considered on a case-by-case basis."
I think the more proper solution would be - as I think you point out @adf-ncgr: ID=arahy.Tifrunner.gnm1.ann1.arahy.HA8THR Name=arahy.HA8THR
... and regarding my example of ID=lupal.Amiga.gnm1.ann1.gene:Lalb_Chr01g0003511, I would say that the ID is in error (or at least is unfortunate): I would strip "gene:" - which I think is the artifact of this group's prefixing of identifiers with GFF element type (which some people do, but I think is bad practice).
I agree. In general we should drop "this is a gene" identifier components like "gene:". Of course it's a gene! It's in a gene record!
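Stripping such a component could look roughly like this. The list of type prefixes is a guess (only "gene:" actually comes up in this thread), and drop_type_prefix is a hypothetical name.

```python
import re

# Hypothetical list of GFF element-type prefixes; only "gene:" is from the thread.
TYPE_PREFIX = re.compile(r"^(gene|mRNA|transcript|CDS|exon):", re.IGNORECASE)


def drop_type_prefix(gene_id: str) -> str:
    """Remove a leading type prefix from the non-prefix part of an ID."""
    parts = gene_id.split(".", 4)
    if len(parts) == 5:
        # Fully prefixed ID: clean only the part after gensp.Strain.gnm.ann.
        parts[4] = TYPE_PREFIX.sub("", parts[4])
        return ".".join(parts)
    # Otherwise treat the whole string as the original, unprefixed identifier.
    return TYPE_PREFIX.sub("", gene_id)


print(drop_type_prefix("lupal.Amiga.gnm1.ann1.gene:Lalb_Chr01g0003511"))
```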
I'm fine with dropping "gene" (for aesthetic reasons!) although:
I do think we need to come to an agreement about the "excessively long ID" issue, for which I can track down the specific example if needed (probably lentil, though not 100% positive)
Yeah, this is probably a bigger issue involving more consistent use of the ID field as well as the Name field. In fact, if Name is simply a substring of ID, @adf-ncgr has kicked the can up to more tightly specifying ID. Which is a good thing IMO.
This may or may not be considered relevant, but I'll note that in a proposal associated with new TAIR-managed/NCBI-generated/community-curated annotations for Arabidopsis thaliana they refer to the AgBioData genome nomenclature working group recommendation, which is fairly similar to our full-yuck scheme.
BUT, they also seem to deviate from the scheme of slavish prefixing (ie the base of the identifier is slightly modified between "primary" and "secondary" identifiers), although they are not exactly using the language of GFF in their description:
Locus Names:
- Primary identifier: AGI id (At1g01010 or At1TE01010)
- Secondary identifier:
  - ddAraThal.Col-0.Col-CC2.1.1G000001 for protein-coding and ncRNA genes
    - Number before the G is the chromosome (1-5, M/C)
    - Six digit zero padded number increments from the top of the chromosome to the bottom of the chromosome
  - ddAraThal.Col-0.Col-CC2.1.1TE000001 for transposable elements
    - Number before the TE is the chromosome (1-5, M/C)
    - Six digit zero padded number increments from the top of the chromosome to the bottom of the chromosome
- Additional name/s: Symbol/s + full name/s
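To make the quoted secondary-identifier pattern concrete, here is a small sketch that assembles such a string from its parts. The function and parameter names are made up; the prefix is simply copied from the examples in the quote.

```python
def tair_style_id(chrom: str, serial: int, kind: str = "G",
                  prefix: str = "ddAraThal.Col-0.Col-CC2.1") -> str:
    """Build a secondary identifier following the quoted pattern.

    chrom  : chromosome label (1-5, M or C in the quoted proposal)
    serial : position-ordered counter, zero padded to six digits
    kind   : "G" for protein-coding/ncRNA genes, "TE" for transposable elements
    """
    if kind not in ("G", "TE"):
        raise ValueError(f"unexpected feature kind: {kind}")
    return f"{prefix}.{chrom}{kind}{serial:06d}"


print(tair_style_id("1", 1))
print(tair_style_id("1", 1, kind="TE"))
```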
Needless to say, it seems like their "secondary identifier" would be playing the role of what we've been calling ID/primary identifier and their "primary identifier" is like our use of "Name". Also note that I'm not really advocating to go from gensp to "ToLID" (= Tree of Life ID) even if "drGlyMaxx" is pretty funky! (unclear if it's fully official but that's what a search at https://id.tol.sanger.ac.uk/search turns up).
Of course, we are (probably) not in a position to fully specify "primary identifiers" (aka our Names) to the extent TAIR is, unless we want to break with our own tradition and just ignore the original group's identification scheme (which in some cases might be considered a kindness). But I think the original intent of the featid_map file (e.g. glyma.JD17.gnm1.ann1.CLFP.featid_map.tsv.gz) was to allow some freedom in this regard (ie not merely to make explicit the addition of prefixes).
This task arises out of https://github.com/legumeinfo/datastore-issues/issues/178 which seems to have generated enough discussion to warrant specifying the Name attribute for genes in our annotation GFFs.
Suggested (by Sam) requirements:
I'll assign this to @StevenCannon-USDA since he seems to be the closest thing to the originator of Gene Name Policy in our existing annotation collections. Feel free to update the above requirements, of course, those are my input.