Specify how gene Name attributes are given

sammyjava commented 9 months ago

This task arises out of https://github.com/legumeinfo/datastore-issues/issues/178 which seems to have generated enough discussion to warrant specifying the Name attribute for genes in our annotation GFFs.

Suggested (by Sam) requirements:

consistency: a given gene in two different assembly/annotation versions has the same name
usability: a gene name relates an LIS gene record to a useful non-LIS data repository, when possible
simplicity: a gene name should be minimally unique within a GFF: i.e. not contain prefixes common to all genes in the annotation (unless forced to for some other reason)

I'll assign this to @StevenCannon-USDA since he seems to be the closest thing to the originator of Gene Name Policy in our existing annotation collections. Feel free to update the above requirements, of course; those are just my input.

sammyjava commented 9 months ago

And I'll assign myself to back-editing the existing annotation collections to follow the specification established from this task. The goal will be to have the updated annotations ready for the mine 5.1.0.4 load in the future.

StevenCannon-USDA commented 9 months ago

Starting with diagnosis. Here's what we have now:

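  # For each main gene-model GFF, report the ID and Name of one sample gene
  # (the last gene record among the first 10000 non-comment lines).
  for filepath in /usr/local/www/data/v2/*/*/annotations/*/*gene_models_main.gff3.gz ; do
    base=$(basename "$filepath" .gff3.gz)
    # Drop comment lines only (lines starting with "#").
    zcat "$filepath" | grep -v "^#" | head -10000 |
      awk -v BN="$base" '$3~/gene/ {print BN "\t" $9}' | tail -1 |
      perl -lane '$base=$F[0]; @attrs=split(";", $F[1]);
                  # Anchor the patterns so e.g. "Parent_ID=" is not mistaken for "ID=".
                  @id=grep(/^ID=/, @attrs);
                  @name=grep(/^Name=/, @attrs);
                  unless (defined $name[0] ){$name[0]="MISSING"};
                  $id[0] =~ s/^ID=//;
                  $name[0] =~ s/^Name=//;
                  print join("\t", $base, $id[0], $name[0]);
                 '
  done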

aesev.CIAT22838.gnm1.ann1.ZM3R.gene_models_main aesev.CIAT22838.gnm1.ann1.Ae01g07470    Ae01g07470
aradu.V14167.gnm1.ann1.cxSM.gene_models_main    aradu.V14167.gnm1.ann1.Aradu.KGT5H  Aradu.KGT5H
arahy.BaileyII.gnm1.ann1.PQM7.gene_models_main  arahy.BaileyII.gnm1.ann1.mikado.chr01G571   mikado.chr01G571
arahy.Tifrunner.gnm1.ann1.CCJH.gene_models_main arahy.Tifrunner.gnm1.ann1.HA8THR    HA8THR
arahy.Tifrunner.gnm2.ann1.4K0L.gene_models_main arahy.Tifrunner.gnm2.ann1.HA8THR    HA8THR
arahy.Tifrunner.gnm2.ann2.PVFB.gene_models_main arahy.Tifrunner.gnm2.ann2.Ah01g088800   Ah01g088800
araip.K30076.gnm1.ann1.J37m.gene_models_main    araip.K30076.gnm1.ann1.Araip.L423N  Araip.L423N
cajca.ICPL87119.gnm1.ann1.Y27M.gene_models_main cajca.ICPL87119.gnm1.ann1.C.cajan_04851 cajca.C.cajan_04851
cajca.ICPL87119.gnm2.ann1.L3ZH.gene_models_main cajca.ICPL87119.gnm2.ann1.Cc_00501  Cc_00501
cerca.ISC453364.gnm3.ann1.3N1M.gene_models_main cerca.ISC453364.gnm3.ann1.Cecan.1G059700    Cecan.1G059700
cicar.CDCFrontier.gnm1.ann1.nRhs.gene_models_main   cicar.CDCFrontier.gnm1.ann1.Ca_02646    cicar.CDCFrontier.Ca_02646
cicar.CDCFrontier.gnm2.ann1.9M1L.gene_models_main   cicar.CDCFrontier.gnm2.ann1.Ca_00491    Ca_00491
cicar.CDCFrontier.gnm3.ann1.NPD7.gene_models_main   cicar.CDCFrontier.gnm3.ann1.Ca1g082300  Ca1g082300
cicar.ICC4958.gnm2.ann1.LCVX.gene_models_main   cicar.ICC4958.gnm2.ann1.Ca_00646    cicar.ICC4958.Ca_00646
cicec.S2Drd065.gnm1.ann1.YZ9H.gene_models_main  cicec.S2Drd065.gnm1.ann1.Ce0g133700 Ce0g133700
cicre.Besev079.gnm1.ann1.F01Z.gene_models_main  cicre.Besev079.gnm1.ann1.Cr1g085100 Cr1g085100
faial.WAFC.gnm1.ann1.RTP9.gene_models_main  faial.WAFC.gnm1.ann1.Faial112S01341 MISSING
glycy.G1267.gnm1.ann1.HRFD.gene_models_main glycy.G1267.gnm1.ann1.Gcy1g000849   Gcy1g000849
glyd3.G1403.gnm1.ann1.XNZQ.gene_models_main glyd3.G1403.gnm1.ann1.Gto1g000840   Gto1g000840
glydo.G1134.gnm1.ann1.4BJM.gene_models_main glydo.G1134.gnm1.ann1.Gtt1g000960   Gtt1g000960
glyfa.G1718.gnm1.ann1.2KSV.gene_models_main glyfa.G1718.gnm1.ann1.Gfa1g000870   Gfa1g000870
glyma.58-161.gnm1.ann1.HJ1K.gene_models_main    glyma.58-161.gnm1.ann1.SoyL04_01G042100 MISSING
glyma.Amsoy.gnm1.ann1.6S5P.gene_models_main glyma.Amsoy.gnm1.ann1.SoyC05_01G042200  MISSING
glyma.DongNongNo_50.gnm1.ann1.QSDB.gene_models_main glyma.DongNongNo_50.gnm1.ann1.SoyC12_01G042400  MISSING
glyma.FengDiHuang.gnm1.ann1.P6HL.gene_models_main   glyma.FengDiHuang.gnm1.ann1.SoyL07_01G041300    MISSING
glyma.FiskebyIII.gnm1.ann1.SS25.gene_models_main    glyma.FiskebyIII.gnm1.ann1.GlymaFiskIII.01G052300   GlymaFiskIII.01G052300
glyma.HanDouNo_5.gnm1.ann1.ZS7M.gene_models_main    glyma.HanDouNo_5.gnm1.ann1.SoyC09_01G038500 MISSING
glyma.Hefeng25_IGA1002.gnm1.ann1.320V.gene_models_main  glyma.Hefeng25_IGA1002.gnm1.ann1.SoyHF25_01R004633  SoyHF25_01R004633
glyma.HeiHeNo_43.gnm1.ann1.PDXG.gene_models_main    glyma.HeiHeNo_43.gnm1.ann1.SoyC13_01G041300 MISSING
glyma.Huaxia3_IGA1007.gnm1.ann1.LKC7.gene_models_main   glyma.Huaxia3_IGA1007.gnm1.ann1.SoyHX3_01G113000    SoyHX3_01G113000
glyma.Hwangkeum.gnm1.ann1.1G4F.gene_models_main glyma.Hwangkeum.gnm1.ann1.GmHk_01G000541    exosc3_1
glyma.JD17.gnm1.ann1.CLFP.gene_models_main  glyma.JD17.gnm1.ann1.JD001G0045100  JD001G0045100
glyma.JiDouNo_17.gnm1.ann1.X5PX.gene_models_main    glyma.JiDouNo_17.gnm1.ann1.SoyC11_01G038200 MISSING
glyma.JinDouNo_23.gnm1.ann1.SGJW.gene_models_main   glyma.JinDouNo_23.gnm1.ann1.SoyC07_01G039100    MISSING
glyma.Jinyuan_IGA1006.gnm1.ann1.2NNX.gene_models_main   glyma.Jinyuan_IGA1006.gnm1.ann1.SoyJY_01G119400 SoyJY_01G119400
glyma.JuXuanNo_23.gnm1.ann1.H8PW.gene_models_main   glyma.JuXuanNo_23.gnm1.ann1.SoyC03_01G041000    MISSING
glyma.KeShanNo_1.gnm1.ann1.2YX4.gene_models_main    glyma.KeShanNo_1.gnm1.ann1.SoyC14_01G040900 MISSING
glyma.Lee.gnm1.ann1.6NZV.gene_models_main   glyma.Lee.gnm1.ann1.GlymaLee.01G069600  glyma.Lee.gnm1.ann1.GlymaLee.01G069600
glyma.Lee.gnm2.ann1.1FNT.gene_models_main   glyma.Lee.gnm2.ann1.Gm_00676    Gm_00676
glyma.PI_398296.gnm1.ann1.B0XR.gene_models_main glyma.PI_398296.gnm1.ann1.SoyL05_01G037500  MISSING
glyma.PI_548362.gnm1.ann1.LL84.gene_models_main glyma.PI_548362.gnm1.ann1.SoyC10_01G038700  MISSING
glyma.QiHuangNo_34.gnm1.ann1.WHRV.gene_models_main  glyma.QiHuangNo_34.gnm1.ann1.SoyC08_01G039600   MISSING
glyma.ShiShengChangYe.gnm1.ann1.VLGS.gene_models_main   glyma.ShiShengChangYe.gnm1.ann1.SoyL09_01G041900    MISSING
glyma.TieFengNo_18.gnm1.ann1.7GR4.gene_models_main  glyma.TieFengNo_18.gnm1.ann1.SoyC02_01G036400   MISSING
glyma.TieJiaSiLiHuang.gnm1.ann1.W70Z.gene_models_main   glyma.TieJiaSiLiHuang.gnm1.ann1.SoyL08_01G040100    MISSING
glyma.TongShanTianEDan.gnm1.ann1.56XW.gene_models_main  glyma.TongShanTianEDan.gnm1.ann1.SoyL03_01G037700   MISSING
glyma.WanDouNo_28.gnm1.ann1.NLYP.gene_models_main   glyma.WanDouNo_28.gnm1.ann1.SoyC04_01G043200    MISSING
glyma.Wenfeng7_IGA1001.gnm1.ann1.ZK5W.gene_models_main  glyma.Wenfeng7_IGA1001.gnm1.ann1.SoyWF7_01R004450   SoyWF7_01R004450
glyma.Wm82_IGA1008.gnm1.ann1.FGN6.gene_models_main  glyma.Wm82_IGA1008.gnm1.ann1.SoyW82_01G112700   SoyW82_01G112700
glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main    glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G058800    glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G058800
glyma.Wm82.gnm1.ann1.DvBy.gene_models_main  glyma.Wm82.gnm1.ann1.Glyma01g21510  Glyma01g21510
glyma.Wm82.gnm2.ann1.RVB6.gene_models_main  glyma.Wm82.gnm2.ann1.Glyma.01G067400    Glyma.01G067400
glyma.Wm82.gnm4.ann1.T8TQ.gene_models_main  glyma.Wm82.gnm4.ann1.Glyma.01G069300    Glyma.01G069300
glyma.XuDouNo_1.gnm1.ann1.G2T7.gene_models_main glyma.XuDouNo_1.gnm1.ann1.SoyC01_01G041100  MISSING
glyma.YuDouNo_22.gnm1.ann1.HCQ1.gene_models_main    glyma.YuDouNo_22.gnm1.ann1.SoyC06_01G040300 MISSING
glyma.Zh13_IGA1005.gnm1.ann1.87Z5.gene_models_main  glyma.Zh13_IGA1005.gnm1.ann1.SoyZH13_01R004482  SoyZH13_01R004482
glyma.Zh13.gnm1.ann1.8VV3.gene_models_main  glyma.Zh13.gnm1.ann1.SoyZH13_01G104400  MISSING
glyma.Zh13.gnm2.ann1.FJ3G.gene_models_main  glyma.Zh13.gnm2.ann1.SoyZH13_01G052900  SoyZH13_01G052900
glyma.Zh35_IGA1004.gnm1.ann1.RGN6.gene_models_main  glyma.Zh35_IGA1004.gnm1.ann1.SoyZH35_01G113100  SoyZH35_01G113100
glyma.ZhangChunManCangJin.gnm1.ann1.7HPB.gene_models_main   glyma.ZhangChunManCangJin.gnm1.ann1.SoyL06_01G039700    MISSING
glyma.Zhutwinning2.gnm1.ann1.ZTTQ.gene_models_main  glyma.Zhutwinning2.gnm1.ann1.SoyL01_01G044300   MISSING
glyma.ZiHuaNo_4.gnm1.ann1.FCFQ.gene_models_main glyma.ZiHuaNo_4.gnm1.ann1.SoyL02_01G037100  MISSING
glyso.F_IGA1003.gnm1.ann1.G61B.gene_models_main glyso.F_IGA1003.gnm1.ann1.SoyGsojaF_01R004562   SoyGsojaF_01R004562
glyso.PI_549046.gnm1.ann1.65KD.gene_models_main glyso.PI_549046.gnm1.ann1.SoyW02_01G040000  MISSING
glyso.PI_562565.gnm1.ann1.1JD2.gene_models_main glyso.PI_562565.gnm1.ann1.SoyW01_01G042900  MISSING
glyso.PI_578357.gnm1.ann1.0ZKP.gene_models_main glyso.PI_578357.gnm1.ann1.SoyW03_01G041400  MISSING
glyso.PI483463.gnm1.ann1.3Q3Q.gene_models_main  glyso.PI483463.gnm1.ann1.GlysoPI483463.01G083300    glyso.PI483463.gnm1.ann1.GlysoPI483463.01G083300
glyso.W05.gnm1.ann1.T47J.gene_models_main   glyso.W05.gnm1.ann1.Glysoja.01G000660   glyso.W05.gnm1.ann1.Glysoja.01G000660
glyst.G1974.gnm1.ann1.F257.gene_models_main glyst.G1974.gnm1.ann1.Gst1g000848   Gst1g000848
glysy.G1300.gnm1.ann1.RRK6.gene_models_main glysy.G1300.gnm1.ann1.Gsy1g000816   Gsy1g000816
labpu.Highworth.gnm1.ann1.HJ3B.gene_models_main labpu.Highworth.gnm1.ann1.Labpu01g010580    Labpu01g010580
lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main  lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820  Lcu.2RBY.1g010820
lener.IG_72815.gnm1.ann1.R90F.gene_models_main  lener.IG_72815.gnm1.ann1.Ler.1DRT.1g009870  Ler.1DRT.1g009870
lotja.MG20.gnm3.ann1.WF9B.gene_models_main  lotja.MG20.gnm3.ann1.Lj0g3v0018699  Lj0g3v0018699
lupal.Amiga.gnm1.ann1.3GKS.gene_models_main lupal.Amiga.gnm1.ann1.gene:Lalb_Chr01g0003511   lupal.Lalb_Chr01g0003511
lupan.Tanjil.gnm1.ann1.nnV9.gene_models_main    lupan.Tanjil.gnm1.ann1.Lup031547    lupan.Lup031547
medsa.XinJiangDaYe.gnm1.ann1.RKB9.gene_models_main  medsa.XinJiangDaYe.gnm1.ann1.MS.gene57458   MS.gene57458
medtr.A17_HM341.gnm4.ann2.G3ZY.gene_models_main medtr.A17_HM341.gnm4.ann2.Medtr1g017870 MISSING
medtr.A17.gnm5.ann1_6.L2RX.gene_models_main medtr.A17.gnm5.ann1_6.MtrunA17Chr1g0152141  MtrunA17Chr1g0152141
medtr.HM004.gnm1.ann1.2XTB.gene_models_main medtr.HM004.gnm1.ann1.g876  HM004.g876
medtr.HM010.gnm1.ann1.WV9J.gene_models_main medtr.HM010.gnm1.ann1.g889  HM010.g889
medtr.HM022.gnm1.ann1.6C8N.gene_models_main medtr.HM022.gnm1.ann1.g842  HM022.g842
medtr.HM023.gnm1.ann1.WZN8.gene_models_main medtr.HM023.gnm1.ann1.g8172 HM023.g8172
medtr.HM034.gnm1.ann1.YR6S.gene_models_main medtr.HM034.gnm1.ann1.g906  HM034.g906
medtr.HM050.gnm1.ann1.GWRX.gene_models_main medtr.HM050.gnm1.ann1.g943  HM050.g943
medtr.HM056.gnm1.ann1.CHP6.gene_models_main medtr.HM056.gnm1.ann1.g854  medtr.HM056.g854
medtr.HM058.gnm1.ann1.LXPZ.gene_models_main medtr.HM058.gnm1.ann1.g4535 medtr.HM058.g4535
medtr.HM060.gnm1.ann1.H41P.gene_models_main medtr.HM060.gnm1.ann1.g4091 medtr.HM060.g4091
medtr.HM095.gnm1.ann1.55W4.gene_models_main medtr.HM095.gnm1.ann1.g4212 medtr.HM095.g4212
medtr.HM125.gnm1.ann1.KY5W.gene_models_main medtr.HM125.gnm1.ann1.g894  medtr.HM125.g894
medtr.HM129.gnm1.ann1.7FTD.gene_models_main medtr.HM129.gnm1.ann1.g857  medtr.HM129.g857
medtr.HM185.gnm1.ann1.GB3D.gene_models_main medtr.HM185.gnm1.ann1.g843  medtr.HM185.g843
medtr.HM324.gnm1.ann1.SQH2.gene_models_main medtr.HM324.gnm1.ann1.g3873 medtr.HM324.g3873
medtr.R108_HM340.gnm1.ann1.85YW.gene_models_main    medtr.R108_HM340.gnm1.ann1.BZG31_000s010470 BZG31_000s010470
medtr.R108.gnmHiC_1.ann1.Y8NH.gene_models_main  medtr.R108.gnmHiC_1.ann1.MtrunR108HiC_000814    MtrunR108HiC000814
phaac.Frijol_Bayo.gnm1.ann1.ML22.gene_models_main   phaac.Frijol_Bayo.gnm1.ann1.Phacu.CVR.001G056200    Phacu.CVR.001G056200
phaac.W6_15578.gnm2.ann1.LVZ1.gene_models_main  phaac.W6_15578.gnm2.ann1.Phacu.WLD.001G078700   Phacu.WLD.001G078700
phalu.G27455.gnm1.ann1.JD7C.gene_models_main    phalu.G27455.gnm1.ann1.Pl01G0000102600  Pl01G0000102600
phavu.5-593.gnm1.ann1.3FBJ.gene_models_main phavu.5-593.gnm1.ann1.Pv5-593.01G066100 Pv5-593.01G066100
phavu.G19833.gnm1.ann1.pScz.gene_models_main    phavu.G19833.gnm1.ann1.Phvul.001G088500 Phvul.001G088500
phavu.G19833.gnm2.ann1.PB8d.gene_models_main    phavu.G19833.gnm2.ann1.Phvul.001G072100 Phvul.001G072100
phavu.LaborOvalle.gnm1.ann1.L1DY.gene_models_main   phavu.LaborOvalle.gnm1.ann1.PvLabOv.01G061600   PvLabOv.01G061600
phavu.UI111.gnm1.ann1.8L4N.gene_models_main phavu.UI111.gnm1.ann1.PvUI111.01G074900 PvUI111.01G074900
pissa.Cameor.gnm1.ann1.7SZR.gene_models_main    pissa.Cameor.gnm1.ann1.Psat1g022840 MISSING
tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main   tripr.MilvusB.gnm2.ann1.gene6109    tripr.gene6109
trisu.Daliak.gnm2.ann1.MFKF.gene_models_main    trisu.Daliak.gnm2.ann1.Ts_00473 Ts_00473
vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main    vicfa.Hedin2.gnm1.ann1.1g243680 MISSING
vicfa.Tiffany.gnm1.ann1.Y54X.gene_models_main   vicfa.Tiffany.gnm1.ann1.1g192680    MISSING
vigan.Gyeongwon.gnm3.ann1.3Nz5.gene_models_main vigan.Gyeongwon.gnm3.ann1.Vang0002ss00040   Vang0002ss00040
vigan.Shumari.gnm1.ann1.8BRS.gene_models_main   vigan.Shumari.gnm1.ann1.Vigan.01G113600 Vigan.01G113600
vigra.VC1973A.gnm6.ann1.M1Qs.gene_models_main   vigra.VC1973A.gnm6.ann1.Vradi01g06280   Vradi01g06280
vigra.VC1973A.gnm7.ann1.RWBG.gene_models_main   vigra.VC1973A.gnm7.ann1.Vradi01g00000765    Vradi01g00000765
vigun.CB5-2.gnm1.ann1.0GKC.gene_models_main vigun.CB5-2.gnm1.ann1.VuCB5-2.01G046600 VuCB5-2.01G046600
vigun.IT97K-499-35.gnm1.ann1.zb5D.gene_models_main  vigun.IT97K-499-35.gnm1.ann1.Vigun01g041700 Vigun01g041700
vigun.IT97K-499-35.gnm1.ann2.FD7K.gene_models_main  vigun.IT97K-499-35.gnm1.ann2.Vigun01g053800 vigun.IT97K-499-35.Vigun01g053800
vigun.Sanzi.gnm1.ann1.HFH8.gene_models_main vigun.Sanzi.gnm1.ann1.VuSanzi.01G024700 VuSanzi.01G024700
vigun.Suvita2.gnm1.ann1.1PF6.gene_models_main   vigun.Suvita2.gnm1.ann1.VuSuvita2.01G040100 VuSuvita2.01G040100
vigun.TZ30.gnm1.ann2.59NL.gene_models_main  vigun.TZ30.gnm1.ann2.VuTZ30.01G049900   VuTZ30.01G049900
vigun.UCR779.gnm1.ann1.VF6G.gene_models_main    vigun.UCR779.gnm1.ann1.VuUCR779.01G029300   VuUCR779.01G029300
vigun.Xiabao_II.gnm1.ann1.4JFL.gene_models_main vigun.Xiabao_II.gnm1.ann1.evm.TU.LG1.794    evm.TU.LG1.794
vigun.ZN016.gnm1.ann2.C7YV.gene_models_main vigun.ZN016.gnm1.ann2.VuZN016.01G051800 VuZN016.01G051800
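
To summarize a listing like the one above, here is a minimal Python sketch that sorts each ID/Name pair into a few categories. The category names and the function `classify_name` are hypothetical, and the sketch assumes every ID has the form gensp.Strain.gnmN.annN.rest, where the rest may itself contain dots.

```python
from collections import Counter


def classify_name(gene_id: str, name: str) -> str:
    """Classify how a Name relates to its full LIS ID (hypothetical categories)."""
    # Assumption: the first four dot-separated fields are the LIS prefix
    # (gensp.Strain.gnmN.annN); everything after them is the original identifier.
    parts = gene_id.split(".", 4)
    stripped = parts[4] if len(parts) == 5 else gene_id
    if name == "MISSING":
        return "missing"
    if name == gene_id:
        return "full_id"
    if name == stripped:
        return "stripped_id"
    return "other"


# A few rows copied from the listing above.
rows = [
    ("arahy.Tifrunner.gnm1.ann1.HA8THR", "HA8THR"),
    ("glyma.Lee.gnm1.ann1.GlymaLee.01G069600", "glyma.Lee.gnm1.ann1.GlymaLee.01G069600"),
    ("cicar.CDCFrontier.gnm1.ann1.Ca_02646", "cicar.CDCFrontier.Ca_02646"),
    ("pissa.Cameor.gnm1.ann1.Psat1g022840", "MISSING"),
    ("lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820", "Lcu.2RBY.1g010820"),
]

counts = Counter(classify_name(gene_id, name) for gene_id, name in rows)
for category, n in sorted(counts.items()):
    print(category, n)
```

Run over the full listing, a tally like this would show how many collections already have Name equal to the prefix-stripped ID and how many would need editing.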

StevenCannon-USDA commented 9 months ago

I've added a section to the annotation spec. For your consideration ...

sammyjava commented 9 months ago

So just to clarify, you specify that Name must match the least-significant portion of the LIS ID. I'm pointing this out because it removes a way we can identify a gene with a non-redundant Name (like in the example of Ensembl names). I'm OK with it, but I want us all to recognize that we're losing some potential functionality with that requirement. In fact, that requirement creates a way to generate name from the ID which means having the Name field populated isn't adding any information. (I already strip the yuck off of ID and store that string as Gene.secondaryIdentifier, which will be identical to Gene.name with this spec, which means I'll just stop loading Gene.secondaryIdentifier if we implement this spec.)

StevenCannon-USDA commented 9 months ago

"specify that Name must match the least-significant portion of the LIS ID"

No. The name is significant within the annotation. It excludes the prefix fields that we add to produce the ID.

"that requirement creates a way to generate name from the ID which means having the Name field populated isn't adding any information."

Yes.

"I'll just stop loading Gene.secondaryIdentifier"

I don't know how Gene.secondaryIdentifier is specified, but if it is always the same as Name, then sure. I could imagine useful secondaryIdentifiers, such as Refseq IDs or gene symbols; but maybe that's not how the field is used.

sammyjava commented 9 months ago

It's normally the same as this definition of Name, yes, but differs in that it is generated from ID whereas Name is explicitly listed and therefore could be something entirely different. Gene symbols would be loaded from the Symbol attribute. Refseq IDs probably would have their own attribute like Refseq_ID.

(By "least significant" above I simply meant the yuck-prefix-removed part in the positional sense, like least significant bit.)

I'm fine with all this, just wanted to elucidate the issues. Since browsers simply grab the Name attribute for display, it makes sense to do as you've specified.

If we're all set, I can run a job to parse ID and replace Name (which may contain dots) from gensp.Strain.gnm.ann.Name for those GFFs that don't already have Name as such.
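
A rough Python sketch of what such a job could do to a single GFF attributes column follows. It is not the actual job; the function names `name_from_id` and `set_name` are hypothetical, and it assumes the ID always begins with exactly four dot-separated prefix fields (gensp.Strain.gnm.ann).

```python
def name_from_id(gene_id: str) -> str:
    """Return the part of an LIS ID after the gensp.Strain.gnm.ann prefix."""
    # maxsplit=4 keeps any dots inside the original name intact.
    parts = gene_id.split(".", 4)
    if len(parts) < 5:
        raise ValueError(f"ID does not have four prefix fields: {gene_id}")
    return parts[4]


def set_name(attributes: str) -> str:
    """Replace (or add) Name in a GFF3 attributes string, derived from ID."""
    fields = [f for f in attributes.split(";") if f]
    gene_id = next(f[len("ID="):] for f in fields if f.startswith("ID="))
    # Drop any existing Name, then re-insert the derived one right after ID.
    kept = [f for f in fields if not f.startswith("Name=")]
    out = []
    for f in kept:
        out.append(f)
        if f.startswith("ID="):
            out.append("Name=" + name_from_id(gene_id))
    return ";".join(out)


print(set_name("ID=pissa.Cameor.gnm1.ann1.Psat1g022840"))
print(set_name("ID=glyma.Wm82.gnm4.ann1.Glyma.01G069300;Name=something_else"))
```

Splitting with a maximum of four splits is what lets names such as Glyma.01G069300 or Lcu.2RBY.1g010820 survive with their own dots.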

adf-ncgr commented 9 months ago

It actually wasn't clear to me from the writeup that the proposal was to have Name = strip_yuck(ID), but sounds like that is indeed what @StevenCannon-USDA meant (and is also probably what is now most commonly done in our files).

Anyway, it sounds like we have a few possibilities for how to proceed:

require both ID and Name be made explicit and require that ID is always full yuck + Name (@StevenCannon-USDA proposal)
don't use Name at all (e.g. some applications like GCV ignore them and just display ID anyway; JBrowse2 would do this if Names weren't present; BLAST will use what's in the FASTA header, which is ID)
allow independence of ID and Name (possibly adding a proviso that they should be "recognizably related" to one another)

Regarding the latter option, I note that in the case of the first example given in the proposal from @StevenCannon-USDA:

ID	Name	Comment
arahy.Tifrunner.gnm1.ann1.HA8THR	HA8THR	OK

it was actually the case that the Name started life out as arahy.HA8THR and then I believe someone later zealously stripped it off in order to make Name match unprefixed ID. We'd originally intended to have ID=arahy.Tifrunner.gnm1.ann1.HA8THR Name=arahy.HA8THR under the theory that they didn't need to strictly match and we wanted to avoid perceived redundancy in the full-yuck as we had gotten with aradu.V14167.gnm1.ann1.Aradu.KGT5H Aradu.KGT5H

And we now have some examples like: lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820 Lcu.2RBY.1g010820 where the full-yuck redundancy is even worse!

So under the "somewhat independent but recognizably related" proviso for arahy we could return to the original Name=arahy.HA8THR ID=arahy.Tifrunner.gnm1.ann1.HA8THR and in the case of lentil above we could allow: Name=Lcu.2RBY.1g010820 ID=lencu.CDC_Redberry.gnm2.ann1.1g010820 and maybe even go all the way to: Name=Glyma.05G026000 ID=glyma.Wm82.gnm4.ann1.05G026000 !! There's obviously some subjectivity to this way of doing things, though I think it gets relegated to how we'd formulate the non-yuck portion of the ID.

Not sure which of these options I'm actually advocating for, but I do want to avoid Name=GLYMA_05G026000 so we don't have that showing up in JBrowse2 (oh the indignity!). I'm sorry to prolong this thread, really I am. But it will be nice to get it firmly nailed down.

sammyjava commented 9 months ago

(I can also just set Gene.name to Gene.secondaryIdentifier in the production mines, back-updating them to adhere to this as well.)

sammyjava commented 9 months ago

I don't see the motivation for the examples you provide, @adf-ncgr, other than perhaps an aesthetic one. Is there an actual problem with long IDs? Does lencu.CDC_Redberry.gnm2.ann1.1g010820 solve a problem that lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820 creates? If so, you're now talking about how we create ID in addition to how we create Name (whereas up to now I thought ID was already well-prescribed but we didn't have a specification for Name).

sammyjava commented 9 months ago

FWIW, as for GLYMA_05G026000 I now store that as Gene.ensemblName and it is generated by a method DatastoreUtils.getEnsemblName(name) which takes the GFF Name as an argument. In other words, I've got the secret sauce hardcoded in the lis-bio-sources package.

However, one could imagine adding a new attribute ensembl_name to the GFFs of genes that appear in Ensembl/Plant Reactome so it isn't only the mines that carry that connection.

primaryIdentifier	secondaryIdentifier	name	ensemblName
glyma.Wm82.gnm2.ann1.Glyma.05G026000	Glyma.05G026000	Glyma.05G026000	GLYMA_05G026000
glyma.Wm82.gnm4.ann1.Glyma.05G026000	Glyma.05G026000	Glyma.05G026000	GLYMA_05G026000
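
The thread does not show the logic inside DatastoreUtils.getEnsemblName. Purely as an illustration, here is a Python guess that reproduces the single pair shown in the table above (Glyma.05G026000 to GLYMA_05G026000). The function `guess_ensembl_name` is hypothetical and may not match the real method for other species or naming schemes.

```python
def guess_ensembl_name(name: str) -> str:
    """Guess an Ensembl-style name from a GFF Name.

    Assumption: upper-case the name and replace dots with underscores.
    This rule is inferred from one example only and is not the real
    DatastoreUtils.getEnsemblName implementation.
    """
    return name.upper().replace(".", "_")


print(guess_ensembl_name("Glyma.05G026000"))
```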

adf-ncgr commented 9 months ago

@sammyjava we do have some cases in which sequenceserver BLAST libraries fail to build because the full yuck ID exceeds their character limit for sequence IDs (<=50 if I recall correctly). So that's a problem...
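
A quick way to look for such cases is to scan the IDs for anything over a length threshold. The Python sketch below takes the limit as a parameter, since the value of 50 is recalled from memory above rather than confirmed; the function name `overlong_ids` is hypothetical.

```python
def overlong_ids(ids, max_len=50):
    """Return the IDs whose length exceeds max_len.

    Assumption: max_len=50 is the recalled, unconfirmed limit; pass the
    real limit once it has been checked.
    """
    return [i for i in ids if len(i) > max_len]


# Example IDs taken from this thread; print each length for inspection.
sample = [
    "lencu.CDC_Redberry.gnm2.ann1.Lcu.2RBY.1g010820",
    "glyma.Wm82.gnm4.ann1.Glyma.01G069300",
]
for gene_id in sample:
    print(len(gene_id), gene_id)
print(overlong_ids(sample))
```

Transcript and protein IDs carry extra suffixes beyond the gene ID, so the FASTA headers are the IDs worth checking.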

StevenCannon-USDA commented 9 months ago

I had forgotten the history regarding arahy.Tifrunner.gnm1.ann1. (never mind that I was one of the responsible parties).

Following the principle from the spec that "Where available in the original annotations, the names should come from those annotation files, with the possible exception of stripping type identifiers (e.g. "gene:"), or shortening exceptionally cumbersome auto-generated strings or lengthy prefixes added in the original annotation form if those prefixes do not contribute to the uniqueness of the names within the annotation file. Such exceptions will need to be considered on a case-by-case basis."

I think the more proper solution would be - as I think you point out @adf-ncgr: ID=arahy.Tifrunner.gnm1.ann1.arahy.HA8THR Name=arahy.HA8THR

... and regarding my example of ID=lupal.Amiga.gnm1.ann1.gene:Lalb_Chr01g0003511, I would say that the ID is in error (or at least is unfortunate): I would strip "gene:" - which I think is the artifact of this group's prefixing of identifiers with GFF element type (which some people do, but I think is bad practice).

sammyjava commented 9 months ago

I agree. In general we should drop "this is a gene" identifier components like "gene:". Of course it's a gene! It's in a gene record!
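
As a small Python sketch of that cleanup, the code below removes a leading "gene:" from the original-identifier part of an ID. The names `TYPE_PREFIXES`, `strip_type_prefix` and `clean_id` are hypothetical, and the prefix list holds only the one prefix named in this thread.

```python
# Assumption: only the "gene:" prefix discussed here; extend as needed.
TYPE_PREFIXES = ("gene:",)


def strip_type_prefix(identifier: str) -> str:
    """Remove a leading feature-type prefix such as "gene:" if present."""
    for prefix in TYPE_PREFIXES:
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return identifier


def clean_id(gene_id: str) -> str:
    """Strip the type prefix from the part after gensp.Strain.gnm.ann."""
    parts = gene_id.split(".", 4)
    if len(parts) < 5:
        return gene_id  # not in the expected form; leave untouched
    parts[4] = strip_type_prefix(parts[4])
    return ".".join(parts)


print(clean_id("lupal.Amiga.gnm1.ann1.gene:Lalb_Chr01g0003511"))
```

Because the match requires the colon, identifiers that merely begin with the letters "gene" are left alone.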

adf-ncgr commented 9 months ago

I'm fine with dropping "gene" (for aesthetic reasons!) although:

we do include "CDS" and "exon" in IDs when we have to generate them (tacking them onto the Parent ID);
there's tripr.MilvusB.gnm2.ann1.gene6109, where we haven't dropped it, not to mention medtr.HM004.gnm1.ann1.g876, in which I suspect g = gene

I do think we need to come to an agreement about the "excessively long ID" issue, for which I can track down the specific example if needed (probably lentil, though not 100% positive)

sammyjava commented 9 months ago

Yeah, this is probably a bigger issue involving more consistent use of the ID field as well as the Name field. In fact, if Name is simply a substring of ID, @adf-ncgr has kicked the can up to more tightly specifying ID. Which is a good thing IMO.

adf-ncgr commented 9 months ago

may or may not be considered relevant, but I'll note that in a proposal associated with new TAIR-managed/NCBI-generated/community-curated annotations for Arabidopsis thaliana they refer to the AgBioData genome nomenclature working group recommendation which is fairly similar to our full-yuck scheme.

BUT, they also seem to deviate from the scheme of slavish prefixing (i.e. the base of the identifier is slightly modified between "primary" and "secondary" identifiers), although they are not exactly using the language of GFF in their description:

Locus Names:
Primary identifier: AGI id (At1g01010 or At1TE01010)
Secondary identifier:
ddAraThal.Col-0.Col-CC2.1.1G000001 for protein-coding and ncRNA genes
Number before the G is the chromosome (1-5, M/C)
Six digit zero padded number increments from the top of the chromosome to the bottom of the chromosome
ddAraThal.Col-0.Col-CC2.1.1TE000001 for transposable elements
Number before the TE is the chromosome (1-5, M/C)
Six digit zero padded number increments from the top of the chromosome to the bottom of the chromosome
Additional name/s: Symbol/s + full name/s
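
To make the quoted secondary-identifier pattern concrete, here is a small Python sketch that builds identifiers of that shape. The function `make_locus_id` and its parameters are hypothetical, and the layout (prefix, chromosome, G or TE, six-digit zero-padded counter) is read off the two examples in the quote rather than taken from an official specification.

```python
def make_locus_id(prefix: str, chromosome: str, index: int, kind: str = "G") -> str:
    """Build an identifier like ddAraThal.Col-0.Col-CC2.1.1G000001.

    Assumptions: kind is "G" for protein-coding and ncRNA genes or "TE"
    for transposable elements, and index counts from the top of the
    chromosome, zero padded to six digits.
    """
    if kind not in ("G", "TE"):
        raise ValueError(f"unexpected kind: {kind}")
    return f"{prefix}.{chromosome}{kind}{index:06d}"


prefix = "ddAraThal.Col-0.Col-CC2.1"
print(make_locus_id(prefix, "1", 1))
print(make_locus_id(prefix, "1", 1, kind="TE"))
```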

Needless to say, it seems like their "secondary identifier" would be playing the role of what we've been calling ID/primary identifier and their "primary identifier" is like our use of "Name". Also note that I'm not really advocating to go from gensp to "ToLID" (= Tree of Life ID) even if "drGlyMaxx" is pretty funky! (unclear if it's fully official but that's what a search at https://id.tol.sanger.ac.uk/search turns up).

Of course, we are (probably) not in a position to fully specify "primary identifiers" (aka our Names) to the extent TAIR is, unless we want to break with our own tradition and just ignore the original group's identification scheme (which in some cases might be considered a kindness). But I think the original intent of the featid_map file (e.g. glyma.JD17.gnm1.ann1.CLFP.featid_map.tsv.gz) was to allow some freedom in this regard (ie not merely to make explicit the addition of prefixes).

legumeinfo / datastore-specifications

Specify how gene Name attributes are given #44