Open sammyjava opened 11 months ago
Here are my opinions about those cases:
cajca.C.cajan_19181 ==> C.cajan_19181
cicar.CDCFrontier.Ca_28062 ==> Ca_28062
cicar.ICC4958.Ca_00001 ==> Ca_00001
glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100
glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050 ==> GmISU01.01G000050 *[1]
glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100 ==> GlysoPI483463.01G000100
glyso.W05.gnm1.ann1.Glysoja.01G000001 ==> Glysoja.01G000001
lupal.Lalb_Chr00c01g0403611 ==> Lalb_Chr00c01g0403611
lupan.Lup027320 ==> Lup027320
medtr.HM324.g1 ==> g1 *[2]
tripr.gene2499 ==> gene2499
vigun.IT97K-499-35.Vigun01g000100 ==> Vigun01g000100
Notes: *[1] Wm82_ISU01 will probably go away soon, to be replaced by an annotation called Wm82.gnm6.ann1, with names looking like glyma.01G00100 (derived from Wm82.gnm4.ann4 names when possible).
*[2] For the Medicago genes, here is a full list of the forms:
# Print the ID and Name of the first gene feature in each gene_models_main GFF
for file in */*gene_models_main.gff3.gz; do
  zcat "$file" | awk -v FS="\t" '$3~/gene/ {print $9}' | head -1 |
    perl -pe 's/ID=([^;]+);Note=[^;]+;Name=([^;]+);.+/$1\t$2/' |
    perl -pe 's/ID=([^;]+);Name=([^;]+);.+/$1\t$2/';
done
medtr.A17_HM341.gnm4.ann2.Medtr1g004930 Medtr1g004930
medtr.A17.gnm5.ann1_6.MtrunA17CPg0492171 MtrunA17CPg0492171
medtr.HM004.gnm1.ann1.g1 HM004.g1
medtr.HM010.gnm1.ann1.g1 HM010.g1
medtr.HM022.gnm1.ann1.g1 HM022.g1
medtr.HM023.gnm1.ann1.g1 HM023.g1
medtr.HM034.gnm1.ann1.g1 HM034.g1
medtr.HM050.gnm1.ann1.g1 HM050.g1
medtr.HM056.gnm1.ann1.g1 medtr.HM056.g1
medtr.HM058.gnm1.ann1.g1 medtr.HM058.g1
medtr.HM060.gnm1.ann1.g1 medtr.HM060.g1
medtr.HM095.gnm1.ann1.g1 medtr.HM095.g1
medtr.HM125.gnm1.ann1.h3436.02 medtr.HM125.h3436.02
medtr.HM129.gnm1.ann1.g1 medtr.HM129.g1
medtr.HM185.gnm1.ann1.g1 medtr.HM185.g1
medtr.HM324.gnm1.ann1.g1 medtr.HM324.g1
medtr.R108_HM340.gnm1.ann1.BZG31_000s000010 BZG31_000s000010
medtr.R108.gnmHiC_1.ann1.MtrunR108HiC_000001 MtrunR108HiC000001
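For anyone who would rather not depend on the attribute order that the two perl substitutions assume, here is a rough Python sketch of the same check. The function names (`parse_attributes`, `first_gene_id_and_name`, `first_gene_in_file`) are made up for illustration and are not part of any LIS tooling. One deliberate difference from the shell version: it matches the feature type exactly against `gene`, whereas the awk expression matches any type containing "gene".

```python
import gzip


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def first_gene_id_and_name(lines):
    """Return (ID, Name) of the first feature of type 'gene', or None if there is none."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[2] == "gene":
            attributes = parse_attributes(columns[8])
            return attributes.get("ID"), attributes.get("Name")
    return None


def first_gene_in_file(path):
    """Read a gzipped GFF3 file as text and report its first gene's ID and Name."""
    with gzip.open(path, "rt") as handle:
        return first_gene_id_and_name(handle)
```

Calling `first_gene_in_file` on each `*gene_models_main.gff3.gz` file should give the same two columns as the loop above, regardless of where Note and Name sit in the attribute column.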
For these, I'd like a second opinion from @adf-ncgr and Joann if appropriate. Regularity says the genes in the Zhou ... Young set should have the form "g#". But I feel a little squeamish about this. These genomes were all released and described as a group, and the assemblies and accessions are referred to in the main paper (Zhou, Silverstein et al., 2017) by the five-character HM### string. That said, I don't see particular instances in the paper where particular genes are discussed by name, so I don't feel I can make a strong argument for going beyond "g#".
Thanks for the opinions, @StevenCannon-USDA. I'll put these into a to-do checkbox list here and start updating them after giving @adf-ncgr and @joannmudge a chance to object. As for the Zhou, Silverstein, et al. genomes, I agree that the Name attribute should be just the final piece (g1) for regularity, but we do sacrifice regularity at times for Higher Reasons.
OK, I'm in agreement with most of these, but I think the original Names for tripr were actually like "Tp57577_TGAC_v2_gene10066", not just "gene10066". Should we use that instead, as Phytozome and Ensembl seem to do? I'm still a little unclear about what the principle is here (originalism or aesthetics), though we once tried to pin it down here: https://github.com/legumeinfo/datastore-specifications/issues/44
I would personally vote to keep medtr.HMxxx.g1 (or at least HMxxx.g1) which is seemingly no more problematic than having GlymaLee as part of a name, it just happens to also be identical with part of our full yuck system. But if we think g1 is better for any given medicago accession, I think that implies that strict Name originalism is the principle here, no matter how bad we think the names are, meaning we should be stuck with Tp57577_TGAC_v2_gene10066.
But whatever we decide, let's take it as an opportunity to resolve the open questions in https://github.com/legumeinfo/datastore-specifications/issues/44
I think we're close to convergence, and are down to the point of splitting hairs - which I guess is unavoidable. Here's the spec as it stands: https://github.com/legumeinfo/datastore-specifications/tree/main/Genus/species/annotations
And a key clause:
Where available in the original annotations, the names should come from those annotation files, with the possible exception of stripping type identifiers (e.g. "gene:"), or shortening exceptionally cumbersome auto-generated strings or lengthy prefixes added in the original annotation form if those prefixes do not contribute to the uniqueness of the names within the annotation file. Such exceptions will need to be considered on a case-by-case basis.
I would say that "exceptionally cumbersome strings ... if those prefixes do not contribute to the uniqueness of the names within the annotation file" is a fair description of Tp57577_TGAC_v2_gene10066. I mean: the Trifolium team has encoded Genus (T), species (p), accession (57577 I think), sequencing center (TGAC), and assembly version v2. I think this is a worthy case for an exception (shortening it to "gene10066"). But I won't fight anyone over it. If Sam is implementing, I say: go ahead and do what you think is right, and we'll be prepared to be delighted.
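The "do not contribute to the uniqueness of the names within the annotation file" condition can be checked mechanically before shortening anything. Below is a minimal sketch, assuming the Names from one annotation file have already been collected into a list. The function name `shortening_keeps_names_unique` and the regular expressions in the usage note are illustrative only and are not taken from the spec.

```python
import re


def shortening_keeps_names_unique(names, prefix_pattern):
    """Return True if removing the first match of prefix_pattern from each name
    leaves as many distinct names as there were before shortening."""
    original = set(names)
    shortened = {re.sub(prefix_pattern, "", name, count=1) for name in original}
    return len(shortened) == len(original)
```

For example, `shortening_keeps_names_unique(["Tp57577_TGAC_v2_gene10066", "Tp57577_TGAC_v2_gene2499"], r"^Tp57577_TGAC_v2_")` returns `True`, so the prefix adds nothing to uniqueness in that list. A pattern such as `r"^HM\d+\."` applied to `["HM056.g1", "HM058.g1"]` returns `False`, because both names collapse to `g1`.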
Thanks @StevenCannon-USDA, sounds like that clause is indeed the final refuge for the hair-splitters! I am in favor of shortening where there is substantial overlap with what full yuck is accomplishing. I think this would mean that we'd allow:
Name=Lcu.2RBY.1g010820 ID=lencu.CDC_Redberry.gnm2.ann1.1g010820
if we need to keep the IDs below some max length limit imposed by certain tools (e.g. BLAST)? Name here is "original". Or would we require that Name be 1g010820 if we invoked the "lengthy prefixes" clause on this one?
@adf-ncgr - yeah, I think I'd leave Lcu.2RBY.1g010820 (which means changing the ID in that case).
... When you're running an Airbnb and some guests insist on bringing all their own furniture.
This issue (see the title of this issue) is about the gensp prefixes, which it appears we all agree should be dropped. A protocol for how we populate the Name attribute otherwise is certainly a Good Thing. I don't see any argument for keeping the gensp prefix here, so I'll yank those from the appropriate places, and we can move the discussion of Names in general back to https://github.com/legumeinfo/datastore-specifications/issues/44 . I'll keep this issue open just so I can hit my checkboxes.
And yes, in the few cases where Name is full-yuck, I'll de-yuckify it down to the non-yuck portion. (example: glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100).
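A rough sketch of what the de-yuckifying could look like, in case it helps pin down the rule. This is only an illustration under stated assumptions, not the script actually used on the Datastore: the function name `shorten_name` is hypothetical, and the `gensp` and `accession` values are assumed to be known from the annotation directory. Passing `accession=None` drops only the gensp prefix, which covers the still-debated `HMxxx.g1` option.

```python
import re

# Matches a leading "gnm<something>.ann<something>." such as "gnm1.ann1." (assumed form)
GNM_ANN = re.compile(r"^gnm[^.]+\.ann[^.]+\.")


def shorten_name(name, gensp, accession=None):
    """Drop a leading 'gensp.' from name; if accession is given and follows,
    drop 'accession.' and any 'gnmN.annN.' after it as well."""
    prefix = gensp + "."
    if not name.startswith(prefix):
        return name  # no LIS-style prefix, leave the Name untouched
    short = name[len(prefix):]
    if accession and short.startswith(accession + "."):
        short = short[len(accession) + 1:]
        short = GNM_ANN.sub("", short, count=1)
    return short
```

With these assumptions, `shorten_name("glyma.Lee.gnm1.ann1.GlymaLee.01G000100", "glyma", "Lee")` gives `GlymaLee.01G000100`, and `shorten_name("medtr.HM324.g1", "medtr")` gives `HM324.g1`, while adding `"HM324"` as the accession gives `g1`.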
We talked about this already, but I'd like to take action and update the Datastore where appropriate.
There are a number of gene_models_main GFFs that have the Name attribute starting with gensp. This seems non-conformant to me, in the sense that Name is meant to be what a gene is called in the source material and the gensp prefixes tend to be an LIS thing.
Here's the list with an example of a GFF line for each case. I'd like @StevenCannon-USDA to confirm that the gene Name attributes should, in fact, contain the gensp prefix in these cases or, when not, to update the GFFs (outsourcing that to me is fine). Also, @adf-ncgr may have some arcane reasons for including the gensp prefix in certain cases. (Name uniqueness does not qualify as a reason, in my opinion, but he may have some JBrowse-related or other reasons for doing so.)
cajca.ICPL87119.gnm1.ann1.Y27M.gene_models_main.gff3.gz
cicar.CDCFrontier.gnm1.ann1.nRhs.gene_models_main.gff3.gz
cicar.ICC4958.gnm2.ann1.LCVX.gene_models_main.gff3.gz
Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz
Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz
Glycine/soja/annotations/PI483463.gnm1.ann1.3Q3Q/glyso.PI483463.gnm1.ann1.3Q3Q.gene_models_main.gff3.gz
Glycine/soja/annotations/W05.gnm1.ann1.T47J/glyso.W05.gnm1.ann1.T47J.gene_models_main.gff3.gz
Lupinus/albus/annotations/Amiga.gnm1.ann1.3GKS/lupal.Amiga.gnm1.ann1.3GKS.gene_models_main.gff3.gz
Lupinus/angustifolius/annotations/Tanjil.gnm1.ann1.nnV9/lupan.Tanjil.gnm1.ann1.nnV9.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM056.gnm1.ann1.CHP6/medtr.HM056.gnm1.ann1.CHP6.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM058.gnm1.ann1.LXPZ/medtr.HM058.gnm1.ann1.LXPZ.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM060.gnm1.ann1.H41P/medtr.HM060.gnm1.ann1.H41P.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM095.gnm1.ann1.55W4/medtr.HM095.gnm1.ann1.55W4.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM125.gnm1.ann1.KY5W/medtr.HM125.gnm1.ann1.KY5W.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM129.gnm1.ann1.7FTD/medtr.HM129.gnm1.ann1.7FTD.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM185.gnm1.ann1.GB3D/medtr.HM185.gnm1.ann1.GB3D.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM324.gnm1.ann1.SQH2/medtr.HM324.gnm1.ann1.SQH2.gene_models_main.gff3.gz
Trifolium/pratense/annotations/MilvusB.gnm2.ann1.DFgp/tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main.gff3.gz
Vigna/unguiculata/annotations/IT97K-499-35.gnm1.ann2.FD7K/vigun.IT97K-499-35.gnm1.ann2.FD7K.gene_models_main.gff3.gz
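To find files like the ones listed above without eyeballing each GFF, the gensp prefix can be derived from the file name and compared against a Name value. The sketch below assumes the file names follow the `gensp.accession.gnmN.annN.KEY.gene_models_main.gff3.gz` pattern seen in this list and that accessions contain no dots. The regular expression and both function names are illustrative guesses, not part of the Datastore spec.

```python
import os
import re

# Assumed file name layout: gensp.accession.gnmN.annN.KEY.gene_models_main.gff3.gz
FILENAME = re.compile(
    r"^(?P<gensp>[^.]+)\.(?P<accession>[^.]+)\.(?P<gnm>gnm[^.]+)\."
    r"(?P<ann>ann[^.]+)\.(?P<key>[^.]+)\.gene_models_main\.gff3\.gz$"
)


def parse_annotation_filename(path):
    """Split a gene_models_main file name into its components, or return None."""
    match = FILENAME.match(os.path.basename(path))
    return match.groupdict() if match else None


def name_starts_with_gensp(name, path):
    """True if the Name value begins with the gensp prefix taken from the file name."""
    parts = parse_annotation_filename(path)
    return parts is not None and name.startswith(parts["gensp"] + ".")
```

Feeding each file's first gene Name into `name_starts_with_gensp` along with the file path would flag the cases that need updating.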