Gene names inconsistently have gensp.Strain prepended

sammyjava commented 1 year ago

This came up because I'm rewriting how I assign pathways to genes, and, of course, Plant Reactome uses gene names that rarely align exactly with ours. BUT, more than that, I see that we sometimes prepend gensp to the gene name (in the GFF Name attribute) and sometimes we don't. Shouldn't we try to be consistent on this? We don't use the name for anything functional that I know of, so I'd like to get them consistently lacking gensp so they are a bit more like names used elsewhere. Only 9 species have gensp.Name, and that's not consistently done within those species, either (see example in second comment).

starts_with	count
aesev.	0
araca.	0
aradu.	0
arahy.	0
araip.	0
arast.	0
cajca.	40071
cerca.	0
cicar.	58526
cicec.	0
cicre.	0
glycy.	0
glyd3.	0
glydo.	0
glyfa.	0
glyma.	96036
glyso.	102507
glyst.	0
glysy.	0
labpu.	0
lencu.	0
lener.	0
lotja.	0
lupal.	41657
lupan.	33072
medpo.	0
medru.	0
medsa.	0
medtr.	517176
phaac.	0
phalu.	0
phavu.	0
pissa.	0
tripr.	39948
trisu.	0
vicfa.	0
vigan.	0
vigra.	0
vigun.	31948

sammyjava commented 1 year ago

Here's an example of inconsistent gensp prepending within a species:

primaryidentifier	name
cicar.ICC4958.gnm2.ann1.Ca_10073	cicar.ICC4958.Ca_10073
cicar.CDCFrontier.gnm1.ann1.Ca_10073	cicar.CDCFrontier.Ca_10073
cicar.CDCFrontier.gnm2.ann1.Ca_10073	Ca_10073

sammyjava commented 1 year ago

And I'll add that gene names need NOT be unique between annotations. They're just easy-to-read gene names. There's nothing functionally wrong (AFAIK) if all three genes above have the name Ca_10073 , which I'd actually like since I'm guessing they're orthologs, anyway. The primary identifier (from GFF ID) must be unique across the entire Datastore for a given feature type.

sammyjava commented 1 year ago

Any complaints about me going ahead with fixing the Name records in the GFFs to be short and consistent? I'd like get this done since I'm working with gene names. @StevenCannon-USDA @adf-ncgr

Here's an example of what I'd do:

Current:

ID	Name
vigun.IT97K-499-35.gnm1.ann1.Vigun04g097100	Vigun04g097100
vigun.IT97K-499-35.gnm1.ann2.Vigun04g097100	vigun.IT97K-499-35.Vigun04g097100

Fixed:

ID	Name
vigun.IT97K-499-35.gnm1.ann1.Vigun04g097100	Vigun04g097100
vigun.IT97K-499-35.gnm1.ann2.Vigun04g097100	Vigun04g097100

adf-ncgr commented 1 year ago

actually I thought you'd already done something like this (ie Name = strip_full_yuck(ID)).

We used to use name in certain multi-species/strain display contexts (in fact, labeling for display is what the GFF3 spec says it is for), which is why we prepended gensp (sometimes gensp.strain- as in the case of cicar, where in fact cicar.ICC4958.Ca_10073 and cicar.CDCFrontier.Ca_10073 were NOT "orthologs"); but we've since opted to use full yuck IDs in these contexts instead.

I will note that Name = strip_full_yuck(ID) could potentially result in loss of info from the original files, since sometimes they come in with Names that are not just variants on the ID, e.g.: medtr.A17.gnm5.MtrunA17Chr1 EuGene mRNA 44665910 44672319 . - . ID=medtr.A17.gnm5.ann1_6.MtrunA17Chr1g0197491.1;Parent=medtr.A17.gnm5.ann1_6.MtrunA17Chr1g0197491;Name=MtRPG%7CMtrunA17Chr1g0197491;locus_tag=MtrunA17_Chr1g0197491;product=hypothetical%20protein;Ontology_term=GO:0005634;Dbxref=Coils:Coil,InterPro:IPR019448,PANTHER:PTHR32484,PANTHER:PTHR32484:SF4,Pfam:PF10358 (and yes, that's an mRNA record not a gene, but in principle it could happen in a gene record- would not be surprised if in fact we did already clobber it in the gene record for this gff3 file in the datastore).

Maybe this would be a good discussion point for tomorrow's LIS/PB meeting, just so we're all sure we are on the same page about it before proceeding to change stuff.

sammyjava commented 1 year ago

I never said I'd strip the full yuck to get the Name. I said I'd strip the LIS-prepended gensp.Strain, which leaves a name that may be different from the secondaryIdentifier (which IS the yuck-stripped minimal identifier). I think the end result is that I'll get our Name attributes closer to what was "in the original source" since that's what we're supposed to be doing, as opposed to prepending characters for "certain multi-species/strain display contexts." Correct?

I'd rather not get into these weeds in the LIS meeting if we can avoid it. I don't think anyone uses the Name attribute for anything functional. But if you insist that it be a discussion tomorrow, put it on the agenda and I'll wait.

adf-ncgr commented 1 year ago

OK, I guess I thought secondaryIdentifier had some relationship with Name (more than "usually equal to"). Fine with me to strip gensp.Strain, but what are you planning to do that will depend on these values? Do we need to have some policy in regard to how they are handled in future or are you just trying to get things to look more consistent? (ie if in future we get an original gff3 that happens to have gensp. or gensp.Strain. as part of their naming would we leave as is, in deference to the doctrine of original intent or strip in deference to the doctrine of looking consistent?) Regarding discussion, I guess I'm just curious if the souschef does anything special regarding Names. I'll look at the code.

sammyjava commented 1 year ago

Mainly consistency, but also operating under the belief that biologists tend to use a name like "Vigun04g097100" rather than "vigun.IT97K-499-35.Vigun04g097100", and the Name attribute is mainly meant to provide a way for biologists to find a gene based on non-LIS use. (See, for example, the email that I just got referencing "Vradi04g00001979",which seemed very timely.)

Consistency matters because if Vigun04g097100 and vigun.IT97K-499-35.Vigun04g097100 are actually the same gene in two different annotations (ann1 and ann2 in this example), I think we should use the same name, which at least suggests that they ARE the same gene. (Of course, different assembly/annotations may assign a given name to different genes , which is why I said suggests rather than guarantees.)

It also makes my "LIS name to Ensembl name" code a bit shorter since I don't have to handle all the weird variations on name. :) But that's not the main reason.

sammyjava commented 1 year ago

Oh, what I'm doing with Name is assigning Plant Reactome pathways and linking out to Ensembl Plants and Plant Reactome. :)

adf-ncgr commented 1 year ago

TLDR: one way or another you will probably still have to handle weird variations on Name

Is there any reason that you can't use strip_full_yuck(ID) for this same purpose of linking to Ensembl? Not that I'm at all opposed to stripping off the gensp/gensp.Strain from Name in cases where we've added them (of course, don't strip off Glyma. from Glyma.16g150700 and "stuff like that"). But I do think it would be nice to try to get everyone on the same page about what/how we (sous-chefs) ought to munge and what we ought not to munge in gffs from the wild.

Also, I could be wrong, but I'm not sure that in general Ensembl Plants will use the "given Name" in gff for their identifiers. I think it's probably a big can of worms and that much like us, they (Gramene/Ensembl) will take a gff as given and then make some subjective decisions on how to make it work for them.

To go back to my earlier example (which isn't all that conclusive since Ensembl still seems to be using Medicago truncatula v4), this is the "original gff record": MtrunA17Chr1 EuGene gene 44665910 44672319 . - . ID=gene:MtrunA17Chr1g0197491;Name=MtRPG;locus_tag=MtrunA17_Chr1g0197491; Note the prepending of "gene:" to the ID and the introduction of an underscore in the "locus_tag", in addition to the biologist-friendly Name. what do you suppose Ensembl will end up doing with this if they ever make the switch? for better or worse, we did this: medtr.A17.gnm5.MtrunA17Chr1 EuGene gene 44665910 44672319 . - . ID=medtr.A17.gnm5.ann1_6.MtrunA17Chr1g0197491;Name=MtrunA17Chr1g0197491;locus_tag=MtrunA17_Chr1g0197491;Dbxref=Coils:Coil,InterPro:IPR019448,PANTHER:PTHR32484,PANTHER:PTHR32484:SF4,Pfam:PF10358;Note=myosin heavy chain-like protein%2C putative%3B IPR019448 (EEIG1/EHBP1 N-terminal domain)

I'd argue we probably should have left "MtRPG", though possibly converting it to an Alias or "symbol". A lot of this probably is splitting hairs since downstream consumers can re-munge the files as they see fit for their purposes, but now that we're serving JBrowse2 directly from the gff in the datastore, it does have some implications for how the browsers will behave.

sammyjava commented 1 year ago

The point is that the Ensembl names have nothing to do with our yuck-stripped names. And that our Name attributes aren't consistent, and therefore do not follow Steven's mantra of "preserve the original names from the original lab." I don't care about the Ensembl names specifically, I've got a routine for converting our names to Ensembl names now. But what I do care about is the same gene being called "Vigun04g097100" in one annotation and "vigun.IT97K-499-35.Vigun04g097100" in another annotation. That's just shoddy, and should be corrected.

As for yuck-strippage, I already do that in SequenceFeature.secondaryIdentifier.

Name can be absolutely anything. It need not resemble the least-significant field of our LIS identifier. I suggested we use the Ensembl name when appropriate. Steven strongly objected, because he says that we get the gene names from the originators of the assembly/annotation, which apparently is not what Ensembl does. But we've obviously not done that consistently, so I want to fix that.

In other words, I'd like a consistent policy about the Gene name. I think Steven spelled that out and I'd like to back-enforce it now.

sammyjava commented 1 year ago

I'd like to repeat for clarity: the Name attribute is a completely independent field in a GFF record. It need not relate IN ANY WAY to the ID value. In fact, if a Name attribute value allows us to connect to other data resources in the world, that's a GOOD USE of it. That's totally what I was thinking when I suggested we use Ensembl gene names when possible. I'm fine with that being opposed, but let's use a USEFUL name for genes, not something that is simply redundant with what's already in the ID.

sammyjava commented 1 year ago

And regarding the Symbol attribute, that has an actual meaning. The symbol for ID=glyma.Wm82_IGA1008.gnm1.ann1.SoyW82_11G145800, Name=SoyW82_11G145800 is REVOLUTA. I don't think we get to make up gene symbols, those result from publications about gene function. (You knew I was going to use that gene as an example.)

sammyjava commented 1 year ago

Closing this in favor of an issue in datastore-specifications to specify how Name is determined. Which we clearly need.

legumeinfo / datastore-issues

Gene names inconsistently have gensp.Strain prepended #178