Enforce ID=LIS-identifier; Name=whatever in GFFs

sammyjava commented 2 years ago

Up to now I've had a bunch of hacky code in GFFRecordHandler that would make decisions based on what is in the ID attribute and Name attribute. Now, I'd like to regularize all of our GFFs so that the mines exactly reproduce what's in those GFF attributes. ID->primaryIdentifier (unique per feature type) and Name->secondaryIdentifier/name (arbitrary, non-unique).
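A minimal sketch of that mapping, in Python for brevity (the actual loader is not shown in this thread). The function names and the `Name` value below are made up for illustration; only the `ID` is taken from an annotation mentioned in this thread.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column (column 9) into a dict of decoded values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        attrs[unquote(key)] = unquote(value)  # GFF3 percent-encodes ; , = etc.
    return attrs


def to_mine_fields(column9: str) -> dict:
    """Hypothetical helper: copy ID and Name straight through, with no guessing."""
    attrs = parse_attributes(column9)
    return {
        "primaryIdentifier": attrs["ID"],            # required, unique per feature type
        "secondaryIdentifier": attrs.get("Name"),    # optional, arbitrary, non-unique
    }


# The Name value here is invented for the example.
fields = to_mine_fields("ID=glyma.Lee.gnm1.ann1.GlymaLee.02G198600;Name=GlymaLee.02G198600")
print(fields)
```

The point of the sketch is that nothing is inferred: whatever is in `ID` and `Name` ends up in the mine unchanged.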

The exception to the "arbitrary Name" rule is for markers: I merge genetic markers, which are not associated with a genome assembly, with "genomic" markers mapped to a genome assembly in a GFF, so the Name attribute must match the last portion of the ID. This is already in place, just mentioning it here.
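The marker rule could be checked along these lines. This is only a sketch: it assumes "last portion" means the text after the final dot in the ID, and both the function name and the example identifiers are invented.

```python
def marker_name_matches_id(feature_id: str, name: str, sep: str = ".") -> bool:
    """Return True if Name equals the last sep-delimited portion of the ID.

    Assumption: the "last portion" of an ID is whatever follows the final dot.
    """
    return feature_id.rsplit(sep, 1)[-1] == name


# Invented identifiers, for illustration only.
print(marker_name_matches_id("example.gnm1.mrk.Set1.M0001", "M0001"))
print(marker_name_matches_id("example.gnm1.mrk.Set1.M0001", "M0002"))
```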

The vast majority of GFFs already follow this rule. But there are a few GFFs in which the ID is not an LIS identifier. I would like to update those, as they come up, and make this official since it involves a spec update. My new GFF loader has this assumption.

[x] Sam
[x] Andrew
[x] Steven

StevenCannon-USDA commented 2 years ago

Fine by me. It's our own particular use of ID, but it's GFF3-compliant, as far as I know. I just checked the Glycine max annotations. They all look OK except Wm82.gnm2.ann2.BG1Q, which is the NCBI's RefSeq annotation. This one has been a real headache: faulty parent-child relationships, funky locus names, etc. I finally gave up wrangling the GFF, and derived a new one with gmap, by projecting the transcripts (I think) onto the genome. I expect the results are imperfect, but the GFF at least validates. That file is glyma.Wm82.gnm2.ann2.BG1Q.gene_models_gmap_match.gff3.gz

sammyjava commented 2 years ago

Yeah, the alternative would be to use the Name or another attribute for the LIS identifier, but the GFF validators don't require those to be unique, while they do require it of the ID field, and we do want LIS identifiers to be unique at least within a type. (GFF validation enforces unique ID across the entire file.) It's nice to use Name for the originating gene names (which are often quite funky) so the originating folks can "find their own genes."
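A rough sketch of a per-type uniqueness check, assuming tab-delimited nine-column GFF3 lines. The function name and the example records are hypothetical. Note that it flags every repeated ID, which is stricter than GFF3 itself, since the spec lets one feature span several lines that share an ID.

```python
from collections import Counter


def duplicate_ids(gff_lines):
    """Return sorted (feature type, ID) pairs that occur on more than one line."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        feature_type = columns[2]
        for field in columns[8].split(";"):
            if field.startswith("ID="):
                counts[(feature_type, field[len("ID="):])] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Invented records for illustration.
gff = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=example.gene1;Name=foo",
    "chr1\tsrc\tgene\t200\t300\t.\t+\t.\tID=example.gene1;Name=bar",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=example.gene1.1;Parent=example.gene1",
]
print(duplicate_ids(gff))
```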

I'm loading all the attributes and relations that we typically have: ID, Name, Dbxref (partial), Ontology_term, and Note.

BTW, we don't seem to have the popular attribute Symbol. For example, glyma.Lee.gnm1.ann1.GlymaLee.02G198600 has

Note=Cytochrome P450 superfamily protein%3B IPR001128 (Cytochrome P450)%3B GO:0005506 (iron ion binding)%2C GO:0020037 (heme binding)%2C GO:0055114 (oxidation-reduction process)

for which InterPro has the short name Cyt_P450, which one might have put into the Symbol field. But I load InterPro, so that gene (and the others from other annotations) will come up on a search for Cyt_P450, so it's perhaps better not to put that in the GFF, at least from a mine point of view. Just a comment.

adf-ncgr commented 2 years ago

Regarding ID = LIS-identifier: totally agree that this should be enforced, which includes ensuring that FASTA sequence IDs use them too. The one quasi-exception is that CDS records in the GFF will each have distinct IDs (since we are requiring uniqueness, although the GFF spec allows ID non-uniqueness for grouping), while the FASTA CDS records have identifiers that match the transcript (as do the proteins), i.e. their Parent in the GFF.
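One way to sanity-check that relationship is to compare the Parent values of the GFF's CDS records against the FASTA headers. This is only a sketch under assumptions: the function names and sample records are invented, and the FASTA identifier is taken to be the first whitespace-delimited token after `>`.

```python
def fasta_ids(fasta_lines):
    """Collect the first token of each FASTA header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def cds_parents(gff_lines):
    """Collect every Parent ID named by a CDS record."""
    parents = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "CDS":
            continue
        for field in columns[8].split(";"):
            if field.startswith("Parent="):
                # GFF3 allows several comma-separated parents
                parents.update(field[len("Parent="):].split(","))
    return parents


# Invented records for illustration.
gff = [
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=example.gene1.1;Parent=example.gene1",
    "chr1\tsrc\tCDS\t1\t50\t.\t+\t0\tID=example.gene1.1.CDS.1;Parent=example.gene1.1",
    "chr1\tsrc\tCDS\t60\t100\t.\t+\t1\tID=example.gene1.1.CDS.2;Parent=example.gene1.1",
    "chr1\tsrc\tCDS\t200\t260\t.\t+\t0\tID=example.gene2.1.CDS.1;Parent=example.gene2.1",
]
fasta = [">example.gene1.1 some description", "ATGAAATAG"]

missing = cds_parents(gff) - fasta_ids(fasta)
print(sorted(missing))
```

Anything left in `missing` is a transcript that has CDS records in the GFF but no matching sequence ID in the CDS FASTA.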

Regarding Symbol: whether or not GFF is the best place for this, it would be great to start some effort to associate biologist-names with gene model IDs. I expect SoyBase maintains something like this for G. max, whereas a French group has done a nice job of curating M. truncatula (see "Gene acronyms vs IDs" under the Downloads menu at https://lipm-browsers.toulouse.inra.fr/k/legoo/ ). Those two species would be a good place to start developing the "universal translation service". In some sense, it seems like Symbols ought to associate with pangene sets, but there is possibly some ambiguity about gene symbols vs allele symbols.

sammyjava commented 2 years ago

Yeah, I guess Symbol isn't used as much as I thought, in favor of Name being the short name referring to gene function, e.g.

5       araport11       gene    24396981        24402195        .       +       .       ID=gene:AT5G60690;Name=REV;biotype=protein_coding;description=Homeobox-leucine zipper protein REVOLUTA [Source:UniProtKB/Swiss-Prot%3BAcc:Q9SE43];gene_id=AT5G60690;logic_name=araport11
17      ensembl_havana  gene    41196312        41277500        .       -       .       ID=gene:ENSG00000012048;Name=BRCA1;biotype=protein_coding;description=breast cancer 1%2C early onset [Source:HGNC Symbol%3BAcc:1100];gene_id=ENSG00000012048;logic_name=ensembl_havana_gene;version=15

REVOLUTA having to do with how those mutants' leaves grow (not nearly as snarky as many Arabidopsis gene names are) and BRCA1 being BReast CAncer susceptibility protein 1. So perhaps Symbol is appropriate for something separate, like pangene sets, but not Cyt_P450 in my example above.

sammyjava commented 2 years ago

We all agree on this, so I'm closing it and any non-conformant GFFs that come up shall be duly fixed (or posted to datastore-issues for the fixin').

legumeinfo / datastore-specifications

Enforce ID=LIS-identifier; Name=whatever in GFFs #22