legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Annotation collections with genes in GFA that are not in GFF #154

Closed sammyjava closed 1 year ago

sammyjava commented 1 year ago
adf-ncgr commented 1 year ago

I suspect the issue with Tifrunner.gnm2.ann1.4K0L has to do with the way this was constructed by liftover of genes from the older assembly to the newer. Probably some did not successfully lift, and the other files, which have no coordinate info, were just given the old s/gnm1/gnm2/ treatment and hence retained those genes. I will not be sorry when the new gnm2.ann2 annotations are ready for the datastore, but I suppose the thing to do for these would be to regenerate the fastas using gffread and re-run gfa.

Will have to look more closely at W05 as I don't know why that would have issues.

adf-ncgr commented 1 year ago

OK, so the reason these files have gfa entries referring to genes not in the gff is that there are proteins (and corresponding transcripts and CDS) that appear in the fasta files but correspond to nothing in the gff. Will it suffice to remove the gfa entries for genes whose existence was inferred erroneously, or do we also need to clean up the fasta? I have some hesitation in doing so which isn't entirely due to being lazy; but if having extra entries in the fastas will cause errors in loading, I can go ahead with the further clean-up. It's really only one "gene" in the case of W05, but it impacts 133 in Tifrunner.gnm2.ann1 (which, again, is a consequence of the way these genes were lifted over from the older to the newer assembly).
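For reference, the check described above (gene IDs that appear in the gfa but not in the gff) could be sketched roughly as follows. This is only an illustration, not the script actually used here. It assumes the gfa is tab-delimited with the gene ID in the first column, that header and comment lines start with `#`, and that genes appear in the gff as `gene` features with an `ID=` attribute. The function names are hypothetical.

```python
import re


def gff_gene_ids(gff_lines):
    """Collect the ID attribute of every 'gene' feature in GFF3 lines."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # only 'gene' features define gene IDs
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if match:
            ids.add(match.group(1))
    return ids


def gfa_gene_ids(gfa_lines, gene_col=0):
    """Collect gene IDs from GFA lines (ASSUMED: tab-delimited, gene in column 0)."""
    ids = set()
    for line in gfa_lines:
        if line.startswith("#") or not line.strip():
            continue  # ASSUMED: header/comment lines start with '#'
        cols = line.rstrip("\n").split("\t")
        if len(cols) > gene_col:
            ids.add(cols[gene_col])
    return ids


def orphan_genes(gff_lines, gfa_lines):
    """Return gene IDs referenced in the GFA that have no gene feature in the GFF."""
    return sorted(gfa_gene_ids(gfa_lines) - gff_gene_ids(gff_lines))
```

Both inputs are any iterables of lines, so open file handles can be passed directly. Any header line in the gfa that does not start with `#` would be reported as an orphan and would need to be skipped separately.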

sammyjava commented 1 year ago

I think the simple and correct thing is to just remove those genes from the GFA. Those are not genes for which we can assign pathways, because they don't exist in the annotation. Proteins in the annotation that aren't associated with a gene are their own business.

adf-ncgr commented 1 year ago

Which would you consider more correct, to suppress the GFA record altogether or to leave the gene id empty but indicate the assignment of the protein to the family?

sammyjava commented 1 year ago

Remove the entire record.
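A minimal sketch of dropping whole records, under the same assumptions about the gfa layout (tab-delimited, gene ID in the first column, `#` for header lines; none of this is confirmed by the thread). The helper name `split_gfa` is hypothetical. It returns the removed lines separately so they can be set aside rather than lost.

```python
def split_gfa(gfa_lines, valid_gene_ids, gene_col=0):
    """Split GFA lines into (kept, removed) by whether the gene ID is known.

    ASSUMED layout: tab-delimited, gene ID in column `gene_col`,
    header/comment lines starting with '#'.
    """
    kept, removed = [], []
    for line in gfa_lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)  # headers and blank lines pass through untouched
            continue
        cols = line.rstrip("\n").split("\t")
        gene = cols[gene_col] if len(cols) > gene_col else ""
        if gene in valid_gene_ids:
            kept.append(line)
        else:
            removed.append(line)  # the entire record is dropped, not just the gene ID
    return kept, removed
```

Here `valid_gene_ids` would be the set of gene IDs present in the gff; the `kept` lines form the cleaned gfa, and `removed` can be written to a separate file if desired.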

adf-ncgr commented 1 year ago

Ok should be good to go. W05 is a classic datastory, with one gene having been moved into its own "bad_gene" gff file for reasons I'm not clear on. But I moved the gfa record into a parallel holding pen. Having a "bad_gene" file seems vaguely like eugenics...