Genes missing Note attribute

sammyjava commented 1 year ago

I fixed a validator bug recently that had allowed Note-free genes to be loaded. Now I'm hitting GFFs with genes missing notes. This example has four. They don't have gene family assignments, of course. Shall I just insert some standard Note like "Protein of unknown function" for these? Or do you @adf-ncgr want to revisit them as they turn up? I presume they are simply unknown putative proteins. But I want to continue validating Note since we sometimes get GFFs with a lot of missing Note attributes that should actually be fixed (since I'm the Datastore QA Guy).

(Reminder: I do not require that all gene records have GO terms. I require that at least ONE gene record has GO terms.)

[shokin@peanutbase-stage ~/v2/Arachis/hypogaea/annotations/Tifrunner.gnm1.ann1.CCJH]$ zgrep "   gene    " arahy.Tifrunner.gnm1.ann1.CCJH.gene_models_main.gff3.gz | grep -v Note
arahy.Tifrunner.gnm1.Arahy.05   maker   gene    106151243   106152660   .   +   .   ID=arahy.Tifrunner.gnm1.ann1.1J2ERG;Name=1J2ERG
arahy.Tifrunner.gnm1.Arahy.12   maker   gene    49970454    49979593    .   +   .   ID=arahy.Tifrunner.gnm1.ann1.HJ3QQG;Name=HJ3QQG
arahy.Tifrunner.gnm1.Arahy.13   maker   gene    11134134    11135896    .   +   .   ID=arahy.Tifrunner.gnm1.ann1.P0GE2S;Name=P0GE2S
arahy.Tifrunner.gnm1.Arahy.14   maker   gene    116023224   116027506   .   -   .   ID=arahy.Tifrunner.gnm1.ann1.NXFW6D;Name=NXFW6D

adf-ncgr commented 1 year ago

Looks like these "genes" have no subfeatures, and hence have no mRNAs/proteins in the FASTAs. That's why they didn't get any functional descriptors added (no protein to homologize). I'd say we should probably just yank them, though if they aren't causing any actual problems we could also just let them be. Somewhere there's a GCV that knows about them, but that doesn't mean they should be allowed to bask among the immortals.
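For anyone who wants to repeat the "no subfeatures" check on another GFF, here is a minimal Python sketch. It is not the datastore validator: the function names `parse_attributes` and `childless_genes` are made up for illustration, and it assumes standard nine-column, tab-separated GFF3 with `ID` and `Parent` attributes.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def childless_genes(lines):
    """Return the IDs of gene records that no other record names as Parent.

    Hypothetical helper: `lines` is any iterable of GFF3 text lines.
    """
    gene_ids = []
    parents = set()
    for line in lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        attrs = parse_attributes(fields[8])
        if fields[2] == "gene" and "ID" in attrs:
            gene_ids.append(attrs["ID"])
        # Parent may hold a comma-separated list of IDs.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.add(parent)
    # Keep file order so the report matches the GFF.
    return [gene_id for gene_id in gene_ids if gene_id not in parents]
```

For a gzipped file such as the one above, the lines can come from a handle opened with `gzip.open(path, "rt")`.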

sammyjava commented 1 year ago

Well, as this issue states, they are causing problems - blocking loading due to validation failure. So I'll yank 'em and do so in future cases of this situation. Thanks.

adf-ncgr commented 1 year ago

Let's not discard them automatically in future. I am comfortable dropping them here since I know for a fact that this gene annotation did not include any non-coding genes, but I think we may have some that do. We'll need to decide how to handle such cases at some point.

sammyjava commented 1 year ago

Well, I can just not load the genes that lack Notes. I can change the validation to a warning rather than an error if it's only a few genes that lack Note (rather than all of them or a large fraction of them).

Let's say up to 100 genes will be allowed to lack Note and will not be loaded into the mine, but will spit out a warning saying how many lack Note; more than 100 will produce an error which aborts the load.

adf-ncgr commented 1 year ago

I guess I'm not clear on why a gene without a Note couldn't be loaded into the mine if we thought it was appropriate to have them as nondescript; I thought the validation was basically just to catch situations where we forgot to do our functional annotation at all. I like the idea of having a check, but I don't think we ought to be dropping genes without any review whatsoever.

sammyjava commented 1 year ago

That's fine, if that's what you want. It's a question of what we call "good" versus "bad" data. Maybe all our data is "bad" so we don't have to worry about it. :) I'll not skip genes that lack Note and I'll allow up to 100 Note-free genes to be loaded. Above that I think we need to ask ourselves why we can't functionally annotate so many supposed genes.

adf-ncgr commented 1 year ago

Sounds good. Fun fact: the NCBI annotation of arahy.Tifrunner.gnm1 has >25k genes of biotype snoRNA. Bad!

sammyjava commented 1 year ago

Unacceptable! In any case, I've implemented this now. Warning on 1-100 genes with no Note, error on >100.
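For reference, the rule as agreed in this thread can be sketched roughly as follows. This is an illustration in Python, not the actual validator code: `check_gene_notes`, `ValidationError` and `MAX_NOTE_FREE_GENES` are hypothetical names, and the input is assumed to be nine-column, tab-separated GFF3 lines.

```python
import logging

# Threshold agreed in this thread: warn on 1-100 Note-free genes, error above.
MAX_NOTE_FREE_GENES = 100


class ValidationError(Exception):
    """Hypothetical error type that aborts the load."""


def check_gene_notes(gff_lines, max_missing=MAX_NOTE_FREE_GENES):
    """Return IDs of gene records lacking Note; raise if there are too many."""
    missing = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        # Attribute column is key=value pairs separated by semicolons.
        attrs = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        if "Note" not in attrs:
            missing.append(attrs.get("ID", "(no ID)"))
    if len(missing) > max_missing:
        raise ValidationError(
            f"{len(missing)} genes lack Note (limit {max_missing})"
        )
    if missing:
        # Note-free genes are still loaded; the warning only reports the count.
        logging.warning("%d genes lack Note", len(missing))
    return missing
```

The Note-free genes are returned rather than dropped, in line with the decision above that they are still loaded and only reported.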

legumeinfo / datastore-issues

Genes missing Note attribute #185