Closed by sammyjava 1 year ago
Looks like these "genes" have no subfeatures, and hence have no mRNA/proteins in the FASTAs. That's why they didn't get any functional descriptors added (no protein to homologize). I'd say we should probably just yank them, though if they aren't causing any actual problems we could also just let them be. Somewhere there's a GCV that knows about them, but that doesn't mean they should be allowed to bask among the immortals.
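For anyone wanting to spot these cases ahead of a load, here is a minimal sketch of how genes with no subfeatures might be found in a GFF3 file. It is not the validator's actual code; the function names `parse_attributes` and `genes_without_subfeatures` are made up for illustration, and it assumes standard GFF3 `ID` and `Parent` attributes in column 9.

```python
def parse_attributes(column9):
    """Parse a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def genes_without_subfeatures(gff_lines):
    """Return IDs of gene records that no other record names as its Parent.

    Hypothetical helper: assumes every gene has an ID attribute and that
    children (mRNA etc.) reference it through Parent.
    """
    gene_ids = []
    parent_ids = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and "ID" in attrs:
            gene_ids.append(attrs["ID"])
        # Parent may hold several comma-separated IDs
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parent_ids.add(parent)
    return [gene_id for gene_id in gene_ids if gene_id not in parent_ids]
```

Calling `genes_without_subfeatures(open(path))` on an uncompressed GFF3 would list the childless genes in file order, so they can be reviewed rather than silently dropped.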
Well, as this issue states, they are causing problems - blocking loading due to validation failure. So I'll yank 'em and do so in future cases of this situation. Thanks.
Let's not discard them automatically in the future. I am comfortable dropping them here since I know for a fact that this gene annotation did not include any non-coding genes, but I think we may have some that do. We'll need to decide how to handle such cases at some point.
Well I can just not load the genes that lack Notes. I can change the validation to a warning rather than an error if it's only a few genes that lack Note (rather than all of them or a large fraction of them).
Let's say up to 100 genes will be allowed to lack Note and will not be loaded into the mine, but will spit out a warning saying how many lack Note; more than 100 will produce an error which aborts the load.
I guess I'm not clear on why a gene without a Note couldn't be loaded into the mine if we thought it was appropriate to have them as non-descript; I thought the validation was basically just to catch situations where we forgot to do our functional annotation at all. I like the idea of having a check, but I don't think we ought to be dropping genes without any review whatsoever.
That's fine, if that's what you want. It's a question of what we call "good" versus "bad" data. Maybe all our data is "bad" so we don't have to worry about it. :) I'll not skip genes that lack Note and I'll allow up to 100 Note-free genes to be loaded. Above that I think we need to ask ourselves why we can't functionally annotate so many supposed genes.
Sounds good. Fun fact: the NCBI annotation of arahy.Tifrunner.gnm1 has >25k genes of biotype snoRNA. Bad!
Unacceptable! In any case, I've implemented this now. Warning on 1-100 genes with no Note, error on >100.
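For reference, the rule as described (warn on 1-100 Note-free genes, error above 100) could be sketched roughly like this. This is only an illustration, not the validator's real code: `check_gene_notes`, `ValidationError` and `MAX_NOTE_FREE_GENES` are hypothetical names, and gene attributes are assumed to arrive as plain dicts.

```python
MAX_NOTE_FREE_GENES = 100  # threshold agreed on in this thread


class ValidationError(Exception):
    """Raised when too many genes lack a Note attribute (hypothetical)."""


def check_gene_notes(gene_attrs, limit=MAX_NOTE_FREE_GENES):
    """Return a list of warning strings; raise if more than `limit` genes lack Note.

    gene_attrs: iterable of dicts, one per gene record, holding its
    GFF attributes. Genes lacking Note are still loaded; they are only counted.
    """
    missing = sum(1 for attrs in gene_attrs if not attrs.get("Note"))
    if missing > limit:
        raise ValidationError(
            f"{missing} gene records lack Note (limit {limit}); aborting load"
        )
    if missing > 0:
        return [f"WARNING: {missing} gene records lack Note"]
    return []
```

Keeping the limit as a parameter means an annotation with a known set of non-coding genes could be validated with a higher threshold after review, rather than by editing the check.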
I fixed a validator bug recently that had allowed Note-free genes to be loaded. Now I'm hitting GFFs with genes missing notes. This example has four. They don't have gene family assignments, of course. Shall I just insert some standard Note like "Protein of unknown function" for these? Or do you @adf-ncgr want to revisit them as they turn up? I presume they are simply unknown putative proteins. But I want to continue validating Note since we sometimes get GFFs with a lot of missing Note attributes that should actually be fixed (since I'm the Datastore QA Guy).
(Reminder: I do not require that all gene records have GO terms. I require that at least ONE gene record has GO terms.)
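The GO requirement is a different kind of check from the Note one: it passes as soon as any single gene carries a GO term. A minimal sketch, assuming GO terms sit in the GFF3 `Ontology_term` attribute (the real validator may look elsewhere) and using the hypothetical name `any_gene_has_go_terms`:

```python
import re

# GO identifiers are "GO:" followed by seven digits
GO_PATTERN = re.compile(r"GO:\d{7}")


def any_gene_has_go_terms(gene_attrs):
    """True if at least one gene's Ontology_term attribute contains a GO ID.

    gene_attrs: iterable of dicts, one per gene record. The attribute
    name is an assumption, not confirmed by this thread.
    """
    return any(
        GO_PATTERN.search(attrs.get("Ontology_term", ""))
        for attrs in gene_attrs
    )
```

Because `any` stops at the first match, this only guards against the case where GO annotation was skipped entirely; it says nothing about how many genes are covered.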