ncbi / egapx

Eukaryotic Genome Annotation Pipeline-External caller scripts and documentation

Missing single exon gene predictions #28

Open abs-yy opened 2 months ago

abs-yy commented 2 months ago

Hi, I'm looking to predict genes for my insect genome. For most genes the prediction works well, but EGAPx is missing predictions for some species-specific, single-exon genes (experimentally validated and also have RNA-Seq read support). Other software such as Braker does predict them. Is there some kind of filtering that removes single-exon genes? I see remove_single_exon_est_models in default_task_params.yaml; might this be it?

murphyte commented 1 month ago

> species-specific, single-exon genes (experimentally validated and also have RNA-Seq read support)

Do you have protein evidence for these? RNA-seq will support both protein-coding and non-coding transcripts, and single-exon lncRNAs are not uncommon. EGAPx by default omits single-exon lncRNAs because there's often uncertainty about the correct strand for single-exon genes.

Single-exon predictions with protein alignment support should be retained, but that can depend on protein length and on whether they have supporting proteins in the evidence set. There's also some logic that we haven't added yet to help identify these. I'm hoping we'll get at least a preliminary version of that added with the next release (coming soon, maybe a month, but saying that may jinx it).

Turning off the remove_single_exon_est_models parameter will indeed keep more of these, but can pick up a LOT of noise from transposons. There's some filtering for those also coming with the next release, along with orthology and naming, although that logic is not as robust as I'd like in the insects (and tends to vary with group). So turning off remove_single_exon_est_models is a "use with extreme caution" option.
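To gauge how much a change to remove_single_exon_est_models affects the output, one option is to list the single-exon transcripts in the resulting annotation and compare runs. The following is a minimal sketch, assuming the annotation is standard GFF3 with `exon` features that carry a `Parent` attribute; the function name and the example records are hypothetical and not part of EGAPx.

```python
from collections import defaultdict


def single_exon_transcripts(gff3_lines):
    """Return sorted IDs of transcripts that have exactly one exon feature.

    Assumes standard 9-column GFF3 where each exon names its transcript(s)
    in the Parent attribute.
    """
    exon_counts = defaultdict(int)
    for line in gff3_lines:
        # Skip blank lines, comments, and directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        # Parse "key=value;key=value" attributes from column 9.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exon_counts[parent] += 1
    return sorted(tid for tid, n in exon_counts.items() if n == 1)


# Hypothetical records: rna-1 has two exons, rna-2 has one.
example = [
    "##gff-version 3",
    "chr1\texample\texon\t100\t200\t.\t+\t.\tID=exon-1;Parent=rna-1",
    "chr1\texample\texon\t300\t400\t.\t+\t.\tID=exon-2;Parent=rna-1",
    "chr1\texample\texon\t900\t1500\t.\t-\t.\tID=exon-3;Parent=rna-2",
]
single_exon_ids = single_exon_transcripts(example)
print(single_exon_ids)
```

For a real run, pass an open file handle instead of the example list. Comparing the ID lists from runs with and without the parameter shows how many extra single-exon models (and potentially how much transposon noise) the change introduces.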

abs-yy commented 1 month ago

Thanks for the reply!

> Do you have protein evidence for these?

Yes, we do have protein evidence (detection with proteomics data) for the genes.

I'll be looking forward to the update!

murphyte commented 1 month ago

> detection with proteomics data

Cool, great to see real evidence. If you have any that you're confident are real and supported, but that have sparse BLASTp support in the current RefSeq dataset, it can be helpful to let us know (write in to the help desk at https://support.nlm.nih.gov/support/create-case/). If there are some homologs, but they're sparse, you can point out a few of those protein accessions. If there aren't any, but you can find some transcripts (lncRNAs) via tBLASTn against the RefSeq RNA database (restricting to insects gives faster results), that's also helpful. Or, in a pinch, send a protein sequence and we can go searching in whole genomes.

We can use that to curate the protein in a few species, which in turn can help provide evidence for automatic annotation in others. What we consider to be real proteins are rarely species specific, and for example pretty much everything in the current human annotation can be found in other species (including non-primate species in nearly all cases). Proteomics will also find other peptides that aren't conserved, but their origin isn't clear (e.g. some ancillary translation off lncRNAs that's inconsistent and doesn't generate stable proteins), and we generally don't consider them to be protein-coding genes without evidence of function or conservation (selective constraint as evidence of function).

The tricky ones are the peptide hormones, which tend to be small with limited regions of conservation. That's probably one of the more challenging areas of annotation.
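The tBLASTn search against RefSeq RNA mentioned above could be run from the command line with BLAST+. The following is a rough sketch that only assembles the command; it assumes BLAST+ is installed, uses the remote `refseq_rna` database with an Entrez organism filter to restrict hits to insects, and the query file name and E-value cutoff are hypothetical placeholders.

```python
import shlex


def build_tblastn_command(query_fasta, entrez_query="Insecta[Organism]", evalue="1e-5"):
    """Assemble a BLAST+ tblastn command for a remote RefSeq RNA search.

    query_fasta: path to a protein FASTA file (hypothetical name below).
    entrez_query: Entrez filter applied on the NCBI side (needs -remote).
    """
    return [
        "tblastn",
        "-query", query_fasta,
        "-db", "refseq_rna",       # NCBI RefSeq RNA database
        "-remote",                 # run the search on NCBI servers
        "-entrez_query", entrez_query,
        "-evalue", evalue,
        "-outfmt", "6",            # tabular output
    ]


cmd = build_tblastn_command("candidate_proteins.faa")
# Print a shell-safe version of the command for review before running it.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the query file exists. Accessions from any hits are the kind of detail that could be included in a help desk report.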