Options to improve annotations and eliminate premature stop codons?

sheinasim commented 1 week ago

Hello!

I'm having a slight issue with the annotations generated by EGAPx. When I translate the CDS into protein sequences, I see a lot of premature stop codons. For this species I am supplying my own RNA-seq, RNA-seq from SRA (other people's sequences), and a protein sequence file from a previous assembly of the same species.

I'm trying to annotate the genome of a fly in the family Tephritidae. Next, I'm going to try adding protein sets from closely related species as well. Will the gene models improve with more evidence even if it is not from the same species?

Is there an option I can use to favor annotations and frames that minimize the number of pseudogenes or premature stop codons?

Thanks! Sheina

murphyte commented 6 days ago

Hi Sheina -- this is partially related to the warning we have about the current version not yet being feature complete or ready for submission. We are working on wrapping up v0.3, which will add functional annotation analysis, including logic to classify protein-coding vs pseudogene. That will likely convert a chunk of the CDS annotations into pseudogenes. I'm hoping we'll have that out in early October, depending on whether any issues arise in our ongoing pre-release testing.

> a protein sequence file from a previous assembly of the same species.

You generally don't need to do this. The default dipteran protein file should work well in combination with RNA-seq. Including proteins from an automated annotation of the same species can have some adverse effects: errors in any of those proteins get locked in, which is less likely to happen when aligning cross-species proteins. You'll also get more pseudogenes annotated when aligning same-species proteins than when relying only on cross-species proteins, so that might be elevating your pseudogene count. Our goal is to make EGAPx easy, with little need to customize anything, and the default sets are designed to cover most everything.

Note we do see varying rates of models with internal frameshifts or nonsense codons that EGAP winds up classifying as protein-coding with the designation LOW QUALITY PROTEIN. For dipterans, across 98 genomes currently in RefSeq, it looks like that averages ~250 genes per genome, with a median of ~150. Sometimes that rate can be elevated, particularly in genomes based on non-HiFi PacBio that haven't been polished, which can have a higher rate of indels that adversely affect gene models. But my suspicion is that what you're seeing is an effect of (a) not yet having the functional annotation logic, and (b) aligning same-species proteins.
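For anyone who wants to set their own count next to those figures, here is a minimal sketch. It assumes a RefSeq-style protein FASTA in which the designation appears as the literal text LOW QUALITY PROTEIN in the defline; that is an assumption about the file you have, not something current EGAPx output is known to provide. The function name and the example deflines are hypothetical.

```python
def count_low_quality(fasta_lines, tag="LOW QUALITY PROTEIN"):
    """Return (flagged, total) protein records, judged by defline text only.

    Assumption: the designation is present verbatim in the FASTA defline.
    """
    total = 0
    flagged = 0
    for line in fasta_lines:
        if line.startswith(">"):
            total += 1
            if tag in line:
                flagged += 1
    return flagged, total


# Hypothetical deflines, for illustration only.
example = [
    ">prot_1 LOW QUALITY PROTEIN: example protein A [Example species]",
    "MKT",
    ">prot_2 example protein B [Example species]",
    "MAA",
]
flagged, total = count_low_quality(example)
print(f"{flagged} of {total} proteins flagged")
```

For a real file, pass an open file handle instead of the list, for example `count_low_quality(open("proteins.faa"))`, where the file name is a placeholder.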

sheinasim commented 5 days ago

Hi Terence,

Thank you for your reply!

For now, I will use the output where I did not supply it with a protein fasta from a previous assembly.

Our genome was made from PacBio HiFi reads, so I wouldn't expect any CLR-related frameshifts or need for polishing. Thanks for that pseudogene number; I will compare it to what I found.

I'm using AGAT to translate the CDS to protein sequences. Is there another program you would recommend?

Thanks again, and I look forward to the new release!

Best wishes, Sheina

murphyte commented 5 days ago

Great, PacBio HiFi has made a world of difference.

We'll provide protein FASTA output as part of the v0.3 release.

AGAT works fine for now. It's just that the pseudogenes aren't labeled yet.
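Until the pseudogene labels are available, an independent tally of models with premature stops can be done with a few lines of Python. This is only a sketch under stated assumptions: the CDS sequences are already spliced, start in frame (phase 0), and use the standard genetic code. Partial models with a nonzero starting phase would show spurious stops. The function names and example sequences are hypothetical and are not part of EGAPx or AGAT.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(lines):
    """Parse FASTA lines into {record_id: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def translate(cds):
    """Translate from the first base; ambiguous codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def internal_stops(protein):
    """Count stop codons other than a single terminal one."""
    body = protein[:-1] if protein.endswith("*") else protein
    return body.count("*")


# Hypothetical CDS records, for illustration only.
example = [">model_ok", "ATGGCCTAA", ">model_bad", "ATGTAAGCCTGA"]
cds_by_id = read_fasta(example)
flagged = {k: internal_stops(translate(s)) for k, s in cds_by_id.items()}
print({k: n for k, n in flagged.items() if n > 0})
```

Models reported here are candidates for the pseudogene or low-quality classes discussed above, not confirmed pseudogenes.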

ncbi / egapx

Options to improve annotations and eliminate premature stop codons? #29