ian-small / chloe

Feature request - account for frameshifts #12

Open chrisjackson-pellicle opened 4 months ago

chrisjackson-pellicle commented 4 months ago

Hi Ian,

Thanks very much for this program - overall it's done a great job of our plastid genome (and it's so fast!).

I have a feature request, if possible. I noticed that for one of our genes, the plastid genome contig contains a frameshift that introduces a premature stop codon about halfway through, and the output GFF3 file only annotates this 5' half.

The truncation in the recorded length of this gene occurs when the setlongestORF! function is run for this feature. Would it be possible to extend Chloe to allow for these scenarios, perhaps by optionally allowing multiple non-overlapping ORFs within a given feature, with a corresponding note in field 9 of the GFF3?

Cheers,

Chris

chrisjackson-pellicle commented 4 months ago

Some additional information:

Read mapping suggests that our plDNA assembly is in fact correct. So, rather than a frameshift causing the issue with this gene (ccsA), it's likely a small 19 bp intron, as also seen in the ccsA annotation for the Nepenthes khasiana plDNA, https://www.ncbi.nlm.nih.gov/nuccore/NC_051455.1 (our plDNA is also from a Nepenthes species).

I see that Chloe expects a single exon for ccsA based on the gold-standard reference plDNAs, and hence only a single ORF is searched for in the corresponding feature. Perhaps optionally allowing multiple non-overlapping ORFs within a given feature as suggested above would also allow a more complete annotation in cases where exon number expectations are not met?
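To make the suggestion concrete, here is a minimal sketch of what "multiple non-overlapping ORFs within a given feature" could mean. It is written in Python for illustration only and is not Chloe's Julia code or the behaviour of setlongestORF!. The function names, the `min_codons` cutoff and the toy sequence are all assumptions; it treats an ORF simply as a stop-free run of codons on the forward strand and picks runs greedily, longest first.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons

def orfs_in_frame(seq, frame, min_codons=10):
    """Yield (start, end) for stop-free codon runs in one forward frame.

    Coordinates are 0-based, end-exclusive, and exclude the stop codon.
    """
    run_start = frame
    last = frame + ((len(seq) - frame) // 3) * 3  # end of last complete codon
    for i in range(frame, last, 3):
        if seq[i:i + 3] in STOPS:
            if (i - run_start) // 3 >= min_codons:
                yield (run_start, i)
            run_start = i + 3
    if (last - run_start) // 3 >= min_codons:
        yield (run_start, last)

def nonoverlapping_orfs(seq, min_codons=10):
    """Greedily choose the longest stop-free runs that do not overlap."""
    seq = seq.upper()
    candidates = [o for f in range(3) for o in orfs_in_frame(seq, f, min_codons)]
    candidates.sort(key=lambda o: o[1] - o[0], reverse=True)
    chosen = []
    for start, end in candidates:
        if all(end <= c_start or start >= c_end for c_start, c_end in chosen):
            chosen.append((start, end))
    return sorted(chosen)

# Toy feature (made-up sequence): two coding halves separated by a 19 nt
# spacer that carries stop codons in every frame, so the second half is
# shifted by one base relative to the first.
unit = "GTAACTAAC"               # open in frame 0, stops in frames 1 and 2
spacer = "TAACTAACTAACTAACTAA"   # 19 nt
feature = "ATG" + unit * 5 + spacer + unit * 5 + "TAA"

print(nonoverlapping_orfs(feature))  # [(0, 48), (67, 112)]
```

On the toy feature the two halves come back as separate segments in different frames (start 0 is frame 0, start 67 is frame 1), which is the kind of information that could go into a note in field 9 of the GFF3.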

Also, a general caveat and apology if my understanding of Chloe's process isn't correct - I'm still getting my head around some of the code!

Cheers,

Chris

ian-small commented 4 months ago

Hi Chris,

Sorry to not be more responsive, I’m just about to head overseas for 3 weeks and am scrambling to finish stuff off… You’re quite right that in its current guise, Chloe does not easily detect or annotate premature stops; it could add a warning if the ‘annotation stack’ extends well beyond the first in-frame stop, but I don’t think it checks for that at the moment. It’s much more of an issue for hornworts, lycophytes and ferns where U-to-C editing allows many genes to include premature stops that will be removed post-transcriptionally. Handling such complications has been shelved for the time being with a number of other things that we don’t really need to deal with for angiosperms. We have frozen Chloe’s features at the moment to (at last) prepare a publication, which will focus on annotating angiosperm chloroplast genomes. Once that’s done, we’ll return to the code and start adding in what’s needed to deal with other chloroplast genomes.
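For illustration, the kind of check mentioned above (warn when the annotation extends well beyond the first in-frame stop) might look something like the sketch below. It is Python rather than Chloe's Julia, and everything in it is an assumption made for the example: the function names, the `expected_end` argument standing in for where the annotation stack ends, and the 30 nt tolerance.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_inframe_stop(seq, start=0):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    seq = seq.upper()
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

def premature_stop_warning(seq, start, expected_end, tolerance=30):
    """Return a warning string if the expected feature end lies more than
    `tolerance` nt past the first in-frame stop, otherwise None.

    `expected_end` is end-exclusive and is a hypothetical stand-in for the
    end of the region supported by the reference annotations.
    """
    stop = first_inframe_stop(seq, start)
    if stop is None:
        return None
    overhang = expected_end - (stop + 3)
    if overhang > tolerance:
        return (f"first in-frame stop at {stop}, but annotation support "
                f"extends {overhang} nt further: possible premature stop")
    return None

# Made-up sequences: one with an early stop, one read through to the end.
truncated = "ATG" + "GCT" * 5 + "TAA" + "GCT" * 20 + "TAA"
intact = "ATG" + "GCT" * 25 + "TAA"
print(premature_stop_warning(truncated, 0, len(truncated)))
print(premature_stop_warning(intact, 0, len(intact)))
```

The first call returns a warning and the second returns None, since the only stop in the intact sequence sits at the expected end.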

In your particular case, a 19 nt intron sounds extremely implausible; I know of no mechanism in chloroplasts that could splice that out. If you have any RNA-seq data you can check, of course. It sounds much more likely to me that ccsA is a pseudogene in Nepenthes. A frameshift would also be extremely unusual for chloroplasts. In general, our aim is not to annotate pseudogenes with Chloe. We think that in most cases, incorrectly annotating a pseudogene as a functional gene (false positive) is far worse than not annotating it at all (arguably not even a false negative). If Chloe is annotating your potentially truncated ccsA with no warnings, then that’s Chloe’s mistake, I feel.

Cheers and best wishes,

Ian

chrisjackson-pellicle commented 3 months ago

Hi Ian,

Thanks for the reply. Ah, I hadn't stopped to consider the biology of a 19 bp plDNA 'intron' - oops. And yes, I do not see any reads spliced across the 19 bp 'intron' when I map our RNAseq data. So, a likely pseudogene it is!

For the moment, I've forked Chloe and added a warning if any predicted gene is less than 80% (default, can be changed with --short_gene_warning_threshold) of the combined non-intron median_length values.
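Roughly, the logic of the check is as follows. This is a simplified Python sketch rather than the actual code in the fork (Chloe is written in Julia); the function name, arguments and example lengths are made up for illustration, and only the 80% default mirrors the option described above.

```python
def short_gene_warning(gene, predicted_exon_lengths, median_exon_lengths,
                       threshold=0.8):
    """Return a warning string if the predicted gene is shorter than
    `threshold` times the summed median lengths of its non-intron
    features, otherwise None."""
    predicted = sum(predicted_exon_lengths)
    expected = sum(median_exon_lengths)
    if expected > 0 and predicted < threshold * expected:
        return (f"{gene}: predicted length {predicted} nt is "
                f"{predicted / expected:.0%} of the expected {expected} nt")
    return None

# Hypothetical numbers: a gene annotated at half its expected length is
# flagged, one at about 94% of the expected length is not.
print(short_gene_warning("ccsA", [480], [960]))
print(short_gene_warning("ccsA", [900], [960]))
```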

I hope the publication goes smoothly - it's a great tool!

Cheers,

Chris