Fragmented gene annotations

hoelzer commented 4 years ago

Hi!

Lately, I experienced fragmented gene annotations in terms of two (or more) ORFs that are in close proximity and are annotated with homology to the same gene. In such cases, Prokka ads an _n suffix to the gene ID.

For example, a splitted genX can then be found as genX_1 and genX_2 in the GFF.

Of course, this can happen due to low assembly quality and errors in the genome that cause frameshifts, etc.. But I also saw this for reference genomes from NCBI (but yeah, they can be also of low quality).

Sometimes the ORFs are even overlapping and from alignments it's quite obvious that they belong together and together would represent the full gene.

I will give a specific example here:

Example

Chlamydia avium 10DC88 type strain

Prokka version used: 1.14.5 Cav_10DC88.log

FASTA: Cav_10DC88.fasta.gz
PROKKA GFF: Cav_10DC88.gff.gz
gene we are looking at now: lpxA

(base) ➜  data git:(dev) ✗ grep lpxA Cav_10DC88.gff 
NZ_CP006571.1   Prodigal:002006 CDS 971572  972096  .   +   0   ID=DEKMBENF_00905;eC_number=2.3.1.129;Name=lpxA_1;db_xref=COG:COG1043;gene=lpxA_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q9PIM1;locus_tag=DEKMBENF_00905;product=Acyl-[acyl-carrier-protein]--UDP-N-acetylglucosamine O-acyltransferase
NZ_CP006571.1   Prodigal:002006 CDS 972065  972412  .   +   0   ID=DEKMBENF_00906;eC_number=2.3.1.129;Name=lpxA_2;db_xref=COG:COG1043;gene=lpxA_2;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A722;locus_tag=DEKMBENF_00906;product=Acyl-[acyl-carrier-protein]--UDP-N-acetylglucosamine O-acyltransferase

As you can see, the two ORFs overlap:

lpxA_1 971572   972096 +
lpxA_2 972065   972412 +

Together they span a region of 840 nt

When checking other Chlamydia I get complete hits with a similar length from Prokka (see alignments below).

Here is how (mafft-linsi) alignments with IpxA from other Chlamydia look like: Screenshot from 2020-06-18 14-04-18

And here when I simply cat IpxA_1 and IpxA_2 (which, of course, is not correct because the two ORFs overlap in the genome. But for illustration) Screenshot from 2020-06-18 14-04-33

I have the feeling that the split of IpxA in C. avium is just not correct. And there are also (few) other cases were genes that are in close proximity are split up (sometimes overlapping, sometimes with a few nucleotides in between).

Could this be also a problem with the initial ORF-calling with Prodigal?

I run Prokka with default parameters.

I think a solution to this could be to:

check after the blast step if two (or more) queries (e.g. IpxA_1 and IpxA_2) align to the same target while covering mostly different parts
if so, check if the two (or more) queries are in close proximity and can be glued together while maintaining the ORF structure
if so: glue them

Juke34 commented 4 years ago

When teaching genome annotation this is a recurring question. Christian Brandt pointed me to this issue recently, and I decided to write a first tool to fix the problem. You can find it within AGAT except it is not yet released in an official version, you need to install manually AGAT from the feature/fixorf branch. The approach does not run the blast, and do not analyse blast results. This is something that might be implemented if necessary later. I implemented in a different way to simplify the analysis.

For now it follow those steps:

search for cases where contiguous genes have the same name (e.g. lpxA_1 lpxA_2).
If so we look at the size of the protein of each of those genes (lpxA_1 AA=175 ; lpxA_2 AA=116)
We look at the size of the protein used to infer the name (lpxA_1 inferred from Q9PIM1 = 263 AA ; lpxA_2 inferred from P0A722 = 262 AA ) and I do an average length of the expected protein: 262AA. I add 20% to the length to be sure to include border cases: 282AA.
I check the length of the two proteins (lpxA_1 and ; lpxA_2) together devoid of the overlap if any (here 270 AA). If it is shorter than the expected protein length (282 AA) I continue.
I do not merge the genes yet. I have to check that removing the stop of the first gene and changing the frame at its end (by adding extra N or NN) will allow to continue the ORF until the end of the gene2.
If I can have a long ORF then I merge the genes and modify the output gff and fasta to reflect the changes.

The command is the following:
agat_sp_prokka_fix_fragmented_gene_annotations.pl --gff Cav_10DC88.gff --fasta Cav_10DC88.fasta --db prokka/brokenORF/prokka_bacteria_sprot.fa -o result_folder

For this case it find 16 genes that are then merged into 8 (so 8 dual cases).

The tool should be able to deal with multi fragmentation, not only gene broken into two parts.

nigyta commented 4 years ago

Fragmentation of the IpxA gene might be correct in this case.

Here you can see the original annotation of NZ_CP006571.1, https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP006571.1 It is annotated as peudogene because of the frameshift within the ORF

    CDS             971572..972412
                     /gene="lpxA"
                     /locus_tag="RT28_RS04295"
                     /old_locus_tag="M832_08920"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_011096823.1"
                     /note="frameshifted; Derived by automated computational
                     analysis using gene prediction method: Protein Homology."
                     /pseudo
                     /codon_start=1
                     /transl_table=11
                     /product="acyl-ACP--UDP-N-acetylglucosamine
                     O-acyltransferase"

Therefore, the result from Prokka or Prodigal may not be wrong in this case.

In my opinion, it is more suitable to concatenate the 2 fragmented CDSs with a flag of "/pseudo" as shown above, but modifying the FASTA is not required unless it is a sequencing error ( although it is difficult to tell it is a real mutation or sequencing error).

Juke34 commented 4 years ago

Thank you @nigyta for the feedback. I would be more keen on to believe in a sequencing error than a pseudogene, especially if the assembly is made from long read sequencing (I don't know here). Except if the pseudogene is really new (and we are lucky), other mutations should have been accumulated after this premature stop codon (Here the second part of the gene is intact).

What I can do is to add an extra parameter, and the user can choose if the merged gene has to be annotated as pseudogene, or if we fix the sequence and the annotation to get a proper gene.

hoelzer commented 4 years ago

@Juke34 thanks a lot for looking into this and for the script! I think that's a really good starting point.

In this example, I just used a publicly available Chlamydia genome and I suggest it is just based on short reads. However, we also experienced such cases as described above for long-read/hybrid assemblies.

For sure, the automated differentiation between pseudogenes and fragmented annotations due to sequencing/assembly errors will remain difficult.

I think what you already suggest would be good:

1) your script can function to identify such (potentially) fragmented gene annotations (FRAGS) and provide this information to the user (that's already possible) 2) via a flag FRAGS can be merged into proper genes, if possible according to open reading frame, ... (this could be the default behavior) 3) via a flag merged genes are marked as pseudogenes

Thanks a lot!

Juke34 commented 4 years ago

Thank you for your feedback @hoelzer. I have updated the script to implement your suggestions. There is now the --frags and --pseudo option. By default it just print a report. I also updated the help.

hoelzer commented 3 years ago

@Juke34 I am just finding some time to coming back to this issue and testing your script (once again, thanks a lot!).

One question, your command is:

agat_sp_prokka_fix_fragmented_gene_annotations.pl --gff Cav_10DC88.gff --fasta Cav_10DC88.fasta --db prokka/brokenORF/prokka_bacteria_sprot.fa -o result_folder

but where does the --db come from? It's the database used by Prokka during annotation?

EDIT is it the file https://github.com/tseemann/prokka/blob/master/db/kingdom/Bacteria/sprot ?

Juke34 commented 3 years ago

yes it is Did I forgot to mention it in the help? Edit: No I menhtion it ;) --db Input Uniprot fasta file used by prokka. Mandatory.

hoelzer commented 3 years ago

yes it is Did I forgot to mention it in the help? Edit: No I menhtion it ;) --db Input Uniprot fasta file used by prokka. Mandatory.

thanks! yes, you are right, next time I first screen the code/help more deeply ;)

tseemann / prokka

Fragmented gene annotations #502

Example