gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License
261 stars 33 forks source link

Refound Genes Exhibit In-Frame Stops and Frameshifts: Seeking User-Friendly Controls and Improved Understanding for Downstream Analysis #263

Closed NickJD closed 5 months ago

NickJD commented 8 months ago

This issue post is for prosperity and has already been reported.

I have been using Panaroo for a while now but I am currently trying to take the DNA/Protein sequences reported in the combined_DNA/protein_CDS.fasta and pan_genome_reference.fa files and do some further downstream analysis with them using tools such as CDHIT. When doing this I noticed quite a high number of the refound genes had inframe stops (*). I am aware of the basics of how the refound system works (using BLAST to find unannotated genes present in most genomes but not in some - possibly due to annotation/assembly error etc).

I know from reading some of the issues on GitHub that this has partly already been raised (see below image and https://github.com/gtonkinhill/panaroo/issues/65) but I have had a couple of runs which have resulted in some output I can’t quite figure out.

image

Below is an example of 5 refound genes which exhibit 2 different types of frameshift mutations. The first 4 have a removal of a ‘G’ and the last one has the addition of 2 ‘G’s. Both of these ‘mutations’ result in different gene lengths from the canonical genes (1,116 - canonical, 1,115 - missing ‘G’, 1,118 - additional ‘G’s). This is not a surprise as BLAST is gap/frame/codon agnostic.

image

This has resulted in the pan_genome_reference.fa file reporting the 24_refound_7533 sequence as the reference for this group/cluster but due to the frameshift, the amino acid sequence has multiple inframe stops and a very different sequence compared to the other sequences in this cluster.

image

As can be see in the above image that where the frameshifts occur, it completely breaks the continuity of the refound sequence in the amino acid space. To get these sequences, I got the IDs from the final_graph.gml file for the group/cluster, extracted the DNA sequences from ‘combined_DNA_CDS.fasta’ and converted to amino acid using table 11.

I ran (default parameters) panaroo on another set of genomes to see if this issue persisted - below are some examples.

>0_refound_0

IDLTFLWVLFRMSYKSSTLAWFLVNQKDV*X

>0_refound_1

MQNNRLTFMERLSERLTSVEAILKKLDPIESLLERIALLEKNIYX

Not every refound gene is like this but many are. Why do the refound genes end with an X? - Is this the uncertainty codon -> amino acid because of the frameshift making the sequence no longer modular 3? Most of the stops were like the first one *X at the end of the predicted proteins but a few had them scattered throughout the protein sequence.

To me, there is no obvious pattern.

>0_refound_22

MHKAPTFTSRGFVRKRKKGCILTHLQRIGTGHVKTASGYVSENARYRVVRREGCPLRCRCFKAKGNRTIELNHRLRKYRQNAKELLCSEEGLKHRGQSCIEPEAVFEQMKYNMNYKRFRHFGKDKVFMDFAFFAVAFNIKKMCAKMAKEGIDWLIKRFYELTVAIFRCCEHISKYCSLKX  <-- X at the end no *

>0_refound_23

MDALFQTQSNELSTTTQKEKLFKDAERQWNTLSTEKGAPGRKVDFEIYKSILYNLEIRKVYLDSKLSLEKFSSIV   <-- no X at the end

>0_refound_24

CVGKMPDMEGLRKETVL*SVMDL*WRKCSRRNLWG*SLLLRIKSERTIVATGEYTQEVRYYVTSLDNTRPEKIASAIRQHWSIGNNLHWQLDVTFREDYSKKVKNAAGNFSVATKMALTTLKNEKTTKGSMNLKRLKAGWDENYLSQLLQDNNF*X  <-- X at the end with an , no internal X's next to the *

I have a few concerns regarding this. Not only do these frameshifted/pseudogenes inflate the number of core/soft-core etc gene families, but I do not see this being reported to the user.

panaroo -i GC*/GC*.gff3 --clean-mode strict -a core --aligner clustal --core_threshold 0.98 --remove-invalid-genes -t 30 -o xxx_Panaroo>

Core genes  (99% <= strains <= 100%)    2501

Soft core genes (95% <= strains < 99%)  143

Shell genes (15% <= strains < 95%)  2887

Cloud genes (0% <= strains < 15%)   7414

Total genes (0% <= strains <= 100%) 12945

 panaroo -i GC*/GC*.gff3 --clean-mode strict -a core --aligner clustal --core_threshold 0.98 --remove-invalid-genes --search_radius 0 -t 30 -o xxx_Panaroo2>

Core genes  (99% <= strains <= 100%)    2366

Soft core genes (95% <= strains < 99%)  213

Shell genes (15% <= strains < 95%)  2954

Cloud genes (0% <= strains < 15%)   7639

Total genes (0% <= strains <= 100%) 13172

– I think there should be a way to formally and completely turn off the Refound gene option for those who only want to rely on the annotations they provide. The current options below do not allow for this and still return Refound genes.

Refind:

 --search_radius SEARCH_RADIUS

            the distance in nucleotides surronding (**<-Spelling Error**) the neighbour of an accessory gene in which to search for it

 --refind_prop_match REFIND_PROP_MATCH

            the proportion of an accessory gene that must be found in order to consider it a match

Lastly, those who use these gene sequences, from each group/cluster for a phylogenetic analysis (or any other analysis) would potentially get very different results if they were to use the DNA/Amino Acid sequences without being aware. Additionally, I think refound genes with insertions are likely to be chosen as the group representatives as they are the longest of the sequences and often the longest are chosen by clustering algorithms.

Other points:

gtonkinhill commented 7 months ago

This has now been addressed in v1.4.*

fwhelan commented 7 months ago

Hi gtonkinhill,

Thanks for including this in v1.4.0. Could I please clarify, is this addressed by including the flag --search_radius 0 to avoid the inclusion of re-found genes?

Thanks!

gtonkinhill commented 7 months ago

Hi Fiona,

It's enabled using the new option --refind_strict. Thanks for seeking clarification. I will update the documentation to include this option.

fwhelan commented 7 months ago

Thank you!

fwhelan commented 7 months ago

Hi gtonkinhill, I gave this a try but there are still refound genes included in my output; have I missed something?

panaroo -t 20 -i *.gff -o pan_out/ --clean-mode strict --refind_strict spotted in stdout: ... Number of searches to perform: 2379297 Searching... 209it [07:15, 2.08s/it] translating hits... removing by consensus... Updating output... Number of refound genes: 2341 collapse gene families with refound genes... Processing depth: 1 ... grep "refound" pan_out/gene_presence_absence.csv | wc -l 1376 panaroo --version panaroo 1.4.2

Thanks for your help!

gtonkinhill commented 6 months ago

Hi Fiona,

The --refind_strict option just excludes refound genes that are not 'valid'. That is, it excludes genes with incorrect lengths or premature stop codons which may be the result of misassemblies.

As you're the second person who is interested in having the option to completely turn off the refinding step. I will look at adding this to the next release.

fwhelan commented 6 months ago

Hi Greg, Ah I see- that makes sense! Thanks for considering adding this as an option.

gtonkinhill commented 5 months ago

The latest release (v1.5.0) includes an option to turn off re-finding completely. Thanks for the suggestion!