gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License
264 stars 33 forks source link

panaroo merging issues #192

Closed pamelacamejom closed 2 years ago

pamelacamejom commented 2 years ago

Hi, I a using the "merge" option of panaroo to combine two pan-genomes created from two datasets. However, when I see the results, some sequences from both pan-genomes that are >98% identical (same length), are assigned to two different clusters after merging (one cluster with samples from one pangenome and a second cluster with samples from the other). Also, the number of genes in the pangenome fasta file does not match the number in the gene_presence_absence.csv file. In these cases, only one of the two clusters is included in the fasta file.

gtonkinhill commented 2 years ago

Hi,

Both of these issues are likely to be caused by paralogous genes.

Please let me know if you have any other questions or if this does not explain what you are observing.

pamelacamejom commented 2 years ago

Thanks for your answer, but maybe I did not explain myself correctly. There are some genes, present only once in each genome (there is no paralog), whose gene sequences from all samples cluster together within each pangenome. When I merge the pan-genomes, I would expect these two clusters to be merged, resulting in one cluster with sequences from all the samples. However, I end up with two clusters, one containing sequences from pangenome 1, and the other containing sequences from pan genome 2. This is an example of how the gene_presence_absence.csv file looks like after merging:

tilS_2~~~tilS~~~tilS_1,tilS_2;tilS;tilS_1,tRNA(Ile)-lysidine synthase,LIMBAPGE_03845,BELKIDDK_02193,,

These samples are from the first pangenome

tilS,tilS,tRNA(Ile)-lysidine synthase,,,EPAHOCBL_01337,DFDOOAFK_02702

These samples are from the second pangenome

When I check the sequences, they all have > 98% identity (between pan genomes too).

pamelacamejom commented 2 years ago

I tried running panaroo´s original command to create one pangenome with all these samples (from pangenome 1 and 2). When I do that, these genes that I mentioned get clustered together. I believe there are issues with the "merging" step of panaroo.

gtonkinhill commented 2 years ago

Thanks for getting back to me. If it is possible, it would be very helpful if you could provide a small reproducible example. This would help a lot in diagnosing the issue.

pamelacamejom commented 2 years ago

Here are the links to the two pan genomes that I am trying to merge: https://drive.google.com/open?id=1DX3TJrVYmZ-cAh3OFCjBbI4M9dkx4nBO&authuser=pcamejo%40pht.cl&usp=drive_fs https://drive.google.com/open?id=1DWx90c8inSx7jItR6UvVMGWoWx8wwfk-&authuser=pcamejo%40pht.cl&usp=drive_fs

After merging, I usually check the cluster tilS~tilS_1~tilS_2 from pangenome 1 and tilS from pangenome 2. This cluster should be merged.

pamelacamejom commented 2 years ago

Hi, I wonder if you had the chance to take a look at these files.

gtonkinhill commented 2 years ago

Hi, sorry for the delay. I have not had a chance yet but will try my best to take a look this week.

gtonkinhill commented 2 years ago

Hi,

I have taken a look at your graphs. It turns out this issue was caused by how the merge script handles paralogous genes.

In the main Panaroo algorithm, it is guaranteed that a paralogous gene cluster will appear at most N times where N it the maximum number of times it appears within a single genome.

As the merge script was mainly designed for merging large diverse graphs it does not include this guarantee. In practice this is rarely a problem as the merge script will still combine paralogous clusters if they appear in similar regions of the genome graph. However, if a gene appears in very different regions the script can fail to merge these paralogous clusters.

In your case, although the two clusters you identified did not include the same genome, there was a third cluster that did which led to these genes being classified as paralogous. This gene had the centroid id LIMBAPGE_03845.

I have copied a screen shot of the subset of the merged graph below with the three clusters highlighted. As you can see they all appear in different areas of the graph which is why Panaroo has not merged them.

Screen Shot 2022-09-12 at 9 37 52 AM

A potential solution to this would be to run the merge script and combined paralogs using the --merge_paralogs flag.

I hope this helps and sorry for taking a little long to get back to you. If you would like me to send through the merged graph I generated please just let me know.

pamelacamejom commented 2 years ago

Hi Gerry,

Thanks for looking into this.

I checked this gene´s neighbourhood in both set of samples, and it seems to be the same (see example at the end). In our case, we are not interested on merging paralogs, since we already set up some features to use for a ML model and we are looking for the same features in this second set of samples. If you have any other recommendation about how to do this merging step I would really appreciate your help.

Pamela.

Sample from pangenome1:

21 Prodigal:002006 CDS 7811 8356 . - 0 ID=ANFILNGA_03482;Name=yaeQ;db_xref=COG:COG4681;gene=yaeQ;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0AA97;locus_tag=ANFILNGA_03482;product=putative protein YaeQ 21 Prodigal:002006 CDS 8554 8754 . + 0 ID=ANFILNGA_03483;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A8K7;locus_tag=ANFILNGA_03483;note=UPF0253 protein YaeP;product=hypothetical protein 21 Prodigal:002006 CDS 8741 9001 . + 0 ID=ANFILNGA_03484;Name=rof;db_xref=COG:COG4568;gene=rof;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0AFW8;locus_tag=ANFILNGA_03484;product=Protein rof 21 Prodigal:002006 CDS 9085 10377 . - 0 ID=ANFILNGA_03485;eC_number=6.3.4.19;Name=tilS;db_xref=COG:COG0037;gene=tilS;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P52097;locus_tag=ANFILNGA_03485;product=tRNA(Ile)-lysidine synthase 21 Prodigal:002006 CDS 10440 10829 . - 0 ID=ANFILNGA_03486;inference=ab initio prediction:Prodigal:002006;locus_tag=ANFILNGA_03486;product=hypothetical protein 21 Prodigal:002006 CDS 10885 13026 . - 0 ID=ANFILNGA_03487;eC_number=4.1.1.18;Name=ldcC;db_xref=COG:COG1982;gene=ldcC;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P52095;locus_tag=ANFILNGA_03487;product=Constitutive lysine decarboxylase 21 Prodigal:002006 CDS 13102 14865 . - 0 ID=ANFILNGA_03488;inference=ab initio prediction:Prodigal:002006;locus_tag=ANFILNGA_03488;product=hypothetical protein

Sample from pangenome2:

5 Prodigal:002006 CDS 7635 8180 . - 0 ID=EPAHOCBL_01334;Name=yaeQ;db_xref=COG:COG4681;gene=yaeQ;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0AA97;locus_tag=EPAHOCBL_01334;product=putative protein YaeQ 5 Prodigal:002006 CDS 8378 8578 . + 0 ID=EPAHOCBL_01335;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0A8K7;locus_tag=EPAHOCBL_01335;note=UPF0253 protein YaeP;product=hypothetical protein 5 Prodigal:002006 CDS 8565 8825 . + 0 ID=EPAHOCBL_01336;Name=rof;db_xref=COG:COG4568;gene=rof;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P0AFW8;locus_tag=EPAHOCBL_01336;product=Protein rof 5 Prodigal:002006 CDS 8909 10201 . - 0 ID=EPAHOCBL_01337;eC_number=6.3.4.19;Name=tilS;db_xref=COG:COG0037;gene=tilS;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P52097;locus_tag=EPAHOCBL_01337;product=tRNA(Ile)-lysidine synthase 5 Prodigal:002006 CDS 10265 10654 . - 0 ID=EPAHOCBL_01338;inference=ab initio prediction:Prodigal:002006;locus_tag=EPAHOCBL_01338;product=hypothetical protein 5 Prodigal:002006 CDS 10710 12851 . - 0 ID=EPAHOCBL_01339;eC_number=4.1.1.18;Name=ldcC;db_xref=COG:COG1982;gene=ldcC;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P52095;locus_tag=EPAHOCBL_01339;product=Constitutive lysine decarboxylase 5 Prodigal:002006 CDS 12927 14690 . - 0 ID=EPAHOCBL_01340;inference=ab initio prediction:Prodigal:002006;locus_tag=EPAHOCBL_01340;product=hypothetical protein

On 11-09-2022, at 20:41, Gerry Tonkin-Hill @.***> wrote:

Hi,

I have taken a look at your graphs. It turns out this issue was caused by how the merge script handles paralogous genes.

In the main Panaroo algorithm, it is guaranteed that a paralogous gene cluster will appear at most N times where N it the maximum number of times it appears within a single genome.

As the merge script was mainly designed for merging large diverse graphs it does not include this guarantee. In practice this is rarely a problem as the merge script will still combine paralogous clusters if they appear in similar regions of the genome graph. However, if a gene appears in very different regions the script can fail to merge these paralogous clusters.

In your case, although the two clusters you identified did not include the same genome, there was a third cluster that did which led to these genes being classified as paralogous. This gene had the centroid id LIMBAPGE_03845.

I have copied a screen shot of the subset of the merged graph below with the three clusters highlighted. As you can see they all appear in different areas of the graph which is why Panaroo has not merged them.

https://user-images.githubusercontent.com/10625107/189553756-cdbe2ec0-2725-41e4-9877-90161d4fa865.png A potential solution to this would be to run the merge script and combined paralogs using the --merge_paralogs flag.

I hope this helps and sorry for taking a little long to get back to you.

— Reply to this email directly, view it on GitHub https://github.com/gtonkinhill/panaroo/issues/192#issuecomment-1243069476, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWQF7OEDQCHVBS63ZSE6XYTV5ZUZLANCNFSM57TPFPBA. You are receiving this because you authored the thread.

gtonkinhill commented 2 years ago

Hi Pamela,

The problem I described is extended beyond the gene you mentioned as there is a long stretch of paralogous genes that occur in the GCF_001447115.1_ASM144711v1_genomic genome. I have copied a small manually aligned section of these problematic genes below.

To define 'nearby' genes Panaroo uses single copy anchor genes. It starts at these genes and iteratively merges nearby homologous clusters as long as they do not involve paralogs from the same genome. This very long stretch of paralogs means that the iteration does not reach this area of the genome graph.

In the main Panaroo program these regions are subsequently resolved in a computationally expensive follow on step.

As this issue appears to be quite rare in your dataset, one solution would be to simply remove the GCF_001447115.1_ASM144711v1_genomic genome.

I am not sure what you are using your ML model for but in general I have found collapsing paralogs to be preferable in these kinds of applications.

Screen Shot 2022-09-13 at 12 07 19 AM

pamelacamejom commented 2 years ago

Thanks Gerry,

I am still confuse about the genes you’re talking about.

I think in the graph you sent me, the genes “tilS-tilS_1-tilS_2” (from pangenome 1) and “tilS” (from pangenome 2) are the ones that should be merged (they also share the same gene context as I showed you in my previous email). The cluster group_21534 is just a paralog of “tilS-tilS_1-tilS_2”, which should not be merged with this gene.

Pamela.

On 12-09-2022, at 11:08, Gerry Tonkin-Hill @. @.>> wrote:

Hi Pamela,

The problem I described is extended beyond the gene you mentioned as there is a long stretch of paralogous genes that occur in the GCF_001447115.1_ASM144711v1_genomic genome. I have copied a small manually aligned section of these problematic genes below.

To define 'nearby' genes Panaroo uses single copy anchor genes. It starts at these genes and iteratively merges nearby homologous clusters as long as they do not involve paralogs from the same genome. This very long stretch of paralogs means that the iteration does not reach this area of the genome graph.

In the main Panaroo program these regions are subsequently resolved in a computationally expensive follow on step.

As this issue appears to be quite rare in your dataset, one solution would be to simply remove the GCF_001447115.1_ASM144711v1_genomic genome.

I am not sure what you are using your ML model for but in general I have found collapsing paralogs to be preferable in these kinds of applications.

https://user-images.githubusercontent.com/10625107/189675106-6aa5a0d9-c29b-42de-81cf-87efba971c95.png — Reply to this email directly, view it on GitHub https://github.com/gtonkinhill/panaroo/issues/192#issuecomment-1243798138, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWQF7OE2XRIRCEEXVIHECF3V542PLANCNFSM57TPFPBA. You are receiving this because you authored the thread.

gtonkinhill commented 2 years ago

Hi Pamela,

Yes, in an ideal world the clusters would merge as you describe. However, as all three clusters are very similar they all get classified as paralogs. The very long track of paralogs in GCF_001447115.1_ASM144711v1_genomic then causes issues with the merge script as it does not include the more computationally intensive step to resolve these.

I will look at adding this as an option in the future but I would recommend that filtering out GCF_001447115.1_ASM144711v1_genomic might deal with you issue in the short term.

pamelacamejom commented 2 years ago

Sorry to bother you again. Removing that genome indeed fixed the problem with the tilS cluster. However, I still noticed other clusters which are not merged probably because they are paralogs (e.g. argR_2).

Are you thinking on incorporating that extra step in the merge function soon? If not, would you show me the specific functions that needs to be added?

Thanks.

Pamela.

On 15-09-2022, at 10:17, Pamela Camejo @.***> wrote:

Thanks for your help. Removing that genomes from the analysis worked.

On 13-09-2022, at 04:03, Gerry Tonkin-Hill @. @.>> wrote:

Hi Pamela,

Yes, in an ideal world the clusters would merge as you describe. However, as all three clusters are very similar they all get classified as paralogs. The very long track of paralogs in GCF_001447115.1_ASM144711v1_genomic then causes issues with the merge script as it does not include the more computationally intensive step to resolve these.

I will look at adding this as an option in the future but I would recommend that filtering out GCF_001447115.1_ASM144711v1_genomic might deal with you issue in the short term.

— Reply to this email directly, view it on GitHub https://github.com/gtonkinhill/panaroo/issues/192#issuecomment-1244986554, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWQF7OGUNTNJPQ3F5GMOZYTV6ARMRANCNFSM57TPFPBA. You are receiving this because you authored the thread.