pierrebarbera / epa-ng

Massively parallel phylogenetic placement of genetic sequences
GNU Affero General Public License v3.0
78 stars 7 forks source link

Placements sensitive to outlier sequences? #33

Open gavinmdouglas opened 4 years ago

gavinmdouglas commented 4 years ago

I recently noticed that the placements of all query sequences can be affected by the presence of a small number of outlier sequences when placing onto a large tree. This problem appears to be especially affect the pendant length estimates and less so the edge placements. I noticed this issue when running different subsets of a dataset with PICRUSt2, which wraps EPA-NG.

I've reproduced this for query datasets of 323 sequences into a tree of 20,000 sequences. In this example only one query sequence differs between the test datasets (essentially in one case the query sequence doesn't align the reference).

In the original case the focal query sequence alignment looks like this:

>02905cfb87861c837dde629596d9272b
....-----GGTCTTGACATC--CCTCT-GACGAGTGAGTAATGT-CG--CTT--T-C--CC--T----T----C--GG---------G------G--C-A-G-A-GGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCTTTAGTAGCCAGCA----GTAAGA-----TGGGAACTCTAGAGAGACTGCCGGGGATAACCCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCAGGGCTACACACGTGCTACAATGGC-G-T-A-A-ACAGAGGGA-AGCGACCCTGTGAAGGTAAGCAAATCCCAAAAATAACGTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGA-ATGTCGCGGTGAATACGTTCCCGG----...

I swapped in random DNA for a different test file (with the same header) and the alignment looks like this (only a single base aligned):

>02905cfb87861c837dde629596d9272b
....--------------T---------------....

You can see in the jplace outputs of running EPA-NG that there are many differences in the placements, particularly in the pendant distances.

I think this issue might be related to issue https://github.com/Pbdas/epa-ng/issues/29. I didn't expect all placements to be affected by a single weird sequence. I wasn't able to reproduce this issue with the example datasets used in the EPA-NG paper and I'm thinking that maybe this issue only arises with large trees. Any insight would be greatly appreciated!

I ran EPA-NG with this command:

epa-ng --tree pro_ref.tre \
       --ref-msa ref_seqs_hmmalign.fasta \
       --query STUDY_SEQS \
       --chunk-size 5000 \
       -T 20 \
       -m pro_ref.model \
       -w epa_out \
       --filter-acc-lwr 0.99 \
       --filter-max 100

You can see the input and output files attached in this zipfile: placement_test.zip.

STUDY_SEQS corresponds to study_seqs_hmmalign_original.fasta and study_seqs_hmmalign_funky.fasta for the original dataset and the dataset with the outlier sequence, respectively. The output jplace files are named the same way.

Thanks,

Gavin

pierrebarbera commented 4 years ago

Hi Gavin,

thank you for finding this! it could very well be related to #29 as you say... though even then it seems strange. My initial thought is there might be some side effects of the optimization functions, though I thought I had properly isolated the relevant code, even to the (slight) detriment of performance.

Early in the new year, when my current paper is finally submitted, I will most likely have another go at epa to solve some issues and include a bunch of useful functionality, such as the issues with memory handling... I'm afraid you will have to stay tuned until then.

Cheers, Pierre

gavinmdouglas commented 4 years ago

Thanks for looking into it Pierre! I should have mentioned that I could reproduce this problem with both EPA-ng v0.3.5 and v0.3.6.

Best,

Gavin

brendanf commented 10 months ago

I am confused... My impression was that the placement of each query sequence by EPA-ng was completely independent from all the other query sequences. Is this just a bug which is becomes visible for large trees and/or weird query sequences, or is there actually some kind of intentional information sharing between queries? This is particularly relevant for me because I have been assuming that it makes no difference for the results to split up my queries (which I have millions of) into smaller batches and run epa-ng on them independently.