ropensci / phylotaR

An automated pipeline for retrieving orthologous DNA sequences from GenBank in R
https://docs.ropensci.org/phylotaR

Identical clusters making it into alignments #71

Open avaghezel opened 4 months ago

avaghezel commented 4 months ago

Hello,

I'm not sure whether this is a bug or intentional, but phylotaR consistently produces duplicate or near-duplicate clusters in my analyses. In every run I've performed, each gene is represented in at least two clusters, and these duplicate clusters contain sequences for the same taxa. They may be different haplotypes, but from what I've checked, they tend to BLAST to the same NCBI sequence accession. Occasionally there are minor differences in the alignments (e.g. two 18S clusters will have 1822 bp while a third has ten fewer bp, despite the same taxa being represented), but I'm assuming that's a minor MAFFT issue, and I don't anticipate it having a major impact on any phylogenetic inferences.

I am concerned about the general pattern of ending up with, e.g., 20 clusters when I should really only have 6; it does affect the inferred relationships at the more poorly supported nodes. It's easy enough to examine and drop clusters for small datasets, but not at the larger scale at which I was considering using the program. Is this a bug, or is it intentional for some reason? If the latter, where in the code can I adjust the strictness of the clustering algorithm? I feel like I've looked really thoroughly through your files (which are really nicely annotated, by the way! Super helpful) and just can't find what's going wrong or what to adjust.

Thank you for your help! Ava

ShixiangWang commented 4 months ago

@avaghezel Hi Ava,

Thanks for your report. I am the maintainer of this package but not the original developer. To investigate your issue, I need a reproducible example so that I can reproduce and understand the problem and debug the code.

Best, Shixiang