soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0

general question to gauge dev opinion/advice on selecting proteins for gene phylogenies #43

Closed by clb21565 1 year ago

clb21565 commented 1 year ago

Hi all, thanks for the awesome tool! I have been using it to boost the recovery of specific protein families from metagenomes. Using this tool, I found an increased number of antibiotic resistance genes (ARGs), many of which appear to be genuine variants that were not detected by regular assembly. My intent is to select proteins for phylogenetic analysis to profile their distribution in the environment vs. clinical isolates.

However, I can imagine chimeras and spurious substitutions are an issue. I increased the minimum identity to 97 at first (and now am retrying at min identity = 100). Would you have any other words of wisdom, caution, or advice for using this tool for this purpose?

Thank you!

Connor

milot-mirdita commented 1 year ago

I can't comment much on phylogeny, but for post-processing of Plass results we usually run a round of Linclust to collapse most fragments into clusters. For this, the target-coverage clustering mode at 80% coverage should be quite useful (mmseqs easy-linclust --cov-mode 1 -c 0.8).

Plass allows the same reads to be reused in each iteration (i.e. a kind of sampling with replacement), so it can create quite a bit of variation. Clustering afterwards brings the number of predictions back down to a more manageable size.
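For anyone landing here later, the post-processing step above could be scripted roughly as follows. This is only a sketch: the file names (`plass_proteins.faa`, `plass_clu`, `tmp`) are placeholders, the helper functions are made up for illustration, and it assumes the cluster table written by easy-linclust is a two-column TSV of representative and member IDs.

```python
import csv
import io
from collections import Counter


def linclust_command(fasta, prefix, tmp_dir, coverage=0.8):
    """Build the easy-linclust call from the comment above.

    --cov-mode 1 applies the coverage threshold to the target sequence;
    -c sets the coverage fraction. Paths are supplied by the caller.
    """
    return [
        "mmseqs", "easy-linclust", fasta, prefix, tmp_dir,
        "--cov-mode", "1", "-c", str(coverage),
    ]


def cluster_sizes(tsv_handle):
    """Count members per representative in a cluster TSV.

    Assumes each row is: representative_id <TAB> member_id.
    """
    sizes = Counter()
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if len(row) >= 2:
            sizes[row[0]] += 1
    return sizes


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works
    # without MMseqs2 installed; pass the list to subprocess.run to execute.
    print(" ".join(linclust_command("plass_proteins.faa", "plass_clu", "tmp")))

    # Toy cluster table standing in for a real <prefix>_cluster.tsv.
    toy = io.StringIO("repA\trepA\nrepA\tfrag1\nrepB\trepB\n")
    for rep, n in cluster_sizes(toy).most_common():
        print(rep, n)
```

Looking at cluster sizes is one way to see how much of the Plass output was fragments or near-duplicates of the same protein before picking representatives for a tree.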

clb21565 commented 1 year ago

Thanks for the advice! FWIW, I did some in silico evaluations and it seems to do a pretty good job. It's a pretty exciting find for our particular field, so, grateful for the work and tech support. Closing now. Cheers, Connor