Open GeoMicroSoares opened 5 months ago
Happy to see that you are considering PenguiN as a potential assembly tool.
From the command calls you used, I guess the final contigs from PenguiN's default parameters are too short, so increasing the --num-iterations parameter is a good idea. I don't know your data, but it might make sense to increase the number of iterations only at the nucleotide level: proteins are usually already assembled to full length within a few iterations, so running more iterations in both stages brings no advantage and increases the risk of redundancy. You can set the number of iterations separately using --num-iterations aa:5,nucl:10 (or, depending on your data, even higher values).
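As a rough sketch of how the separate iteration counts could be passed, the snippet below only builds the command line and prints it; it does not run anything. The --num-iterations aa:5,nucl:10 value is the one given above. The guided_nuclassemble subcommand, the positional argument order and all file names are assumptions for illustration and should be checked against the help output of your PenguiN version.

```python
import shlex


def build_penguin_command(reads_1, reads_2, out_fasta, tmp_dir,
                          aa_iterations=5, nucl_iterations=10):
    """Build (but do not run) a PenguiN call with per-stage iteration counts.

    Assumption: the 'guided_nuclassemble' subcommand and the positional
    argument order are illustrative and not confirmed in this thread.
    """
    return [
        "penguin", "guided_nuclassemble",
        reads_1, reads_2, out_fasta, tmp_dir,
        "--num-iterations", f"aa:{aa_iterations},nucl:{nucl_iterations}",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_penguin_command("reads_1.fastq.gz", "reads_2.fastq.gz",
                                "assembly.fasta", "tmp")
    print(shlex.join(cmd))
```

Raising only nucl_iterations keeps the protein stage at a small number of iterations, in line with the suggestion above.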
Regarding 1) You are right that "aligned concordantly >1 times" is indicative of multimapping. However, whether it is a good or a bad sign is difficult to say. With PenguiN we aim to also resolve very closely related strains, whereas metaSPAdes reconstructs a consensus assembly of a strain mixture. Depending on the mapping sensitivity, reads might map to multiple correctly assembled strain contigs in the PenguiN assembly.
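To compare the multimapping fraction across assemblies, the concordant-alignment lines can be pulled out of each mapping log. The following is a minimal sketch, assuming the logs follow the Bowtie 2 style summary that the "aligned concordantly >1 times" wording suggests. The function name is hypothetical and the example log contains made-up numbers, not values from this thread.

```python
import re

# Matches lines such as "    527 (5.27%) aligned concordantly >1 times".
_CONCORDANT = re.compile(
    r"^\s*(\d+) \(([\d.]+)%\) aligned concordantly "
    r"(0 times|exactly 1 time|>1 times)\s*$",
    re.MULTILINE,
)


def concordant_fractions(log_text):
    """Return {category: (read_pairs, percent)} from an alignment summary."""
    return {
        category: (int(count), float(percent))
        for count, percent, category in _CONCORDANT.findall(log_text)
    }


# Made-up numbers for illustration; not taken from the logs in this thread.
EXAMPLE_LOG = """\
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
96.70% overall alignment rate
"""

if __name__ == "__main__":
    for category, (pairs, percent) in concordant_fractions(EXAMPLE_LOG).items():
        print(f"{category}: {pairs} pairs ({percent:.2f}%)")
```

A higher ">1 times" share for a strain-resolved assembly is expected to some degree, so the number is best read together with a redundancy check rather than on its own.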
On the other hand, PenguiN's approach carries the risk of producing redundant contigs. To overcome dead ends in low-coverage regions during the greedy iterative assembly, PenguiN (and Plass) re-uses reads: different contigs can be extended with the same read, so in principle the same genomic region can be built multiple times in parallel. We introduced a few ideas to minimize this effect, but it cannot be prevented completely. This is why we integrated the Linclust algorithm [Steinegger and Söding, 2018] into PenguiN as the last step and only output the cluster representatives as final contigs. However, Linclust's speed comes at the expense of some sensitivity. In cases where redundancy is problematic, I suggest a more sensitive all-against-all clustering as a post-processing step after the assembly. In the benchmarks in our paper, for example, we added a clustering step using the nucleotide clustering workflow of the MMseqs2 software suite.
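To make the redundancy problem concrete, here is a deliberately naive all-against-all filter that drops any contig that is exactly contained, on either strand, in a longer or equal-length contig. It is neither Linclust nor the MMseqs2 clustering workflow mentioned above, which cluster by sequence similarity rather than exact containment. The function names and toy sequences are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_contained_contigs(contigs):
    """Keep contigs that are not exact substrings of an already kept contig.

    contigs: dict mapping contig name -> nucleotide sequence.
    Returned sequences are upper-cased. The comparison is quadratic in the
    number of contigs, so this is only suitable for small toy inputs.
    """
    kept = {}
    # Visit the longest contigs first so shorter copies are compared to them.
    ordered = sorted(contigs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        query = seq.upper()
        rc = reverse_complement(query)
        if any(query in ref or rc in ref for ref in kept.values()):
            continue  # redundant: this region is already represented
        kept[name] = query
    return kept


if __name__ == "__main__":
    toy = {
        "c1": "ACGTTGCAAGGCT",
        "c2": "TTGCAAGG",   # contained in c1 on the forward strand
        "c3": "CCTTGCAA",   # contained in c1 on the reverse strand
        "c4": "GGGGAAAA",   # unrelated
    }
    print(sorted(drop_contained_contigs(toy)))
```

Real contigs from closely related strains differ by a few mismatches, which exact containment cannot detect; that is the gap a similarity-based clustering step is meant to close.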
Regarding 2) At the moment we are not working on a scaffolding module. However, we have already thought about it.
Hi there,
Congratulations on your tool - I'm really excited about PenguiN as this could be an interesting alternative to explore. As such, I've set out to compare it to our group's gold standard for environmental metagenomics, metaSPAdes, and am getting some really interesting data that maybe you could help me interpret, to see whether we should consider switching to PenguiN.
Here's how everything has been run so far on an example environmental metagenome:
metaSPAdes:
PenguiN:
PenguiN_wmods:
Looking at the log files here's what I see:
Is this something you see a lot in your experience? In principle, I'd say that higher percentages of 'aligned concordantly >1 times' should be indicative of multimapping and thus not a good sign?
Here's a quick plot of median/mean contig lengths (scaffolds for metaSPAdes), with standard deviations as vertical lines from each point:
This makes sense when looking at length frequency distributions for each assembly:
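For anyone wanting to reproduce this kind of length summary on their own assemblies, a minimal sketch follows. It is not the script behind the plot above; the function names are hypothetical, and it assumes plain (uncompressed) FASTA text as input.

```python
import statistics


def contig_lengths(fasta_text):
    """Return the length of every record in FASTA-formatted text."""
    lengths = []
    current = None  # length of the record being read, None before any header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def length_summary(lengths):
    """Mean, median and sample standard deviation of contig lengths."""
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Tiny made-up example; real input would be read from an assembly file.
    example = ">a\nACGT\nAC\n>b\nACGTACGTAC\n>c\nAC\n"
    print(length_summary(contig_lengths(example)))
```

Because contig length distributions are usually strongly skewed, the median tends to be the more informative of the two central values when comparing assemblers.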
Do you contemplate adding a scaffolding module to PenguiN? I wonder how these values could change with that!
I think there's a lot of potential in PenguiN - I'm still reading up on it, but will take any insights you're willing to offer as you look at this data! I can also share rps3 taxonomic profiles I've run on each assembly if you'd want.
Thanks in advance for the attention!