soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0

SNPs introduced when using high --max-iterations #50

Open Markusjsommer opened 2 weeks ago

Markusjsommer commented 2 weeks ago

Expected Behavior

A penguin assembly with few or no SNPs relative to the reads used to assemble it.

Current Behavior

A high --max-iterations value results in SNPs that are not supported by the reads used during assembly.

Steps to Reproduce (for bugs)

observable on most samples we tested with --max-iterations 15

also happens with the benchmark rhinovirus data here: https://github.com/AnnSeidel/penguin-analysis/tree/main/benchmarking/rhinovirus-3-mixture on some of the contigs, screenshots attached.

Context

We noticed that for larger viral genomes a high --max-iterations value helped generate a more contiguous assembly. However, when we aligned the reads back to this assembly, we were surprised to find a large number of SNPs that the reads did not support.
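To make "SNPs not supported by the reads" concrete, here is a minimal Python sketch of the kind of check we mean. It is not how penguin or our alignment pipeline works internally. It assumes reads have already been placed on the contig as ungapped alignments given as (0-based start, read sequence) pairs, and the function name `unsupported_positions` and the thresholds `min_depth` and `min_fraction` are hypothetical choices made for illustration only.

```python
from collections import Counter


def unsupported_positions(contig, alignments, min_depth=3, min_fraction=0.1):
    """Flag contig positions whose base is (almost) absent from the aligned reads.

    contig       -- assembled sequence as a string
    alignments   -- iterable of (start, read_seq) pairs; ungapped, 0-based start
                    (a simplifying assumption; real alignments contain indels)
    min_depth    -- skip positions covered by fewer reads than this
    min_fraction -- flag a position if the contig base makes up less than this
                    fraction of the read bases covering it
    """
    # Build a per-position count of the read bases covering the contig.
    pileup = [Counter() for _ in contig]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(contig):
                pileup[pos][base] += 1

    flagged = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little coverage to judge this position
        if counts[contig[pos]] / depth < min_fraction:
            flagged.append((pos, contig[pos], dict(counts)))
    return flagged


if __name__ == "__main__":
    contig = "ACGTACGT"
    reads = [(0, "ACGAACGT"), (0, "ACGAACG"), (2, "GAACGT")]
    for pos, base, counts in unsupported_positions(contig, reads):
        print(f"position {pos}: contig has {base}, reads show {counts}")
```

In this toy input the contig carries a T at position 3 while all three covering reads carry an A, so that position is flagged. In practice the same comparison would be done on a real pileup produced by a read aligner rather than on hand-written tuples.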

For some assemblies there does not seem to be a single --max-iterations value that works: either the assembly is not in one piece, or it contains SNPs. Since these SNPs are not supported by reads, we could use a tool like pilon to correct the assembly, but it may be easier and better to fix this within penguin so that assembly accuracy does not carry this caveat.

See below for the rhinovirus assembly with default settings (few/no SNPs) and with --max-iterations 15 (many SNPs), plus a zoomed-in view.

Screenshot from 2024-10-31 13-11-35 Screenshot from 2024-10-31 13-11-58 Screenshot from 2024-10-31 13-12-18