rrwick / Trycycler

A tool for generating consensus long-read assemblies for bacterial genomes
GNU General Public License v3.0

Challenges recreating demo datasets #36

Closed johnsonj161 closed 2 years ago

johnsonj161 commented 2 years ago

I am hoping to get more insight into how you created your assemblies for the demo read sets. I am trying to replicate the contig clusters you generated for these datasets using assemblies generated from Flye, Miniasm+Minipolish, and Raven but am running into issues with the plasmid assembly in each case. My general process is shown below:

Step 1. Process reads with Filtlong:

    filtlong --min_length 1000 --keep_percent 90 reads.fastq.gz | gzip > filtered_reads.fastq.gz

Step 2. Create read subsets:

    trycycler subsample --reads filtered_reads.fastq.gz --out_dir subsets --count 12

Step 3. Create sub-assemblies using Flye, Miniasm+Minipolish, and Raven:

    # run Flye for read subsets sample_01.fastq to sample_04.fastq
    flye --nano-raw subsets/sample_01.fastq --out-dir flye/sample_01

    # run Miniasm+Minipolish for read subsets sample_05 to sample_08
    minimap2 -x ava-ont subsets/sample_05.fastq subsets/sample_05.fastq > miniasm/sample_05/overlaps.paf
    miniasm -f subsets/sample_05.fastq miniasm/sample_05/overlaps.paf > miniasm/sample_05/assembly.gfa
    minipolish subsets/sample_05.fastq miniasm/sample_05/assembly.gfa

    # run Raven for read subsets sample_09 to sample_12
    raven subsets/sample_09.fastq > assemblies/sample_09.fasta

Step 4. Cluster contigs:

    trycycler cluster --assemblies assemblies/*.fasta --reads filtered_reads.fastq --out_dir clusters
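For anyone following along, the per-subset assignment in Step 3 can be written out as a small dry-run script. This is only a sketch: it prints the commands from the post with the sample name substituted and does not execute anything. The function names `assembler_for` and `commands_for` are hypothetical, and the steps that collect the Flye and Miniasm+Minipolish outputs into `assemblies/` are not shown in the post, so they are left out here as well.

```sh
#!/bin/sh
# Dry-run sketch: print (do not run) the assembly command(s) for each read subset.
# The 4/4/4 split across assemblers follows the post; function names are hypothetical.

# Map a subset number (1-12) to the assembler used for it in the post.
assembler_for() {
    if [ "$1" -ge 1 ] && [ "$1" -le 4 ]; then
        echo flye
    elif [ "$1" -ge 5 ] && [ "$1" -le 8 ]; then
        echo miniasm
    elif [ "$1" -ge 9 ] && [ "$1" -le 12 ]; then
        echo raven
    else
        return 1    # outside the 12 subsets
    fi
}

# Print the commands for one subset, copied from the post with the sample name filled in.
commands_for() {
    sample=$(printf 'sample_%02d' "$1")
    asm=$(assembler_for "$1") || return 1
    case $asm in
        flye)
            echo "flye --nano-raw subsets/$sample.fastq --out-dir flye/$sample"
            ;;
        miniasm)
            echo "minimap2 -x ava-ont subsets/$sample.fastq subsets/$sample.fastq > miniasm/$sample/overlaps.paf"
            echo "miniasm -f subsets/$sample.fastq miniasm/$sample/overlaps.paf > miniasm/$sample/assembly.gfa"
            echo "minipolish subsets/$sample.fastq miniasm/$sample/assembly.gfa"
            ;;
        raven)
            echo "raven subsets/$sample.fastq > assemblies/$sample.fasta"
            ;;
    esac
}

# Walk all 12 subsets and list what would be run for each.
n=1
while [ "$n" -le 12 ]; do
    commands_for "$n"
    n=$((n + 1))
done
```

Reading the printed list before running anything makes it easy to check that each subset goes to exactly one assembler, which is the kind of slip that produces extra assemblies from one tool.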

Below are the trees from the great, good, and mediocre demo datasets. Note that the figures include additional Raven assemblies due to an error in my pipeline.

Great Dataset [image: great_dataset_tree]

Good Dataset [image: good_dataset_tree]

Mediocre Dataset [image: mediocre_dataset_tree]

In each case, I am having issues resolving plasmids (though the great dataset is passable). This contrasts with your examples for the same datasets. Do you have any ideas why my results might be different? For what it is worth, I did try running this pipeline without Filtlong, and it did not solve the issue. Any help is appreciated!

rrwick commented 2 years ago

The fact that your trees look different to mine isn't a cause for concern, and I can think of a few reasons why this is the case:

Regarding the plasmid: yes, plasmids are often troublesome to resolve! I'd say that cluster_2 in your good dataset tree isn't too bad, though one contig is incomplete and two are doubled/tripled and will need some manual repair. The mediocre dataset doesn't have anything useable for the plasmid.
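One rough way to spot incomplete or doubled/tripled contigs before repairing them is to compare contig lengths within a cluster: a contig about two or three times as long as its neighbours is a likely doubling or tripling, and a much shorter one is likely incomplete. The sketch below is an illustration rather than part of Trycycler. The helper `fasta_lengths` is hypothetical, and the default directory `clusters/cluster_002/1_contigs` is an assumption about where the cluster's contigs sit, so adjust it to match your own output.

```sh
#!/bin/sh
# Sketch: list contig lengths for one cluster so outliers stand out.
# fasta_lengths is a hypothetical helper, not a Trycycler command.

# Print "name<TAB>length" for every record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous record
            name = substr($1, 2)                  # header up to first space, minus ">"
            len = 0
            next
        }
        { len += length($0) }                     # sequence may span several lines
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Assumed location of one cluster's contigs; pass a different directory as $1.
contig_dir=${1:-clusters/cluster_002/1_contigs}

for f in "$contig_dir"/*.fasta; do
    [ -e "$f" ] || continue    # skip if the directory is missing or empty
    fasta_lengths "$f"
done
```

Length alone is only a hint, so any contig flagged this way still needs to be inspected before it is trimmed or discarded.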

If you encounter a mediocre case like this in the real world, you would ideally resequence to get better reads. If that's not an option, you can try fiddling with the assembly parameters to see if you can get cleaner clusters.

Hope that helps! Ryan