torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Pacbio reads #81

Closed davidvilanova closed 7 years ago

davidvilanova commented 8 years ago

Is swarm suitable for long PacBio reads?

thanks,

frederic-mahe commented 8 years ago

Hi @davidmartinad200,

Because of PacBio's high error rate and low sequencing depth, swarm might not be the ideal tool. To confirm that, I would need to work with PacBio amplicon data, which I have never had the opportunity to do.

davidvilanova commented 8 years ago

I'm working with 16S data. The error rate is reduced by generating consensus sequences from PacBio reads with at least 9 full passes and by using full-length databases. Sequencing depth is not a problem: even if you get fewer reads, they are longer and span regions that would take many more short reads to cover (see the paper at https://peerj.com/articles/1869/). My guess is that swarm should work pretty well (I have actually tried it on a demo mock and was satisfied with the results). I am just wondering whether there are any special parameters to tweak.
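To illustrate why several passes over the same molecule reduce the error rate, here is a toy sketch in Python. It is not PacBio's actual consensus algorithm: it assumes the passes are already aligned, have the same length, and contain substitution errors only. The function name `majority_consensus` and the example sequences are made up for illustration.

```python
from collections import Counter


def majority_consensus(passes):
    """Return the per-position majority base of equal-length, pre-aligned passes.

    Toy model only: real circular consensus calling also has to handle
    insertions, deletions and alignment, which this sketch ignores.
    """
    if not passes:
        raise ValueError("need at least one pass")
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this toy model expects passes of equal length")
    # zip(*passes) yields one column of bases per position
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Hypothetical molecule and 9 passes, each carrying one substitution error
truth = "ACGTACGTAC"
passes = [
    "TCGTACGTAC",
    "AGGTACGTAC",
    "ACCTACGTAC",
    "ACGAACGTAC",
    "ACGTTCGTAC",
    "ACGTAGGTAC",
    "ACGTACCTAC",
    "ACGTACGAAC",
    "ACGTACGTTC",
]

consensus = majority_consensus(passes)
print(consensus == truth)
```

Every individual pass in this example is wrong at one position out of ten, yet the errors fall at different positions, so the majority vote recovers the original sequence.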

frederic-mahe commented 8 years ago

Hi @davidmartinad200,

that sounds very interesting. The multi-pass approach to reduce overall error rates is great. My concern with sequencing depth is that swarm works better if you have a lot of sequences. "True" sequences and their clouds of micro-variants constitute a landscape where the altitude corresponds to the frequency (i.e. the copy number) of each sequence. To get copy numbers, you need sequencing depth. However, I checked, and PacBio seems to produce enough reads, so that should not be a problem for swarm.
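The landscape picture can be made concrete with a small Python sketch. This is a simplified illustration loosely modelled on the idea described above, not swarm's actual implementation: it counts identical reads to get copy numbers, then grows each cluster from the most abundant unassigned sequence by repeatedly adding sequences that differ by exactly one substitution, insertion or deletion. The function names and the example reads are hypothetical, and swarm's refinement steps are not reproduced.

```python
from collections import Counter


def differ_by_one(a, b):
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = sorted((a, b), key=len)
    # deleting one character from the longer sequence must give the shorter one
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def toy_clusters(abundances):
    """Greedy clustering: seeds are taken by decreasing copy number."""
    unassigned = set(abundances)
    clusters = []
    for seed in sorted(abundances, key=lambda s: (-abundances[s], s)):
        if seed not in unassigned:
            continue
        unassigned.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # sorted() only makes the toy output deterministic
            for seq in sorted(s for s in unassigned if differ_by_one(current, s)):
                unassigned.remove(seq)
                cluster.append(seq)
                frontier.append(seq)
        clusters.append(cluster)
    return clusters


# Hypothetical reads: two "true" sequences, each with a few micro-variants
reads = (["ACGTACGT"] * 50 + ["ACGTACGA"] * 3 + ["ACGTACG"] * 2
         + ["TTTTGGGG"] * 20 + ["TTTTGGGC"] * 1)

abundances = Counter(reads)  # sequence -> copy number (the "altitude")
clusters = toy_clusters(abundances)
for cluster in clusters:
    print([(seq, abundances[seq]) for seq in cluster])
```

With low sequencing depth, most sequences would have a copy number of 1 and the landscape would be flat, which is why depth matters for this kind of approach.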

Regarding parameters, I would suggest trying -f -d 1 first. From there, you can try -d 2 or -d 3 if you want to sacrifice some resolution to further reduce the number of singletons (although there are better ways to do that, such as quality filtering and sample replicates). Swarm was designed to give results that do not depend much on parameters, so there is not much to tweak.
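For anyone scripting these runs, here is a minimal Python sketch that builds the corresponding command lines. Several things in it are assumptions rather than part of the advice above: the file names are hypothetical, the -o option is used to name the output file, -f is dropped for d greater than 1 (as far as I understand the swarm manual, the fastidious option is meant for d = 1), and the input is assumed to be a dereplicated FASTA file with abundance annotations in the format swarm expects.

```python
import subprocess


def swarm_command(fasta_path, output_path, differences=1, fastidious=True):
    """Build a swarm command line as a list of arguments (nothing is executed)."""
    cmd = ["swarm", "-d", str(differences)]
    if fastidious:
        cmd.append("-f")
    cmd += ["-o", output_path, fasta_path]
    return cmd


def run_swarm(cmd):
    """Run the command; requires swarm to be installed and on the PATH."""
    subprocess.run(cmd, check=True)


# Hypothetical input file; one command per value of d
commands = [
    swarm_command(
        "amplicons.fasta",
        f"amplicons_d{d}.swarms",
        differences=d,
        fastidious=(d == 1),  # assumption: -f only used with d = 1
    )
    for d in (1, 2, 3)
]

for cmd in commands:
    print(" ".join(cmd))
    # run_swarm(cmd)  # uncomment to actually launch swarm
```

Comparing the number of clusters and singletons across the three outputs shows how much resolution is traded away at each value of d.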

If you try swarm on long fragments, I would be interested in your feedback. I know swarm works well on amplicons ranging from 75 bp (old Illumina GAIIx) to 400 bp (Illumina MiSeq v3). I would be very happy to get confirmation that it also works on longer amplicons.

davidvilanova commented 8 years ago

Yes, as I said, we have tested swarm on 16S data from a synthetic mock and it performed well: we recovered our expected species. Our amplicons were between 1,000 and 1,600 bp long. David

frederic-mahe commented 7 years ago

I am going to close this issue. Feel free to reopen it if you want to add to the conversation.