torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

amplicon sequence reconstruction with swarm? #147

Closed KabitaBaral1 closed 4 years ago

KabitaBaral1 commented 4 years ago

Hello,

I am currently working on a genotyping project that involves highly variable gene markers. We used the Nextera kit for library preparation prior to Illumina sequencing in paired-end mode (2x250 bp). I was wondering whether Swarm can reconstruct my amplicon sequences from my sheared reads (my marker size is about 850 bp). We are planning to label each cluster as one genotype across pooled samples (each sample is likely to contain multiple genotypes). I read the study you published on Swarm, but I wasn't able to find anything related to sequence reconstruction from random breaks introduced by Nextera or a similar fragmentation approach. Is there any way I could use Swarm to solve this issue?

Thanks in advance,

Kabita Baral

frederic-mahe commented 4 years ago

Hi Kabita, if I understand correctly you have amplified long fragments (800-900 bp) of a marker, and these long fragments were randomly sheared pre-sequencing. You ended up with 2x250 bp read pairs, covering random parts of the long fragments, and you would like to reconstruct the fragments in silico.

I must say I have never had to deal with such data. I don't think swarm would be suited for that task. Maybe you could use vsearch's --usearch_global command to select largely overlapping reads (without mismatches) to reconstruct full-length markers. Then you could map the raw reads to these full-length markers to get abundance counts. But again, I am not sure I have a clear understanding of the data.
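To make the "largely overlapping reads (without mismatches)" idea concrete, here is a minimal Python sketch of exact suffix-prefix overlap merging. It is not how vsearch works internally; the function names and the `min_len` threshold are hypothetical, and it assumes both reads are on the same strand and error-free.

```python
def exact_overlap(left: str, right: str, min_len: int = 20) -> int:
    """Return the length of the longest suffix of `left` that is also a
    prefix of `right`, or 0 if that overlap is shorter than `min_len`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left: str, right: str, min_len: int = 20):
    """Join two reads on their exact overlap; return None if they do not
    overlap by at least `min_len` bases."""
    k = exact_overlap(left, right, min_len)
    return left + right[k:] if k else None


# Toy example: two fragments of a short made-up marker.
marker = "ACGTTGCAAGGCTTAACCGGTTAC"
left_read, right_read = marker[:15], marker[7:]
merged = merge_reads(left_read, right_read, min_len=5)
```

With real data the overlap threshold would need to be much longer than in this toy example, and sequencing errors or repeats inside the marker would make a purely exact approach fragile.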

colinbrislawn commented 4 years ago

Hello Kabita,

I think Frédéric summarized this well. Because no single read covers the full region, your task is closer to genome assembly. You might try a tool like SPAdes (http://cab.spbu.ru/software/spades/) for this task.

Panel A of this figure (https://genome.cshlp.org/content/19/2/336/F2.expansion.html) illustrates one of the problems you will face.

[image: de Bruijn graph examples]

As you can see, this sequence starts and ends the same, but there are multiple ways it can be combined in the middle. If I were doing this, I wouldn't want just one of these sequences. I would want every possible sequence that could be reconstructed here (every walk through the graph). And assembly tools will let you get that.
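The "every walk through the graph" point can be illustrated with a small Python sketch. It builds a toy de Bruijn graph from made-up reads and enumerates every walk that uses each distinct k-mer exactly once. The reads, the value of k, and the function names are all assumptions for illustration; real assemblers such as SPAdes do far more (error correction, coverage, paired-end information).

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_graph(reads, k: int):
    """De Bruijn graph: nodes are (k-1)-mers, each distinct k-mer is an edge."""
    graph = defaultdict(list)
    for kmer in sorted(set().union(*(kmers(r, k) for r in reads))):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def all_walks(graph, start: str):
    """Enumerate every sequence spelled by a walk from `start` that uses
    each edge exactly once (backtracking over the edge set)."""
    total_edges = sum(len(targets) for targets in graph.values())
    results = []

    def extend(node, used, seq):
        if len(used) == total_edges:
            results.append(seq)
            return
        for nxt in graph.get(node, []):
            edge = (node, nxt)
            if edge not in used:
                extend(nxt, used | {edge}, seq + nxt[-1])

    extend(start, frozenset(), start)
    return sorted(results)


# Toy reads: the 2-mer "AC" occurs three times, so the middle is ambiguous.
toy_reads = ["GACTAC", "CTACAAC", "ACAACC"]
walks = all_walks(build_graph(toy_reads, k=3), start="GA")
```

Here both reconstructions share the same start and end but differ in the middle, which is the situation shown in the figure: the k-mers alone cannot tell which one is the true amplicon.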

Once you have reconstructed your amplicon, vsearch (https://github.com/torognes/vsearch/) is a great tool for mapping reads against it.
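As a rough illustration of what mapping reads to reconstructed amplicons yields, the following Python sketch counts reads per genotype using exact substring matching on either strand. This is only a stand-in: vsearch performs alignment with an identity threshold, whereas this toy version tolerates no mismatches. The genotype names, sequences, and function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_reads(reads, amplicons: dict) -> dict:
    """Count reads that match exactly one amplicon, on either strand.
    Reads matching no amplicon or several amplicons are left uncounted."""
    counts = Counter({name: 0 for name in amplicons})
    for read in reads:
        rc = reverse_complement(read)
        hits = [name for name, seq in amplicons.items()
                if read in seq or rc in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)


# Toy data: two made-up genotypes that differ only in the middle.
genotypes = {"g1": "GACTACAACC", "g2": "GACAACTACC"}
toy_reads = ["ACTACA", "TGTAGT", "CAACTA", "GAC", "TTTT"]
abundance = count_reads(toy_reads, genotypes)
```

Note that the short read "GAC" matches both genotypes and is skipped as ambiguous; with closely related genotypes, how such reads are handled has a direct effect on the abundance estimates.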

KabitaBaral1 commented 4 years ago

Hey Torognes/Swarm,

Thank you very much for your help. I appreciate it.

Sincerely,

Kabita Baral
MS Bioinformatics
The University of Texas at El Paso


frederic-mahe commented 4 years ago

Thanks Colin for your input.