Ability to process fastq files

torognes / swarm

A robust and fast clustering method for amplicon-based studies

GNU Affero General Public License v3.0


Ability to process fastq files #60

Closed deprekate closed 9 years ago

deprekate commented 9 years ago

It would be helpful if SWARM could process fastq files, even if it ignored the quality scores and used only the raw nucleotides (incorporating quality scores would be a large undertaking).

frederic-mahe commented 9 years ago

Hi @deprekate,

Here is what I normally do for my own multi-sample analyses:

assemble paired-end FASTQ files (pear),
trim primers from reads (cutadapt),
remove reads with Ns,
dereplicate at the sample level (vsearch),
dereplicate at the study level (i.e. pool samples) (vsearch),
swarm with d = 1 and the fastidious option.
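
As a rough illustration of the "remove reads with Ns" step only, here is a minimal awk sketch. It assumes strict 4-line FASTQ records; the function name `drop_reads_with_n` and the sample reads are made up for this example, and this is not necessarily how the step is done in the pipeline described above.

```shell
# Sketch: drop every FASTQ record whose sequence contains an N.
# Assumes strict 4-line records: header, sequence, "+", qualities.
drop_reads_with_n() {
    awk '{ rec[NR % 4] = $0 }              # buffer the 4 lines of the current record
         NR % 4 == 0 {                     # last line of a record has just been read
             if (rec[2] !~ /[Nn]/)         # rec[2] holds the sequence line
                 print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
         }' "$@"
}

# Made-up input: the second read contains an N and is discarded.
printf '@a\nACGT\n+\nIIII\n@b\nACNT\n+\nIIII\n' | drop_reads_with_n
```

The function reads from standard input or from file arguments, so it can sit between the trimming and dereplication steps of a pipeline.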

In my own experience, I always have to assemble paired-end reads with pear, and I always have to trim primers and adaptors from sequences with cutadapt (if your experience is different, I'd be interested to hear about it). For swarm to be able to work directly with fastq files, we would need to duplicate these rather complex pieces of software.

We are trying to keep swarm light and streamlined: swarm should be an element in a pipeline, not a pipeline in itself.

Thanks for understanding,

deprekate commented 9 years ago

I have unpaired amplicon reads, which are already trimmed, in fastq format. I just want to remove all technical replicates (allowing ~1bp mismatch for sequencing error).

I guess VSEARCH is the tool I want, and not SWARM?

And yep, I am using SWARM in a pipeline to replace a sequence assembly step, when users have amplicon reads, instead of shotgun reads.

colinbrislawn commented 9 years ago

@frederic-mahe :clap:

> We are trying to keep swarm light and streamlined: swarm should be an element in a pipeline, not a pipeline in itself.

Thank you for developing Swarm according to the Unix philosophy of 'do one thing well'.

@deprekate I would recommend converting your fastq files to fasta, then dereplicate with vsearch:

sed -n '1~4s/^@/>/p;2~4p' sequences.fastq > sequences.fasta

vsearch -derep_fulllength sequences.fasta -sizeout -output sequences.derep.fna

If you are only interested in removing '~1bp mismatch', you may be able to use swarm with d=1.
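
The `first~step` addresses in the sed command are a GNU sed extension, so the one-liner may not work with other sed implementations. A possible portable alternative is sketched below with awk. It assumes strict 4-line FASTQ records, and the function name `fastq_to_fasta` and the sample reads are made up for this example.

```shell
# Sketch: FASTQ to FASTA without GNU-specific sed addresses.
# Assumes strict 4-line records: header, sequence, "+", qualities.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }  # header line: swap leading @ for >
         NR % 4 == 2 { print }                    # sequence line: copy unchanged
        ' "$@"
}

# Made-up two-read input, for illustration only.
printf '@read1\nACGT\n+\nIIII\n@read2\nACGA\n+\nIIII\n' | fastq_to_fasta
```

With a file argument, the equivalent of the sed command would be `fastq_to_fasta sequences.fastq > sequences.fasta`.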

frederic-mahe commented 9 years ago

Thanks @colinbrislawn,

indeed the next command is:

swarm -d 1 -z -w sequences.derep.seeds.fna sequences.derep.fna > /dev/null

to cluster using the ;size= abundance annotation (option -z) and to collect the representative sequences in fasta format (option -w), hence using swarm as a denoising method.
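
To make the ;size= annotation concrete, here is a small awk sketch of what dereplication with abundance annotation produces. It only illustrates the idea and is not vsearch's implementation: it assumes single-line FASTA records, compares sequences case-sensitively, and uses a made-up function name (`derep_with_size`) and made-up reads. The exact header layout written by a given vsearch version may differ slightly.

```shell
# Sketch: collapse identical sequences and record their count as ;size=N.
# Assumes single-line FASTA records (one header line, one sequence line).
derep_with_size() {
    awk '/^>/ { id = substr($0, 2); next }           # remember the current read name
         { if (!($0 in count)) first[$0] = id        # keep the first name seen per sequence
           count[$0]++ }                             # count identical sequences
         END { for (s in count) print count[s] "\t" first[s] "\t" s }' "$@" |
    sort -k1,1nr -k2,2 |                             # most abundant first, ties by name
    awk -F '\t' '{ print ">" $2 ";size=" $1; print $3 }'
}

# Made-up input: read1 and read2 are identical, read3 differs by one base.
printf '>read1\nACGT\n>read2\nACGT\n>read3\nACGA\n' | derep_with_size
```

In this toy input, the two identical reads collapse into one entry annotated with ;size=2, which is the kind of abundance value swarm reads from the headers when -z is given.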

frederic-mahe commented 9 years ago

I am going to close this issue. Feel free to re-open it if need be.