torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Problem with fasta header ??? #116

Closed davidvilanova closed 5 years ago

davidvilanova commented 5 years ago

Hi, while running swarm on sequences dereplicated with vsearch, I got the following error.

The vsearch output looks fine:

127085163 nt in 302191 seqs, min 260, max 446, avg 421
Sorting 100%
74094 unique sequences, avg cluster 4.1, median 1, max 9503
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%

swarm output

swarm -f -t 3 -d 1 -z $outdir/derep.fa -s $outdir/amplicons.stats -u $outdir/ucfinal.txt -w $outdir/OTUs_temp.fa -l $outdir/logswarm > $outdir/amplicons.swarms

Error: Abundance annotations not found for 74094 sequences, starting on line 1.
>ech97_125
Fasta headers must end with abundance annotations (_INT or ;size=INT).
The -z option must be used if the abundance annotation is in the latter format.
Abundance annotations can be produced by dereplicating the sequences.
The header is defined as the string comprised between the ">" symbol
and the first space or the end of the line, whichever comes first.

This is the first sequence of the dereplicated file (derep.fa):

>ech97_125 M02944:264:000000000-BVKB2:1:1102:7409:22406;size=9503
TAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAACGATGAAGGCCTTCGGGTCGTAAAGTTCTGTTGTTAGGGAAGAACAAGTACCGTTCAAATAGGGCGGTACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCA
GCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCTCTTAAGTCTGATGTGAAATCTCGCGGCTCAACCGCGAGCGGCCATTGGAAACTGGGAGGCTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCG
GTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA

Using vsearch v2.8.5_linux_x86_64 and Swarm 2.2.2 [Jun 21 2018 17:48:50] on linux.

torognes commented 5 years ago

Thanks for reporting this issue.

Could you please provide the command line that you used to dereplicate the sequences with vsearch as well?

The problem seems to be caused by the space in the header line after "ech97_125". Swarm interprets this space as the end of the identifier. Because the -z option was given, swarm expects the identifier to end with an abundance value in the ;size= format; it does not find one, and therefore complains. The abundance of 9503 computed by vsearch is ignored.
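The rule described above can be illustrated with a small shell sketch. This is not swarm's code, only a mimic of the rule stated in the error message; `has_size_annotation` is a hypothetical helper name introduced for illustration.

```shell
#!/bin/sh
# Mimic of the rule in the error message (an illustration, not swarm's code):
# the identifier is the text between ">" and the first space (or end of line),
# and with -z it must end in ;size=INT.
has_size_annotation() {
    header=$1
    id=${header#>}     # drop the leading ">"
    id=${id%% *}       # keep only the text before the first space
    printf '%s\n' "$id" | grep -Eq ';size=[0-9]+$'
}

for h in \
    '>ech97_125 M02944:264:000000000-BVKB2:1:1102:7409:22406;size=9503' \
    '>ech97_125;size=9503'
do
    if has_size_annotation "$h"; then
        echo "annotation found:     $h"
    else
        echo "annotation not found: $h"
    fi
done
```

For the original header the identifier is cut to `ech97_125`, so the check fails; for the cleaned header it succeeds.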

Usually vsearch will drop the part of the header after the space, but I think it may include it when converting directly from FASTQ.

davidvilanova commented 5 years ago

Oh, I see. Here is the command line:

vsearch --derep_fulllength Sequences.fa.gz --output $outdir/derep.fa --uc=$outdir/derepuc.uc --log=$outdir/log.derep --minuniquesize 1 --fasta_width 0 --sizeout

The input file is a fasta file, containing the space. I can remove everything after the space and keep the ;size=INT.

For instance, if I change the header to >ech97_125;size=9503, should it work?

davidvilanova commented 5 years ago

I've used awk to clean the headers in derep.fa:

awk -F'[ :;]' '{print $1,$NF}' OFS=";" $outdir/derep.fa

Probably should have cleaned headers early on....
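One caveat with the awk one-liner as written: it applies to every line, and on a sequence line (a single field) `print $1,$NF` would emit the sequence twice, joined by `;`. A possible variant that rewrites only header lines is sketched below. `clean_headers` is a hypothetical function name, and the sketch assumes the identifier itself contains no space, colon, or semicolon and that the size annotation is the last field of the header.

```shell
#!/bin/sh
# Rewrite only FASTA header lines; pass sequence lines through unchanged.
# Assumption: the identifier has no ' ', ':' or ';' and the header ends
# with the size=INT field.
clean_headers() {
    awk -F'[ :;]' -v OFS=';' '
        /^>/ { print $1, $NF; next }   # header: identifier + last field
        { print }                      # sequence line: unchanged
    '
}

printf '%s\n' \
    '>ech97_125 M02944:264:000000000-BVKB2:1:1102:7409:22406;size=9503' \
    'TAGGGAATCTTCCGCAATGG' | clean_headers
# prints:
# >ech97_125;size=9503
# TAGGGAATCTTCCGCAATGG
```

The function reads from standard input, so it could be used as `clean_headers < derep.fa > derep.clean.fa`, where the output file name is only an example.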

torognes commented 5 years ago

Thanks.

For instance, if I change the header to >ech97_125;size=9503, should it work?

Yes, that should work.

Unfortunately, there seems to be an issue with the derep_fulllength command in VSEARCH. It fails to remove everything in the header starting with the space. That's a bug. Sorry for that. I'll fix it.

frederic-mahe commented 5 years ago

I've used awk to clean headers on derep.fa awk -F'[ :;]' '{print $1,$NF}' OFS=";" $outdir/derep.fa

Or, with sed:

sed -i '/^>/ s/ .*;size=/;size=/' file.fas

Thanks for reporting that issue @davidvilanova
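Since `sed -i` edits the file in place, it may be worth previewing the substitution on a stream first. The sketch below uses the same sed expression without `-i`; `preview_clean` is a hypothetical wrapper name, and the sample input is the header from this issue followed by a shortened sequence.

```shell
#!/bin/sh
# Same substitution as above, but reading stdin and writing stdout,
# so the original file is left untouched.
preview_clean() {
    sed '/^>/ s/ .*;size=/;size=/'
}

printf '%s\n' \
    '>ech97_125 M02944:264:000000000-BVKB2:1:1102:7409:22406;size=9503' \
    'TAGGGAATCTTCCGCAATGG' | preview_clean
# prints:
# >ech97_125;size=9503
# TAGGGAATCTTCCGCAATGG
```

Only lines starting with `>` are touched, and everything from the first space up to the last `;size=` is replaced by `;size=`.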

frederic-mahe commented 5 years ago

The issue in vsearch: https://github.com/torognes/vsearch/issues/338

davidvilanova commented 5 years ago

Thanks, guys, for the quick response!

torognes commented 5 years ago

The bug is fixed in vsearch 2.8.6, just released.