torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Strip chr(13) from input fasta files #72

Closed ZeweiSong closed 8 years ago

ZeweiSong commented 8 years ago

I got this message when trying the example FASTA file:

./swarm -t 4 -f -w myfile.fasta test.fasta myfile.swarm
Swarm 2.1.6 [Dec 14 2015 10:59:14]
Copyright (C) 2012-2015 Torbjorn Rognes and Frederic Mahe
https://github.com/torognes/swarm

Please cite: Mahe F, Rognes T, Quince C, de Vargas C, Dunthorn M (2014) Swarm: robust and fast clustering method for amplicon-based studies. PeerJ 2:e593 https://dx.doi.org/10.7717/peerj.593

CPU features: mmx sse sse2 sse3 ssse3 sse4.1 sse4.2 popcnt avx
Database file: test.fasta
Output file: (stdout)
Resolution (d): 1
Threads: 4
Scores: match: 5, mismatch: -4
Gap penalties: opening: 12, extension: 4
Converted costs: mismatch: 9, gap opening: 12, gap extension: 7
Break OTUs: Yes
Fastidious: Yes, with boundary 3

' in sequence on line 2ror: Illegal character '

I just copied and pasted what is in the example and saved it in a .fa file:

seqID1_1000;
ACTGTGACACGGGTGTGTGACACTGTGT
seqID2_200;
ACGCTACTATCGATGCGATCGATGCTAG

It also doesn't work when I try to use the USEARCH-style size label:

seqID1;size=1000;
ACTGTGACACGGGTGTGTGACACTGTGT
seqID2;size=200;
ACGCTACTATCGATGCGATCGATGCTAG

But it did work when I fed in the uchime_denovo output from vsearch, which actually has the size annotation on the second line:

derep_1124357; ;size=566972; AAGTCGTAACAAGGTTTCC derep_704594; ;size=279714; AAGTCGTAACAAGGTTTC

Any idea?

torognes commented 8 years ago

You need to have a ">" character in the beginning of each header line in the FASTA files.

The size annotation must be at the end of the header line. If you use the usearch-style abundance format (header lines ending with ";size=123;") you need to specify the "-z" option to Swarm. If you use the native abundance format (header lines ending with "_123") you must not have a semicolon at the end of the header line.
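
To make the two header conventions concrete, here is a minimal Python sketch. It is not Swarm's actual parser: the function name `parse_abundance`, its `usearch_style` parameter, and the regular expressions are assumptions made for illustration, based only on the rules described above.

```python
import re

# Hypothetical helper for illustration only; Swarm is not written in Python
# and its real parsing rules may differ in detail.
USEARCH_STYLE = re.compile(r";size=(\d+);$")  # header ends with ";size=123;"
NATIVE_STYLE = re.compile(r"_(\d+)$")         # header ends with "_123"


def parse_abundance(header, usearch_style=False):
    """Return the abundance at the end of a FASTA header line, or None."""
    header = header.strip()  # drop surrounding whitespace and line endings
    pattern = USEARCH_STYLE if usearch_style else NATIVE_STYLE
    match = pattern.search(header)
    return int(match.group(1)) if match else None
```

Under these assumptions, `parse_abundance(">seqID1_1000;")` finds no abundance because of the trailing semicolon, while `parse_abundance(">seqID1;size=1000;", usearch_style=True)` does, which mirrors the need for "-z" with usearch-style headers.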

ZeweiSong commented 8 years ago

I did have the ">". GitHub treats ">" as a comment, which is why you cannot see it. I've pasted it as code here.

>seqID1;size=1000;
ACTGTGACACGGGTGTGTGACACTGTGT
>seqID2;size=200;
ACGCTACTATCGATGCGATCGATGCTAG

torognes commented 8 years ago

Based on the error message you got, there is probably some kind of illegal character in your input file. Try copying and pasting again (using "paste as plain text" or similar), perhaps from the text above. Judging by the strange appearance of the error message, it might be a stray carriage return character (ASCII 13).
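
One way to check for such a character is to read the file as raw bytes and look for byte 13. The following is a minimal Python sketch, not part of Swarm; the function name `find_carriage_returns` is hypothetical.

```python
def find_carriage_returns(path):
    """Return the 1-based numbers of lines that contain a CR byte (ASCII 13)."""
    hits = []
    # Binary mode keeps Python from translating line endings behind our back.
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            if b"\r" in line:
                hits.append(number)
    return hits
```

An empty list means the file has no carriage returns. A list such as `[1, 2, 3, 4]` for a four-line file suggests DOS/Windows line endings throughout.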

Also, you need to specify the "-o" option before the final file name on the command line if you want output to go there.

ZeweiSong commented 8 years ago

Thanks, I did fix it by pasting to a new file. Do you mean I have \r in my file?

torognes commented 8 years ago

Yes, it appears so based on the strange error message (' in sequence on line 2ror: Illegal character ').

It should probably be like this: "Error: Illegal character '\r' in sequence on line 2."

torognes commented 8 years ago

It seems like this problem is caused by files that have characters with ascii code 13 (CR, ^M) at the end of the lines. This is typical of files from DOS/Windows. These characters should be stripped. I'll reopen the issue.
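
Until a fixed version is available, the input file can be cleaned before it is passed to Swarm. The following is a minimal Python sketch of such a cleanup; it is not how Swarm itself handles these characters, and the function name `strip_carriage_returns` is hypothetical.

```python
def strip_carriage_returns(src_path, dst_path):
    """Copy src_path to dst_path with CR bytes removed from line ends."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Drop trailing CR/LF bytes, then write a plain LF ending.
            # Note: this also adds a final newline if the last line lacks one.
            dst.write(line.rstrip(b"\r\n") + b"\n")
```

Writing to a separate output file leaves the original untouched, so the two can be compared if the result is unexpected.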

torognes commented 8 years ago

This problem has been fixed in the new version 2.1.7. Thanks for reporting the problem.