lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

The subseq function adds tabs to the sequence headers #68

Closed (deprekate closed this issue 8 years ago)

deprekate commented 8 years ago

The current version of subseq inserts a tab between the header ID and the comment. (In the output below, tr replaces each tab with @ so it is visible.)

$ seqtk 
Version: 1.0-r82-dirty

$ cat test.fna | tr "\t" "@"
>one comment
ACTG

$ cat test.lst | tr "\t" "@"
one

$ seqtk seq -A test.fna | tr "\t" "@"
>one comment
ACTG

$ seqtk subseq test.fna test.lst | tr "\t" "@"
>one@comment
ACTG

Line 569 of seqtk.c uses a tab instead of a space.

lh3 commented 8 years ago

Like SPACE, TAB is not part of the FASTA name either.

deprekate commented 8 years ago

Yes, but the SPACE in a sequence header should not be replaced with a TAB. It causes problems downstream, especially for software that relies on the SPACE boundary to extract the comment portion (which usually includes functional annotation and taxonomy information).

A TAB also breaks other things, such as building TAB-delimited lists that use the header as a single column: the extra TAB throws off the column count.
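To make the column-count problem concrete, here is a minimal sketch in C. The helper count_tab_columns is hypothetical (it is not part of seqtk); it simply counts the TAB-delimited fields in a line, the way a downstream tool reading a TAB-delimited list would.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of seqtk: count the TAB-delimited
 * columns in one line of text (without its trailing newline). */
size_t count_tab_columns(const char *line)
{
    size_t columns = 1; /* a line with no TAB still has one column */

    assert(line != NULL);
    for (; *line != '\0'; ++line) {
        if (*line == '\t')
            ++columns; /* each TAB starts a new column */
    }
    return columns;
}
```

Under these assumptions, a two-column row built as header, TAB, length gives 2 for "one comment\t4" but 3 for "one\tcomment\t4", so a header that carries its own TAB shifts every later column by one.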

tseemann commented 8 years ago

@lh3 I agree with @deprekate that the current behaviour is undesirable.

The vast majority of FASTA/Q files will use a SPACE between the ID and the COMMENT.

I suspect the TAB wasn't intentional in the FASTA/Q output mode, and printing a SPACE would be more compatible:

if (seq->comment.l) printf("\t%s", seq->comment.s);

should be

if (seq->comment.l) printf(" %s", seq->comment.s);

I hope you can make the change; if you would prefer a pull request, I can submit one.
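The proposed change can also be sketched in isolation. The function below is a hypothetical stand-in, not seqtk's actual code: it takes plain C strings instead of the seq->name and seq->comment fields, and writes the header into a caller-supplied buffer so the separator is easy to inspect.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of seqtk: write ">name" or
 * ">name comment" into buf, joining the two parts with a single SPACE.
 * Returns snprintf's result (the length the full header needs). */
int format_fasta_header(char *buf, size_t size,
                        const char *name, const char *comment)
{
    assert(buf != NULL && name != NULL);

    /* Only emit the separator when there is a non-empty comment. */
    if (comment != NULL && strlen(comment) > 0)
        return snprintf(buf, size, ">%s %s", name, comment);
    return snprintf(buf, size, ">%s", name);
}
```

With name "one" and comment "comment", this sketch produces ">one comment", which matches the header in the input file and contains no TAB.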

lh3 commented 8 years ago

Fixed via e5c4fd9. Thanks.