lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License
1.35k stars 311 forks source link

rename full header #116

Open nwespe opened 6 years ago

nwespe commented 6 years ago

Hello,

I'd like to use seqtk rename to rename fasta headers from something like ">k141_2 flag=3 multi=4.0678 len=200" to ">sequence1". My command is: "seqtk rename input_file.fa sequence > output_file.fa". However, the program only renames the first space-delimited field of the header, so I get ">sequence1 flag=3 multi=4.0678 len=200". Is it possible for the program to replace the entire header, not just the first field?

Thanks

shenwei356 commented 6 years ago

It's simple, replace all the spaces with some other symbols before renaming:

$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | sed 's/ /_/g'
>k141_2_flag=3_multi=4.0678_len=200
actg

$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | sed 's/ /_/g' | seqtk rename - sequence
>sequence1
actg
nwespe commented 6 years ago

Thanks for the workaround. Ideally for me, this would be incorporated into the seqtk program somehow since I'm calling it via python subprocess and would like to limit the shell piping.

shenwei356 commented 6 years ago

Another workaround is calling seqtk twice, it may be still faster than a python script.

$ seqtk seq -C seqs.fa > seqs.fa.tmp
# >k141_2
# actg

$ seqtk rename seqs.fa.tmp sequence > result.fa
nwespe commented 6 years ago

I see - the seq -C option drops the comments in the header. So the perfect solution (for me) would be having the -C option in the rename function. I think this would make sense as a future enhancement to the rename function, since they are both operations on the headers, not the sequences.

shenwei356 commented 6 years ago

In view of modularization, a subcommand only does it's own task. And complex tasks can be done by piping multiple commands.

If you do want an one-command solution, here's one:

$ echo -en ">k141_2 flag=3 multi=4.0678 len=200\nactg\n" | seqkit replace -p ".+" -r "sequence{nr}"
>sequence1
actg
tseemann commented 6 years ago

@nwespe instead of subprocess in python you could just rename the megahit contigs with biopython or just pure python?

Here's my pseudo-code

in= open_for_read "orig"
out = open_for_write "new"
counter=0
while (line = in.getline) 
   if line.starts_with ">"
     line = ">seq" + counter
     counter++
  end
  out.putline line
end

Or even awk or bioawk

nwespe commented 6 years ago

@tseemann Thanks for the suggestion, I ended up implementing essentially your pseudocode. Piping the subprocess calls with error handling proved to be too complicated.

mr-eyes commented 3 years ago

I'm wondering if there is any trick to retrieve the delimiter after the seq->name.s. I want to retrieve the full header and the name.s + comment.s isn't accurate due to the variable delimiter.

mr-eyes commented 3 years ago

I got it worked by removing the + 1 in the following line. But, I don't know the consequences of that edit.

https://github.com/lh3/seqtk/blob/5806831b549e2fcbe5de2f78f5deb9673b75ea9d/kseq.h#L131

mr-eyes commented 3 years ago

Don't know yet the consequences, but it's working fine here. https://github.com/mr-eyes/kseq/