lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

extracting isoforms with wildcard? #78

Closed KhudyakovLab closed 8 years ago

KhudyakovLab commented 8 years ago

Hi,

I am trying to use seqtk to extract FASTA sequences of specific genes from a transcriptome assembly. I have a list of gene names (example: TR1234|c1_g1). In the assembly file, some of the genes have several isoforms (TR1234|c1_g1_i1 and TR1234|c1_g1_i2). I would like to extract ALL the isoforms for each of the genes in my list. I tried using a list with wildcards for the isoforms (i.e. TR1234|c1_g1_i*), but this did not work. Does anyone have any suggestions for how to deal with this issue?

Thanks so much, Jane

tseemann commented 8 years ago

You need to create a file of the sequence IDs you want to keep, say ids.txt, then run seqtk subseq file.fa ids.txt.

To build ids.txt, you can use standard Unix tools such as grep, sed, and cut.

e.g. grep c1_g1_i file.fa | sed 's/^.//' > ids.txt
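
If you have a whole list of genes rather than a single one, the same idea can be scripted. What follows is only a rough sketch under some assumptions: the gene list is in a file called genes.txt (a name made up here), every isoform ID is the gene name followed by _i and a number, and the ID is the first whitespace-separated word of each header line. The toy file.fa and its IDs are invented for illustration and are written to a scratch directory. The gene names are compared as exact strings rather than as grep patterns, so the | in names like TR1234|c1_g1 is not treated as a regex operator and a gene such as TR9999|c1_g1 is not picked up by accident.

```shell
#!/bin/sh
# Work in a scratch directory so the demo files do not clobber real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Toy assembly (hypothetical IDs modeled on the ones in the question).
cat > file.fa <<'EOF'
>TR1234|c1_g1_i1 len=8
ACGTACGT
>TR1234|c1_g1_i2 len=8
TTGCATGC
>TR1234|c1_g10_i1 len=8
GGGGCCCC
>TR9999|c1_g1_i1 len=8
AAAATTTT
EOF

# Gene list, one gene name per line (genes.txt is an assumed file name).
printf '%s\n' 'TR1234|c1_g1' > genes.txt

# Step 1: collect every sequence ID in the assembly.
# Strip the leading ">" and anything after the first whitespace.
grep '^>' file.fa | sed -e 's/^>//' -e 's/[[:space:]].*$//' > all_ids.txt

# Step 2: keep IDs whose gene part (the ID minus a trailing _i<number>)
# appears in genes.txt. The lookup is an exact string match.
awk 'NR == FNR { want[$1] = 1; next }
     { gene = $1; sub(/_i[0-9]+$/, "", gene); if (gene in want) print $1 }' \
    genes.txt all_ids.txt > ids.txt

cat ids.txt

# Step 3: extract the sequences, as in the reply above (skipped if seqtk is not installed).
if command -v seqtk >/dev/null 2>&1; then
    seqtk subseq file.fa ids.txt > isoforms.fa
fi
```

On real data you would drop the toy-file part and point steps 1 to 3 at your own assembly and gene list. If your isoform suffix looks different from _i followed by digits, the sub() pattern in step 2 is the piece to adjust.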