Use seqkit grep to extract FASTA records by ID with a limit on matches per pattern (e.g. 2; --delete-matched corresponds to 1)

shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

MIT License

Hi Shen,

I translated a large nucleotide file (~2.5 GB) with TransDecoder, and the translated protein file is also large (1.2 GB).

The IDs of the translated proteins look like ID-A.p1, ID-A.p2, ID-A.p3, etc.

In some cases, I want to extract all (or more than one) of the translated sequences for a given ID.

Because I have 10,000 IDs to extract, it is slow to extract these sequences with the following command:

seqkit grep -w 0 --id-regexp "^(\\S+)\\.p\d+\\s?" -f id.txt longest_orfs.fasta -o all_ORF.fasta

There is also the --delete-matched option, which makes the extraction faster when you only want one hit per pattern.

Is there an option to set the maximum number of matches per pattern? That would make the extraction faster.

seqkit grep -w 0 --delete-matched --id-regexp "^(\\S+)\\.p\d+\\s?" -f id.txt longest_orfs.fasta -o largest_ORF.fasta
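
To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written as a standalone Python filter rather than as anything seqkit provides. The function name filter_fasta and the max_hits parameter are my own hypothetical names. It reuses the regular expression from the commands above, keeps at most max_hits records per ID, and stops reading once every ID has reached its limit.

```python
import re
from collections import Counter

# Same pattern as the --id-regexp above: capture the part before ".pN".
ID_RE = re.compile(r"^(\S+)\.p\d+\s?")


def filter_fasta(lines, wanted, max_hits):
    """Yield the FASTA lines of records whose base ID is in `wanted`,
    keeping at most `max_hits` records per ID (hypothetical helper)."""
    hits = Counter()
    remaining = set(wanted)  # IDs that have not reached max_hits yet
    keep = False
    for line in lines:
        if line.startswith(">"):
            if not remaining:
                break  # every ID is satisfied, skip the rest of the input
            m = ID_RE.match(line[1:])
            base = m.group(1) if m else None
            keep = base in remaining
            if keep:
                hits[base] += 1
                if hits[base] >= max_hits:
                    remaining.discard(base)
        if keep:
            yield line


if __name__ == "__main__":
    # Tiny in-memory demo; real input would be an open FASTA file handle.
    demo = [">ID-A.p1 x\n", "MK\n", ">ID-A.p2\n", "ML\n", ">ID-A.p3\n", "MQ\n"]
    print("".join(filter_fasta(demo, {"ID-A"}, 2)), end="")
```

For real data, the set of IDs would be read from id.txt and the lines would come from longest_orfs.fasta opened as a text file. The early exit only helps when every ID actually reaches max_hits; otherwise the whole file is still scanned once.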
