shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.32k stars, 160 forks

Use seqkit grep to extract FASTA records by matching each ID 2 times (e.g. --delete-matched means 1 time) #484

Closed permia closed 3 months ago

permia commented 3 months ago

Hi Shen,

I translated a large nucleotide file (~2.5 G) with TransDecoder, and the translated protein file is also large (1.2 G).

The IDs of the translated proteins look like ID-A.p1, ID-A.p2, ID-A.p3, etc.

In some cases, I want to extract all (or more than one) of the translated sequences for a given ID.

I have 10000 IDs to extract, and extracting these sequences with the following command is slow.

    seqkit grep -w 0 --id-regexp "^(\\S+)\\.p\d+\\s?" -f id.txt longest_orfs.fasta -o all_ORF.fasta
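To make it easier to see what the --id-regexp pattern in the command above captures, here is a minimal Python sketch that applies the same pattern to a few headers. The headers and the helper name `base_id` are made up for illustration, and this is not how seqkit itself is implemented; it only shows that group 1 is the ID with the `.pN` suffix stripped, which is what gets compared against the lines of id.txt.

```python
import re
from typing import Optional

# Same pattern as in the seqkit command, written as a Python raw string.
# Inside bash double quotes, "\\S" reaches the program as \S and "\d" is
# passed through unchanged, so the effective regex is ^(\S+)\.p\d+\s?
ID_REGEXP = re.compile(r"^(\S+)\.p\d+\s?")


def base_id(header: str) -> Optional[str]:
    """Return the text captured by group 1, or None if the header does not match."""
    m = ID_REGEXP.search(header)
    return m.group(1) if m else None


# Hypothetical headers, for illustration only.
for header in ["ID-A.p1", "ID-A.p2 type:complete len:120", "ID-B"]:
    print(header, "->", base_id(header))
```

A header without a `.pN` suffix does not match the pattern at all, so `base_id` returns `None` for it in this sketch.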

There is also an option, --delete-matched, which makes the extraction faster if you only want one hit.

Is there an option to set how many times each pattern may match? That would make the extraction faster.

    seqkit grep -w 0 --delete-matched --id-regexp "^(\\S+)\\.p\d+\\s?" -f id.txt longest_orfs.fasta -o largest_ORF.fasta

shenwei356 commented 3 months ago
  1. 10000 IDs is not a large number.
  2. Do not use --delete-matched if you want all translated sequences.
  3. The slow speed might be due to parsing non-standard sequence IDs or to the huge number of sequences.
  4. Just run the command below and wait.

    seqkit grep -w 0 --id-regexp "^(\S+)\.p\d+" -f id.txt longest_orfs.fasta -o largest_ORF.fasta
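As a rough illustration of the points in this reply, the Python sketch below filters records by looking up the captured ID in a set. This is an assumption-laden sketch, not seqkit's actual implementation. The function name `filter_records` and the `max_hits` parameter are hypothetical: `max_hits` is not a seqkit option, it only mimics the "match each ID N times" behaviour asked about in this issue, with `max_hits=1` being loosely comparable to keeping only the first hit.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Optional, Set, Tuple

# Same pattern as in the command above: group 1 is the ID without ".pN".
ID_REGEXP = re.compile(r"^(\S+)\.p\d+")

Record = Tuple[str, str]  # (header, sequence)


def filter_records(
    records: Iterable[Record],
    wanted: Set[str],
    max_hits: Optional[int] = None,  # hypothetical cap, not a seqkit flag
) -> Iterator[Record]:
    """Yield records whose captured ID is in `wanted`, at most `max_hits` per ID."""
    hits: Counter = Counter()
    for header, seq in records:
        m = ID_REGEXP.search(header)
        if m is None:
            continue
        key = m.group(1)
        if key not in wanted:  # one set lookup per record
            continue
        if max_hits is not None and hits[key] >= max_hits:
            continue
        hits[key] += 1
        yield header, seq


# Made-up records, for illustration only.
records = [("A.p1", "M"), ("A.p2", "MK"), ("B.p1", "MA"), ("A.p3", "MR"), ("C.p1", "MT")]
print([h for h, _ in filter_records(records, {"A", "C"})])
print([h for h, _ in filter_records(records, {"A", "C"}, max_hits=2)])
```

In this sketch each record costs one regex match and one set lookup regardless of how many IDs are wanted, so the run time is driven by the number of records read rather than by the 10000 IDs. A per-ID cap does not change that unless reading can stop early.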