shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.32k stars 160 forks

[feature suggestion] Reverse translate protein search expression into nucleotide regex or degenerate base sequence #418

Open samuell opened 1 year ago

samuell commented 1 year ago

Describe your issue

My use case: I have located a small "motif" in a protein sequence, and I am interested in finding it again in the nucleotide sequence that codes for the protein.

The motif I was looking for, expressed as a regex, is the following, so let's use it as an example here (. matches any single letter, as per standard regex syntax):

E.SM.YSDN

I would now like to be able to run seqkit grep with this motif against not only protein sequences, but also nucleotide ones.

Using a genetic code table, I can do this by manually converting the motif into a (DNA) nucleotide regex like this one (where [XY] is a character class matching either X or Y at one position):

GA[AG]...AG[CT]ATG...TA[CT]AG[CT]GA[CT]AA[CT]
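The manual conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything seqkit provides: the function names are hypothetical, the standard genetic code (NCBI translation table 1) is assumed, and the motif may contain only single amino-acid letters and "." wildcards. In the standard code, methionine has the single codon ATG, and serine has six codons (TCN and AGY), so a complete pattern needs an alternation for S, which makes it longer than a hand-written one that covers only AG[CT].

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and "*" for stop) to the list of codons that encode it.
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append(codon)


def amino_acid_to_regex(aa):
    """Return a regex fragment matching every codon of one amino acid (hypothetical helper)."""
    # Group codons by their first two bases so the third base becomes a character class.
    by_prefix = {}
    for codon in CODONS[aa]:
        by_prefix.setdefault(codon[:2], []).append(codon[2])
    parts = []
    for prefix, thirds in by_prefix.items():
        if len(thirds) == 1:
            parts.append(prefix + thirds[0])
        else:
            parts.append(prefix + "[" + "".join(sorted(thirds)) + "]")
    # Amino acids whose codons differ in the first two bases (S, L, R, stop) need an alternation.
    return parts[0] if len(parts) == 1 else "(?:" + "|".join(parts) + ")"


def protein_motif_to_dna_regex(motif):
    """Convert a protein motif of amino-acid letters and '.' into a DNA regex (hypothetical helper)."""
    fragments = []
    for residue in motif.upper():
        if residue == ".":
            fragments.append("...")  # any amino acid: any three bases
        elif residue in CODONS:
            fragments.append(amino_acid_to_regex(residue))
        else:
            raise ValueError(f"unsupported motif character: {residue!r}")
    return "".join(fragments)


print(protein_motif_to_dna_regex("E.SM.YSDN"))
```

The printed pattern could then be passed to a regex search over nucleotide sequences. Note that it describes one reading frame on the forward strand only; whether the reverse complement is also searched depends on the tool that consumes the pattern.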

It would be useful not to have to do this conversion manually, but instead to be able to run something like:

seqkit grep --by-seq -r --protein-to-nucleotide -p "E.SM.YSDN" nucleotide_sequences.fa

Of course, something similar could be done using degenerate amino acids/bases too, if that is preferred over regular expressions.
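To illustrate the degenerate-base alternative, here is a rough Python sketch that writes the same motif with IUPAC nucleotide codes instead of a regex. The function names are hypothetical, the standard genetic code (NCBI translation table 1) is assumed, and both "." and X are treated as "any amino acid". One caveat that follows from the genetic code: amino acids such as serine (TCN and AGY) cannot be written as a single degenerate codon without also matching other amino acids, so this sketch returns one degenerate sequence per combination of codon families rather than a single sequence.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# IUPAC nucleotide codes, keyed by the sorted set of bases each code stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def degenerate_codons(aa):
    """Return the degenerate codons for one amino acid, one per two-base prefix (hypothetical helper)."""
    by_prefix = {}
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS):
        if residue == aa:
            by_prefix.setdefault(codon[:2], []).append(codon[2])
    # Collapse the third-position bases of each prefix into a single IUPAC code.
    return ["".join(prefix) + IUPAC["".join(sorted(thirds))]
            for prefix, thirds in by_prefix.items()]


def protein_motif_to_degenerate_dna(motif):
    """Expand a protein motif into all degenerate DNA sequences that encode it (hypothetical helper)."""
    options = []
    for residue in motif.upper():
        if residue in (".", "X"):
            options.append(["NNN"])  # any amino acid: any three bases
        else:
            codons = degenerate_codons(residue)
            if not codons:
                raise ValueError(f"unsupported motif character: {residue!r}")
            options.append(codons)
    # One output sequence per combination of codon families (e.g. TCN vs AGY for serine).
    return ["".join(combo) for combo in product(*options)]


for sequence in protein_motif_to_degenerate_dna("E.SM.YSDN"):
    print(sequence)
```

For the example motif this gives four sequences, because each of the two serines can be encoded by either TCN or AGY. The number of sequences grows multiplicatively with every S, L, or R in the motif, which is one practical argument for the regex form.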

shenwei356 commented 1 year ago

That would be achieved, but is tblastn simpler and faster?

samuell commented 1 year ago

> That would be achieved, but is tblastn simpler and faster?

Perhaps! In my own quick try, it seemed that I needed to put my query sequence into a file before running it, but perhaps there is an easier way to do this.

I can explore this option a little more.