Closed PRTGRWL closed 1 year ago
Since you are accounting for degenerate sequences and different letter cases, if you are missing any matches, it's possible that some sequences lie at the edges of your other sequences and don't happen to agree on having three flanking letters on both sides.
Maybe a regex would help. Try:
>1
.{0,3}GTCTCG.{0,3}
>2
.{0,3}CTTTG.{0,3}
If this doesn't work please kindly share a reproducible data example.
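To illustrate why the relaxed quantifier can recover matches near sequence edges, here is a minimal Python sketch. It uses Python's `re` module on a made-up toy sequence, not seqkit itself, so treat it only as an illustration of the regex behaviour and not of seqkit's internals.

```python
import re

# Hypothetical toy sequence: the motif GTCTCG starts one base from the
# left edge, so there is only one flanking base available on that side.
seq = "AGTCTCGTTACC"

strict = re.compile(r".{3}GTCTCG.{3}")        # exactly 3 bases on each side
relaxed = re.compile(r".{0,3}GTCTCG.{0,3}")   # 0 to 3 bases on each side

strict_hit = strict.search(seq)
relaxed_hit = relaxed.search(seq)

print(strict_hit)            # None: not enough bases to the left of the motif
print(relaxed_hit.group())   # AGTCTCGTTA
```

The strict pattern behaves like a fixed `NNN...NNN` flank, while the relaxed one returns whatever flank is actually available.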
Thanks for your reply.
I made a test FASTA file with the sequence below:
1 ATTGCGTTGNAT
Then I used seqkit locate to find TTGN. It found only one match, TTGC, but not TTGN.
So I suspect the tool substitutes A, T, G, or C for the N in the pattern, but does not match an N in the sequence.
Also, I was trying to find a stretch of nucleotides using seqkit locate, where I know two patterns lying at the ends of the stretch with a fixed gap between them, as shown below:
Pattern1---------------------------Pattern2
Between the pattern motifs I added Ns of fixed length:
Pattern1NNNNNNNNNNNNPattern2
I know it is there in the genome, but I am not able to locate it anywhere.
So, how can this be solved with seqkit?
TTGN ... i found only one pattern. I.e TTGC but not TTGN
As @jmonroynieto mentioned, when the search pattern contains degenerate bases, we search the bases they represent, not the degenerate bases themselves. If you do want this, use regular expressions.
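As a rough illustration of the behaviour described above, the following Python sketch expands a degenerate pattern into a character-class regex and runs it on the test sequence from this thread. The helper `degenerate_to_regex` is hypothetical and is not seqkit's code; it only assumes that each degenerate base is expanded to the concrete bases it represents.

```python
import re

# IUPAC nucleotide codes mapped to the concrete bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degenerate_to_regex(pattern: str) -> str:
    """Hypothetical helper: turn e.g. TTGN into TTG[ACGT]."""
    parts = []
    for base in pattern.upper():
        bases = IUPAC[base]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)

seq = "ATTGCGTTGNAT"  # test sequence from the thread

expanded = degenerate_to_regex("TTGN")
hits_expanded = [m.group() for m in re.finditer(expanded, seq)]

# A hand-written regex that also accepts a literal N in the sequence.
hits_with_n = [m.group() for m in re.finditer("TTG[ACGTN]", seq)]

print(expanded)        # TTG[ACGT]
print(hits_expanded)   # ['TTGC']
print(hits_with_n)     # ['TTGC', 'TTGN']
```

Under that assumption, the N in the pattern never matches an N in the sequence, which is consistent with only TTGC being reported. A regular expression that lists N explicitly (or uses `.`) picks up both sites.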
Pattern1--------‐‐-----------------Pattern2
Replace NNNN with .+ and switch on -r/--use-regexp. Or use seqkit amplicon; note that pattern2 should be converted to its reverse complement sequence.
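Both suggestions can be sketched in Python. The sequence below is invented for illustration, the two motifs are borrowed from the original question, and `reverse_complement` is a hypothetical helper; none of this is seqkit code, only a sketch of the regex and of the reverse-complement step.

```python
import re

# Translation table for complementing plain (non-degenerate) bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Hypothetical helper: complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

pattern1 = "GTCTCG"
pattern2 = "CTTTG"

# Made-up sequence: pattern1, a 12-base gap, then pattern2.
seq = "AA" + pattern1 + "ACGTACGTACGT" + pattern2 + "CC"

# ".+" accepts a gap of any length (at least one base).
any_gap = re.search(pattern1 + ".+" + pattern2, seq)

# ".{12}" pins the gap to exactly 12 bases, matching the twelve Ns.
fixed_gap = re.search(pattern1 + ".{12}" + pattern2, seq)

print(any_gap.group())     # GTCTCGACGTACGTACGTCTTTG
print(fixed_gap.span())    # (2, 25)
print(reverse_complement(pattern2))  # CAAAG
```

One design note: `.+` is greedy, so on a long sequence it can stretch to the last occurrence of pattern2. If the gap length is known, a bounded quantifier such as `.{12}` is likely the closer equivalent of the run of Ns.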
Thanks a lot for the help
Thank you so much sir
-- PREETI AGARWAL Ph.D. Scholar (lab604A) Principal Investigator Dr. Jitendra Narayan CSIR-Institute of Genomics & Integrative Biology (CSIR-IGIB) CSIR-IGIB North Campus, Mall Road, Delhi -110007, India
Hi, I am using seqkit version v2.3.0.
Issue: I need to find the flanking regions of the patterns "GTCTCG" and "CTTTG" in a file of multiple concatenated complete genomes. pattern_file.fa contains two patterns in FASTA format, "NNNGTCTCGNNN" and "NNNCTTTGNNN".
I used $ seqkit locate -i -d -P -f pattern_file.fa concat_genome.fa > pattern.tsv
I got results for some genome files, where A, T, G, or C nucleotides appear in place of "N", but not for all genome files. I found that if a pattern in the genome file is flanked by A, T, G, or C, I get a result, but if the pattern is flanked by even a single N, I get none. Is there an error in seqkit locate, or am I using it incorrectly?