shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License

seqkit locate error #331

PRTGRWL closed this issue 1 year ago

PRTGRWL commented 1 year ago

Hi, I am using seqkit v2.3.0.

Issue: I need to find the flanking regions of the patterns "GTCTCG" and "CTTTG" in a file of multiple concatenated complete genomes. pattern_file.fa contains two patterns in FASTA format, "NNNGTCTCGNNN" and "NNNCTTTGNNN".

I ran:

$ seqkit locate -i -d -P -f pattern_file.fa concat_genome.fa > pattern.tsv

I got results for some genomes, with A, T, G, or C reported in place of each "N", but not for all genomes. I found that if a pattern in the genome file is flanked by A, T, G, or C, I get a result, but if it is flanked by even a single N, I get none. Is there an error in seqkit locate, or am I using it incorrectly?

jmonroynieto commented 1 year ago

Since you are accounting for degenerate sequences and different letter cases, if you are missing any matches, it's possible that some sequences lie at the edges of your other sequences and don't happen to agree on having three flanking letters on both sides.

Maybe a regex would help. Try:

>1
.{0,3}GTCTCG.{0,3}
>2
.{0,3}CTTTG.{0,3}

If this doesn't work please kindly share a reproducible data example.
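To illustrate what the suggested patterns would match, here is a minimal Python sketch. It is not seqkit code; it only applies the same regular expression with Python's `re` module. The helper name `find_with_flanks` and the test sequence are made up for this example.

```python
import re

def find_with_flanks(seq, core, max_flank=3):
    """Return (start, end, matched) for each hit of `core` plus up to
    `max_flank` flanking characters on each side.

    Positions are 1-based and inclusive. '.' matches any character,
    so a flanking N in the sequence is reported as well.
    """
    pattern = re.compile(
        r".{0,%d}%s.{0,%d}" % (max_flank, re.escape(core), max_flank),
        re.IGNORECASE,
    )
    return [(m.start() + 1, m.end(), m.group()) for m in pattern.finditer(seq)]

# Hypothetical sequence: GTCTCG sits two bases from the start and is
# followed by an N; CTTTG sits at the very end of the sequence.
seq = "AGGTCTCGNTACCCCTTTG"

print(find_with_flanks(seq, "GTCTCG"))  # [(1, 11, 'AGGTCTCGNTA')]
print(find_with_flanks(seq, "CTTTG"))   # [(12, 19, 'CCCCTTTG')]
```

The first hit has only two bases on its left and an N on its right, and the second has no right flank at all. A fixed `NNN...NNN` pattern would require three bases on both sides, so neither hit would be reported in full.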

PRTGRWL commented 1 year ago

Thanks for your reply.

I made a test FASTA file containing the sequence below:

>1
ATTGCGTTGNAT

Then I used seqkit locate to search for TTGN. It found only one match, TTGC, but not TTGN.

So I suspect the tool substitutes only A, T, G, or C for the N in the pattern and never matches a literal N in the sequence.

Also, I was using seqkit locate to find a stretch of nucleotides where I know two patterns lie at the ends of the stretch, separated by a gap of fixed length, as shown below:

Pattern1--------------------------Pattern2

Between the two motifs I added Ns of fixed length:

Pattern1NNNNNNNNNNNNPattern2

I know the stretch is present in the genome, but I cannot locate it.

How can this be solved with seqkit?

shenwei356 commented 1 year ago

TTGN ... i found only one pattern. I.e TTGC but not TTGN

As @jmonroynieto mentioned, when the search pattern contains degenerate bases, we search for the bases they represent, not the degenerate bases themselves. If you do want to match a literal N in the sequence, use regular expressions.
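The behaviour described here can be sketched in Python. This is an illustration under assumptions, not seqkit's implementation: the `IUPAC` table below is the standard nucleotide code table written out by hand, and `expand_degenerate` and `locate` are hypothetical helper names. The sequence is the test case from the comment above.

```python
import re

# Assumed expansion table (standard IUPAC nucleotide codes), written out
# for illustration; it is not copied from seqkit's source.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def expand_degenerate(pattern):
    """Turn a degenerate pattern into a regex over concrete bases."""
    return "".join(IUPAC[base] for base in pattern.upper())

def locate(seq, regex):
    """Return (start, end, matched) with 1-based inclusive positions."""
    return [(m.start() + 1, m.end(), m.group())
            for m in re.finditer(regex, seq, re.IGNORECASE)]

seq = "ATTGCGTTGNAT"

# Degenerate search: N in the pattern stands for A, C, G, or T only.
print(expand_degenerate("TTGN"))               # TTG[ACGT]
print(locate(seq, expand_degenerate("TTGN")))  # [(2, 5, 'TTGC')]

# Plain regex: '.' matches any character, including a literal N.
print(locate(seq, "TTG."))                     # [(2, 5, 'TTGC'), (7, 10, 'TTGN')]
```

Under this reading, an N in the sequence is never matched by an N in a degenerate pattern, which is consistent with TTGC being found and TTGN being missed.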

Pattern1--------------------------Pattern2

Replace NNNN with .+ and switch on -r/--use-regexp. Alternatively, use seqkit amplicon; note that pattern2 must be converted to its reverse complement sequence.
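As a rough sketch of both suggestions, the Python below builds a gapped regular expression and computes a reverse complement. It is not seqkit code. The helper names, the motifs `GATC` and `TTGCA`, and the test sequence are hypothetical. The fixed-length filler `.{n}` is a standard regex alternative to `.+` for a gap of known length; it is an addition for illustration, not part of the reply above.

```python
import re

def gapped_regex(left, right, gap=None):
    """Join two motifs with '.+' (any gap) or '.{gap}' (fixed-length gap)."""
    filler = ".+" if gap is None else ".{%d}" % gap
    return re.escape(left) + filler + re.escape(right)

def reverse_complement(seq):
    """Reverse complement of a plain A/C/G/T sequence (case preserved)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]

# Hypothetical motifs and sequence; the five-base gap contains two Ns.
seq = "AAGATCNNACGTTGCAGG"

for regex in (gapped_regex("GATC", "TTGCA"), gapped_regex("GATC", "TTGCA", gap=5)):
    m = re.search(regex, seq, re.IGNORECASE)
    print(regex, (m.start() + 1, m.end(), m.group()))
# GATC.+TTGCA (3, 16, 'GATCNNACGTTGCA')
# GATC.{5}TTGCA (3, 16, 'GATCNNACGTTGCA')

# For the seqkit amplicon route, pattern2 is given as its reverse complement.
print(reverse_complement("TTGCA"))  # TGCAA
```

Note that `.+` is greedy, so when the second motif occurs more than once the match extends to its last occurrence; the fixed-length form avoids this when the gap size is known.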

PRTGRWL commented 1 year ago

Thanks a lot for the help

PRTGRWL commented 1 year ago

Thank you so much sir
