shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.32k stars 159 forks source link

seqkit amplicon only keeps one amplicon #191

Open mlosilla opened 3 years ago

mlosilla commented 3 years ago

Hi,

In version 0.15.0 seqkit amplicon only keeps one amplicon (the largest) per primer pair. I don't always care about the largest amplicon.

Would it be possible to implement the feature of keeping all valid amplicons? A bed file with the start and end bases (columns 2 and 3) of all valid pairs would be fantastic. That would permit me to inspect the amplicons and keep the one I want (in my case, the smallest amplicon over a certain threshold).

Thanks

shenwei356 commented 3 years ago

Right, PCR could produce all combinations of the forward and backward primers. We should output them too.

mlosilla commented 3 years ago

yeah exactly.

In my case, the positions in the bed file would be enough. I would choose the amplicon I want and use the positions as trimming points for my fastq reads--but the sequences of the PCR products may be useful for other users.

Thank you for considering it!

shenwei356 commented 3 years ago

In my case, the positions in the bed file would be enough.

Oh, you can use seqkit loate first, which outputs BED format.

zjhzxjm commented 2 years ago

期待这个功能早日上线

MostafaYA commented 7 months ago

Hi, I have the same issue here. Outputting the largest amplicon only was misleading in my case, as I was looking for an amplicon within repititive RNA genes of an eukaryotic genome. Instead of predicting a correct PCR amplicon of size around 400bp, seqkit amplicon produced an amplicon of ~50 Kb which does not make any sense in my case.

As a solution, I see that outputting all valid amplicons is the best option, or you may add an option expected_amplicon_size to give this length a priority or to make an upper limit for the amplicon prediction.

Here is a toy example

echo -ne ">seq\nAGTACCTTGGTAGGAGTTTCCTGCTAATGATAAGAATGATATTGGACTAAGTAATGTTGCAAATATAGAAACTGAT\n" | seqkit amplicon -F GGTAGG -R ATCAG
echo  "AGTACCTTGGTAGGAGTTTCCTGCTAATGATAAGAATGATATTGGACTAAGTAATGTTGCAAATATAGAAACTGAT" > seq 
echo -ne ">seq\n" > multi_seq.fa 
for seq in {1..10}; do cat seq >> multi_seq.fa; done 
cat multi_seq.fa | seqkit amplicon -F GGTAGG -R ATCAG | seqkit stats