Mapping primers to Illumina reads

lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences

https://lh3.github.io/minimap2

Other

1.72k stars 401 forks source link

Mapping primers to Illumina reads #809

Open sivico26 opened 2 years ago

sivico26 commented 2 years ago

Hi,

I have a particular case and wanted to know If I can address this with minimap2. I have some Illumina reads that come from several amplicons and I also have a list of primers that generated those amplicons. In the end, I want to separate the reads by primer.

My reasoning is to map each primer to all the reads (something like an all vs all) and then parse the .paf to see which primer does each read better match.

Nevertheless, I am trying some configurations and get no matches at all. I think the fact that the primers are so short is the main issue, but I do not how to properly configure minimap2 to handle this.

So, do you think this is feasible with minimap2? If yes, how should I configure it? If no, would you mind suggesting to me an alternative?

Thank you in advance for your thoughts. Cheers

Koen-vdl commented 1 year ago

I tried doing the exact same thing with ONT reads just now. A minimap2based amplicon read-splitter would be lightning fast.

minimap2 -t 64 -x map-ont primers.fasta ONT.fastq.gz

Unfortunately, I also get no hits using 6 primer sequences of about 21bp as a reference to map to.

Some guidance on mapping reads to a reference made up of especially short primer sequences would be awesome @lh3 !

lh3 commented 1 year ago

Oh, I missed this question. You might try

minimap2 -cxsr -t16 -k11 -w5 ONT.fastq.gz primers.fasta

Keep 5'- and 3'-end primer sequences interleaved in primers.fasta

lh3 commented 1 year ago

Increase -k if the command line above is too slow.

Koen-vdl commented 1 year ago

Still no hits with: minimap2 -c -x map-ont -t 60 -k 11 -w 5 primers.fasta barcode91_ITS2.fastq.gz

I uploaded the input files in case you wish to take a look @lh3: https://e.pcloud.link/publink/show?code=kZKsBhZOmbzkJGHex4QEOxJyXhHdBw5BO2k

I can visually see the primer regions in the reads... minimap2is just not picking them up (marked in red in picture):

JDavidson2019 commented 1 year ago

@Koen-vdl , were you able to find a resolution to this issue, as I am facing the same issue in some of my analysis workflows. Thanks!

Koen-vdl commented 1 year ago

Hi @JDavidson2019, Unfortunately, I did not find a fix. As a (slow) workaround, I'm currently using the following seqkitapproach:

seqkit fish --threads $threads -q 20 -F ATGCTTAAATTTAGGGGGTAGTC input.fastq.gz &> ITS2_matches.txt
cut -f 1  ITS2_matches.txt | sort -u > ITS2_readIDs.txt
seqkit grep -f ITS2_readIDs.txt input.fastq.gz -o filtered_ITS2.fastq.gz

JDavidson2019 commented 1 year ago

@Koen-vdl ,

Thank you so much for your response. I tinkered around with it a bit more, and was not able to find a solution either. I got an idea from this paper (https://doi.org/10.1101/2020.08.07.20161737) to use a utility called vsearch and was able to get alignments for your primers using the following command and your test data:

vsearch -maxaccepts 0 --maxrejects 0 --id 0.75 --strand both --wordlength 5 --minwordmatches 2 --usearch_global primers.fasta --db barcode91_ITS2.fastq.gz --alnout test.aln

The command was able to execute quite quickly, so hopefully this workaround will be useful for you.

Thanks again for your help!