lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.79k stars 408 forks source link

[enhancement] Parameter to limit alignment to one strand #91

Closed yupenghe closed 6 years ago

yupenghe commented 6 years ago

Hi Heng, I am trying minimap2 on DNA methylation data from whole-genome bisulfite sequence assay. The data processing needs to align reads to only forward or reverse strand of reference genome. Is there an easy way to do that? Thanks!

Minimap2 is pretty easy to install and the speed is amazing. I would like to incorporate it as part of a pipeline for DNA methylation data I am working on. If mapping to one strand is possible, Minimap2 could easily be part of the pipeline and hopefully shorten the runtime a lot.

Yupeng

lh3 commented 6 years ago

Are you aligning Illumina or pacbio/nanopore reads? Anyway, I will have a look to see how complex it is to implement the feature.

yupenghe commented 6 years ago

Thanks! They are illumina reads: PE (100-150)x2 and SE 100-150.

lh3 commented 6 years ago

How do you deal with paired-end reads? The two ends are mapped to different strands?

yupenghe commented 6 years ago

Yes. For paired-end data, it is about the strand of fragment. So, we will need one option that requires 5' end on forward strand and 3' end on reverse strand. Another option enables 5' end to be mapped on reverse strand and 3' end to be mapped to forward strand.

In the current pipeline, I use bowtie2 as aligner and the option "--nofw/--norc" to do stranded mapping.

lh3 commented 6 years ago

I have just added --for-only and --rev-only to the master. It seems to work for a few reads. It would be good if you test the master more thoroughly before I conclude this issue. Thanks.

armintoepfer commented 6 years ago

It works for me, but I don't see a significant speed boost. Working with CCS pacbio reads. But maybe I'm limited by the DP step for cigar calculation.

lh3 commented 6 years ago

If you know all reads coming from one strand of the reference, mapping to one strand only won't help performance at all. It only helps when half of your reads can't be mapped to one strand.

armintoepfer commented 6 years ago

Yeah and I would have thought you try both strands, nevertheless :) But it works, I don't have any false-negative calls.

lh3 commented 6 years ago

Thanks for the confirmation!

yupenghe commented 6 years ago

Thanks! I will let you know how it works.

yupenghe commented 6 years ago

Hi Heng, The two options work perfectly.

The results look good. I also did benchmark to compare minimap2 and bowtie2 on methylation data. Their precisions and recalls are very close. bowtie2 (2.2.5) works a little bit better than minimap2 (2.7-r655-dirty).

If you are interested, I have packed the code and files needed to reproduce the results here. The tar ball (~30g) contains the final result (in mapping_results.tsv) and all intermediate files. You may want to check out README.md and run_test.sh for the details.

Yupeng

lh3 commented 6 years ago

Thanks a lot for the data. One question: how did you check if an alignment is correct? I ask because for mapping quality >=55, minimap2 should barely make mapping mistakes, but in your table, 1.2% of mapQ>=55 reads are wrong, which is unexpected. My guess is that you are requiring exact coordinate match. However, this usually overestimates the mapping error rate because when there are gaps, the mapping coordinates may be shifted a little. This is particularly true to minimap2/bwa, which may add soft clippings.

With my evaluation script used to draw Figure 1 in my preprint, I got the following figure below. The mapping error rate in the best mapQ bin is much smaller. You can see the criterions in the figure captions. I can help you to run my script. It is not easy to use, though.

I am not saying my evaluation is better. I just want to understand the differences. Evaluating one's own software is almost always biased. Thanks.

roc-color

yupenghe commented 6 years ago

Thanks. The higher error might be due to the different setting. For the methylation data, the aligner has to deal with three-base genome (there are some explanation in the README.md file). Briefly, Cs in reads are converted to Ts. For converted forward reference, Cs are converted to Ts, while Gs are converted to As for converted reference. I can try to simulate normal reads from unconverted genome and see if I can reproduce your results.

The criteria for a read to be counted as match is that the 5' coordinate is exactly the same as the 5' coordinate of true location. The 3' coordinate has no impact on the evaluation. I am happy to try out your script.

Please don't worry. I am very keen to include minimap2 into my pipeline because it runs way faster than bowtie2 and want to explore its potential.

lh3 commented 6 years ago

When there is a true gap within a couple of bp towards the 5'-end of the alignment, the gap is often aligned as mismatches and shifts the 5'-end coordinate a little. The choice between mismatch and gap is primarily determined by the Smith-Waterman scoring system. If you require exact 5'-end coordinate match, you will be essentially evaluating which aligner uses a scoring system closer to the scoring used in simulation. Furthermore, bwa-mem and minimap2 may add softclips to 5'-end if there are multiple mismatches towards the end of the alignment. This changes the 5'-end coordinate, too.

In such evaluations, the better way is to mark an alignment correct if the alignment coordinate is within, say, 10bp around the true coordinate. This is easy to implement and more correctly measures the mapping error rate.

yupenghe commented 6 years ago

I agree. I change the criteria for matching as the 5' position of alignment to be within +/- 10bp range of actual 5' position of true location. Please find the file attached

mapping_results_adj.txt

lh3 commented 6 years ago

In your file, 741 reads of mapQ>=55 in aln_minimap2_f.bam are wrong. This is different from the number in my script. Is it possible to give me the names of these reads? I just want to double check if my script has a bug, which will affect a manuscript in submission. Thanks in advance.

yupenghe commented 6 years ago

It seems that all of those reads have some soft-clipped bases.

wrong_aln_minimap2_f.txt

lh3 commented 6 years ago

Thank you very much. For half of these reads, bowtie2 are not reporting correct CIGARs, either. Take this bowtie2 alignment for example:

seWGBS_w.fastq.000053246:chr16:52945298:52945453:forward   0  chr16  52945304  42  150M

There is a true indel in the first 6bp of the read. Minimap2 clips the 6bp. Bowtie2 aligns the 6bp with 3 mismatches. This makes more bases align and passes the 10bp threshold, but the cigar is wrong. Minimap2 is more aggressively reporting softclip because for real data this happens often due to untrimmed adapters, SVs and errors in the human genome. Arguably, clipping poorly aligned bases is not a type of mapping error.

Looking at the source code, I realize my evaluation script is doing the following. If you see an case like:

|<- l1 ->|<-- o -->|<--- l2 --->|
------------------->                alignment
          ---------------------->   truth

my script regards the alignment to be correct if o/(l1+o+l2)>=0.8 (i.e. the joint length divided by union length is above a threshold). It uses a fraction rather than a length to make the script robust to different read lengths. I am sorry that I was not explaining my strategy precisely, because I forgot the details.

yupenghe commented 6 years ago

Thanks very much for the explanation. I think what you implement in minimap2 makes more sense. Dense mismatches are strong indicator of indel.

  1. By the way, do you have plan to allow multiple fastq files (comma separated) as input? I think it will be a useful feature. For example,
minimap2 --secondary=no --rev-only -t 8 -ax sr index/mm10_r.mmi seWGBS_1.fastq,seWGBS_3.fastq,seWGBS_3.fastq
  1. Here is another change you may want to consider. I found that I have to set --secondary=no for every command involved minimap2. It is better to set "no" as the default but I am not sure if secondary alignments are very useful in other cases.

Again, thanks for all your effort in developing this great aligner.