lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.81k stars 415 forks source link

Alignment of nanopore cDNA reads to a tiny exon (6 nucleotides) #253

Closed arnederoeck closed 5 years ago

arnederoeck commented 6 years ago

Hi, we generated PCR amplified cDNA of our gene of interest and sequenced it on MinION, obtaining more than 10,000 x coverage of the transcript. We use these reads for phased analysis of alternative splicing. minimap2 --splice does a great job overall. However, for 1 (alternative) exon of 6 nucleotides long (with a flanking splice acceptor and splice donor site), we are not getting any coverage/splice junctions on the expected coordinates. This exon is confirmed by Gencode, and we know it should be present in our data based on Sanger sequencing of our cDNA libraries. Presumably this extra 6nt sequence is erroneously added to one of the flanking exons, or the reads are not mapped.

I presume that due to the errors in nanopore sequencing, correct alignment of this 6nt sequence will not reach 100% efficiency, but we would like to check it out anyway. Hence my question: Are there any parameters in minimap2 that we could change in order to maximize the chance to obtain this tiny exon?

Many thanks in advance! Arne

tseemann commented 6 years ago

Is it possible for the 6 bp exon to even be aligned when k=15 ?

lh3 commented 6 years ago

Minimap2 is unable to align small exons, even for perfect reads. This is a limitation without an easy workaround.

Minimap2 essentially runs an intron-aware Smith-Waterman (SW) to find introns. It only opens an intron when the alignment with an intron is much better than the alignment without an intron. A 6bp exon can't compensate the penalty of an extra intron. With the current setting, it will always be merged into a nearby exon.

It is in theory possible to adjust the scoring to allow a 6bp exon. However, this is likely to introduce false exons for many other transcripts. This will become worse for Nanopore data.

In the long run, I think a sensible solution is to make minimap2 aware of existing gene annotations (#197). This won't happen soon, unfortunately.

Is it possible for the 6 bp exon to even be aligned when k=15?

Minimap2 doesn't require a k-mer in each exon. For example, if the first and the 4th exons are long enough and can be anchored, minimap2 may correctly align through the 2nd and 3rd exons even if they are not anchored by any k-mer. Minimap2 can't align this 6bp exon mostly due to SW scoring.

arnederoeck commented 6 years ago

Thank you for the clarification! In case I were to adjust the SW penalties to allow a 6nt exon, I presume I should be looking at lowering -O and -E? I'm aware this most likely results in false alignments, but I would like to give it a try. Since I'm looking at a single gene for now, it could still deliver some useful insights.

lh3 commented 6 years ago

You don't need to change -E. You may start with

minimap2 -cx splice -B3 -O3,20

If all 6bp are mismatches when not aligned as an exon, the above should work. However, if there are just a few mismatches, you need to reduce intron penalty even further.

lh3 commented 6 years ago

You may also try -B4 -O4,20 up to -B9 -O9,20. With a large -B, minimap2 is likely to fragment the alignment, though.

arnederoeck commented 6 years ago

Thank you, I will check it out!

armintoepfer commented 6 years ago

Welcome to this year's class "How to overfit 101" 😂

lh3 commented 6 years ago

Yeah. For one gene, this is definitely overfitting. Nonetheless, this experiment might help to improve Iso-seq mapping. The current scoring is more for low-quality nanopore data. There should be room for improvement for Iso-seq just by tuning the scores. A better setting should improve the detection of micro exons and non-canonical splicing. Choosing the right SW scores is quite tricky. I haven't done a systematic study.

arnederoeck commented 6 years ago

Thank you for pointing this out, I'm well aware of the potential overfitting issues here, and will interpret the results accordingly.

arnederoeck commented 6 years ago

At the hazard of offending some statisticians, I'm going to show a few results concerning the overfitting of SW penalty parameters, because I think that there could be some value in these results with regard to alignment of micro exons.

In the figure below I'm showing 12 splice junctions (both canonical and alternative splice junctions), each in a panel. The y-axis depicts the ratio of sequencing reads supporting each junction, and the x-axis and colours correspond to different minimap2 parameters. The first parameter (-B2 -O2,32) corresponds to the standard splice parameters of minimap2. The others are alternative combinations based on the suggestions above.

Junctions 5 and 7 respectively start and end with the splice acceptor or splice donor site of the 6nt exon. Junction 6 is the inversely correlated canonical splicing over the region. As you can see, the standard parameters for minimap2 (as well as -B3 -O3,12 and -B3 -O3,14) do not pickup the 6nt exon and (wrongfully) show canonical splicing instead. Lowering the second integer of -O to 10 or lower in this case enables the detection of the 6nt exon. In contrast, quantification of the other junctions, which are not related to the 6nt exon stay roughly the same.

Hence for the purpose of comparing the relative abundance of (phased) alternative splicing between different samples and for this particular gene and project, I think changing the parameters to allow micro exon alignment is acceptable, but please let me know if someone thinks otherwise.

Thanks for the help and best regards!

swpenalties_iterations