ocxtal / minialign

[IMPORTANT: not for real data analysis, only for algorithm evaluation] fast and accurate alignment tool for PacBio and Nanopore long reads
MIT License

capturing deletions #1

Open zeeev opened 7 years ago

zeeev commented 7 years ago

Greetings,

I've noticed that alignments break over small deletions. Is there a way to control the size of deletion an alignment can contain?

Thank you,

Zev

ocxtal commented 7 years ago

Hi, Zev

In short, there is no way to control the acceptable insertion / deletion size. Although setting the X-drop threshold to a very small value might serve this purpose, it does not seem to be a good choice because it also splits alignments in low-identity regions.
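To illustrate why a small X-drop threshold is a blunt tool, here is a minimal sketch of gapless X-drop extension. It is not minialign's code: the function name `xdrop_extend` and the scoring values are assumptions chosen only for this example.

```python
def xdrop_extend(ref, query, xdrop, match=1, mismatch=-2):
    """Gapless extension from position 0 (illustrative scoring only).

    Stops once the running score falls more than `xdrop` below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = 0
    for i, (r, q) in enumerate(zip(ref, query)):
        score += match if r == q else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > xdrop:
            break  # X-drop termination
    return best, best_len


# A short low-identity stretch (3 mismatches) between two matching regions.
ref = "A" * 10 + "CCC" + "A" * 10
query = "A" * 10 + "GGG" + "A" * 10

small = xdrop_extend(ref, query, xdrop=3)   # (10, 10): stops at the mismatches
large = xdrop_extend(ref, query, xdrop=10)  # (14, 23): extends through them
```

With the small threshold the extension ends at the mismatch cluster even though no indel is present, which is the side effect described above.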

Actually, the detailed answer depends on the length of the deletion.

  1. If it is shorter than 25 bases, the behavior might come from a bug in the alignment routine. I would appreciate it if you could show the actual sequence pair that reproduces the case (and the options provided to the program).

  2. Unfortunately, if it is longer than 25 bases, the behavior is a limitation of the alignment routine. The program uses a fixed-width, 32-cell banded alignment with an adaptive steering technique. Experiments confirm that the algorithm drops indels longer than 25 bases while perfectly capturing shorter ones. (The line for BW = 32, the fourth from the left in Figure 2(d), shows the trend: https://github.com/ocxtal/adaptivebandbench ) Since the algorithm was chosen for the good performance and efficiency of the adaptive band approach, I'm sorry, but the limitation will not be alleviated in the future...😢
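As a rough illustration of why a band bounds the indel size, here is a sketch of a static banded edit distance. It is a simplification and not the adaptive band algorithm itself: the band here stays centered on the main diagonal rather than being steered, and the function name and the half-width values are assumptions for the example, so the exact 25-base figure above does not follow from it.

```python
def banded_edit_distance(ref, query, half_width):
    """Edit distance restricted to cells with |i - j| <= half_width.

    Returns None when the end cell lies outside the band, i.e. when the
    band cannot span the length difference between the two sequences.
    """
    n, m = len(ref), len(query)
    if abs(n - m) > half_width:
        return None
    inf = float("inf")
    # Row 0: only leading insertions are possible.
    prev = {j: j for j in range(0, min(m, half_width) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - half_width), min(m, i + half_width) + 1):
            if j == 0:
                cur[j] = i  # only leading deletions are possible
                continue
            cur[j] = min(
                prev.get(j - 1, inf) + (ref[i - 1] != query[j - 1]),  # (mis)match
                prev.get(j, inf) + 1,     # ref base missing from query
                cur.get(j - 1, inf) + 1,  # extra base in query
            )
        prev = cur
    return prev[m]


ref = "ACGGTCAT" * 10
query = ref[:30] + ref[40:]  # query lacks 10 reference bases

wide = banded_edit_distance(ref, query, half_width=16)   # 10
narrow = banded_edit_distance(ref, query, half_width=8)  # None
```

The 10-base deletion is recovered when the band is wide enough to reach the shifted diagonal and is lost entirely when it is not. An adaptive band moves with the alignment, so its practical limit differs from this static case.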

Thanks,

Hajime

ocxtal commented 7 years ago

Hi, Zev

Minialign has now been updated to version 0.3.2. This release fixes some bugs in the chaining routine that caused the chained path to collapse when it reached the head of the query sequence. The chaining parameters (the side lengths of the parallelogram window) can now be modified with the '-L' and '-H' flags (the defaults have also been enlarged to 5000, so that chains are not split around low-identity regions). I would be glad if you could test this new version.

Thank you.

Hajime

zeeev commented 7 years ago

@ocxtal Sorry I didn't reply sooner. Thank you for the updates. I will re-run the alignments after the Thanksgiving holiday. What parameters would you suggest for -L and -H to maximize INDEL/SV detection?

ocxtal commented 7 years ago

Hi, Zev

Recommending -L and -H settings is difficult (since I'm not familiar with indel/SV calling...), hmm...

Currently I believe that large indel detection should be handled as a postprocess of the local alignment, and could be a preprocess of the SV detection program. However, if you need large indels to be captured at the local alignment stage, I'll consider adding an indel detection algorithm (alignment linking and gap filling) as a postprocess of computing the alignment set.
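The alignment-linking idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not a minialign feature: the `Aln` record, the `infer_indels` function, half-open coordinates, and the default `min_size=25` are all hypothetical, and the alignments are assumed to come from one read on the same strand and reference sequence.

```python
from typing import List, NamedTuple, Tuple


class Aln(NamedTuple):
    """One local alignment, half-open coordinates (hypothetical record)."""
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int


def infer_indels(alns: List[Aln], min_size: int = 25) -> List[Tuple[str, int, int]]:
    """Link consecutive alignments and report the gap-size difference.

    Returns (kind, ref_position, size) tuples, where kind is "DEL" when the
    reference gap is larger than the query gap and "INS" in the opposite case.
    """
    calls = []
    ordered = sorted(alns, key=lambda a: a.query_start)
    for a, b in zip(ordered, ordered[1:]):
        ref_gap = b.ref_start - a.ref_end
        query_gap = b.query_start - a.query_end
        if ref_gap < 0 or query_gap < 0:
            continue  # overlapping or non-colinear pair: skipped in this sketch
        diff = ref_gap - query_gap
        if diff >= min_size:
            calls.append(("DEL", a.ref_end, diff))
        elif -diff >= min_size:
            calls.append(("INS", a.ref_end, -diff))
    return calls


alns = [
    Aln(1000, 1500, 0, 500),
    Aln(1540, 2000, 500, 960),     # 40 reference bases skipped
    Aln(2000, 2300, 1010, 1310),   # 50 query bases skipped
]
calls = infer_indels(alns)
# calls == [("DEL", 1500, 40), ("INS", 2000, 50)]
```

A real postprocess would also need gap filling (re-aligning the unaligned bases between linked alignments) and handling of overlaps, which this sketch leaves out.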

Regards,

Hajime Suzuki

ocxtal commented 7 years ago

Hi, Zev

I have just figured out what the problem is: the extension alignment terminated just before the indels, and the matching regions that followed were not reported...! (I am sorry it took me so long to understand...😢) I have confirmed the phenomenon on my simulated data and I'll add a downstream-rescuing algorithm in the next release.

Thanks,

Hajime

ocxtal commented 7 years ago

Hi, Zev,

I'm sorry for my delayed reply. I've just pushed a minor update, 0.4.2, with a downstream alignment rescuing algorithm. Although the algorithm still fails to collect alignments after short indels, it performs much better than the previous release, 0.4.1. Please try it out.

Here are pileups of my test data.

[Image: minialign 041] minialign-0.4.1 (default params)

[Image: minialign 042] minialign-0.4.2 (default params)

[Image: bwamem] bwa-mem (default params), as a reference

Thanks,

Hajime Suzuki