lh3 / unimap

A EXPERIMENTAL fork of minimap2 optimized for assembly-to-reference alignment
MIT License
85 stars 4 forks source link

Options to favor detection of several small deletions instead of single large ones #8

Open elcortegano opened 3 years ago

elcortegano commented 3 years ago

unimap does a great job aligning complex regions (e.g. satellites) even using reads. We are using it to detect some variants at telomeric regions, using PacBio HiFi sequencing data.

We have evidence from one of these regions that there should be 2 separate deletions, corresponding to two separate groups of monomers from a satellite, each of them of different size. However, unimap finds one large deletion instead, close together to several variants that suggest mismapping. I assume this behavior comes from the default settings, that may favour large deletions instead of many small deletions. The program was run with unimap -a -x asm5 -x hifi --cs. I attach snapshot below for this region.

telomere_5_unimap2

I have been trying to tune settings to penalize nucleotide mismatchings and favour more than 1 deletion if convenient. However, I find some problems including:

telomere_5_unimap3

Wonder if there are other options that are worth exploring here. Any hints on how to deal with this situation? Thank you

lh3 commented 3 years ago

Minimap2/unimap won't work with a very large -B because internally every score must be within [-128,127]. I should have added a test for that. Nonetheless, the failure of -O1 -E2 -B7 is caused by something else that I have overlooked. Note that under this scoring, the optimal alignment will have no mismatches. You can turn any mismatch to two gaps for a better score:

ATCATCA     ATC-ATCA
ATCtTCA     ATCt-TCA

This is probably violating the SW formulation unimap/minimap2 is using. You need to make sure ({-O}+{-E})*2>{-B} at least.

In your case, you can try -A1 -B4 -O6 -E2, or -A1 -B4 -O4 -E1.

elcortegano commented 3 years ago

Unfortunately, these options does not solve the issue, and still the majority of reads show the large deletion + mismatches.

From what I understand, it may be possible to try something like -B100 -O1 -E50, only that this would need to be in a future release to allow B > 7? We are aware that this might return unrealistic alignments for most of the genome, but for this particular region allowing multiple deletions could be more appropriate.

Thank you for the hints and feedback,

lh3 commented 3 years ago

The requirement is ({-O}+{-E})*2>{-B}; otherwise the alignment largely doesn't make sense. B<=7 is not a requirement. asm5 uses -A1 -B19 -O39,81 -E2,1. If you have the sequences, I can give a try.

elcortegano commented 3 years ago

Thank you for the clarifications, and for offering giving it a look. I have been trying different combinations of parameters following the above rule, but with no success.

Shall I send you the files to an email address? apparently I cannot attach them here.

ilikvlad commented 3 years ago

Hello,

@elcortegano I would like to ask, how did you obtain such a nice view of the alignment? I guess using some software (?) and fetching it bam/sam format coming out of the unimap? Thank you for your response.

elcortegano commented 3 years ago

Hi @ilikvlad , this is with IGV (https://software.broadinstitute.org/software/igv/download).