lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.77k stars 405 forks source link

Mappy segfault with long sequence and MD=True #1183

Open davidnewman02 opened 5 months ago

davidnewman02 commented 5 months ago

I have a very long read (~500kb) which I try to align using mappy in python using the MD=True flag and this gives a Segfault. This read does not align well, with a large amount of soft-clipping.

This does not appear to happen with the minimap2 standalone executable or when MD=False

read.fastq.txt NB The full reference file is too large for a gitlab issue, but is T2T-v1.0 with a small amount of local cleanup: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.0.fasta

from pathlib import Path

import mappy
print(f"Mappy version: {mappy.__version__}")

from mappy import Aligner, ThreadBuffer

ref = "normal_chr.v1.0.fasta.gz"
print("Loading reference")
aligner = Aligner(ref, preset='map-ont')

print("Loading sequence")
fastq = Path("read.fastq.txt").read_text()
lines = fastq.split('\n')
read_id = lines[0][1:-1]
seq = lines[1][1:-1]

print("Running alignment")
res = next(aligner.map(seq, MD=True))
print(res)

output:

Mappy version: 2.27
Loading reference
Loading sequence
Running alignment
Segmentation fault (core dumped)

The equivalent with standalone minimap2 runs successfully:

$ minimap2 normal_chr.v1.0.fasta.gz read.fastq.txt -ax map-ont -t 40 --MD > out.sam

[M::mm_idx_gen::113.331*1.92] collected minimizers
[M::mm_idx_gen::120.691*3.24] sorted minimizers
[M::main::120.691*3.24] loaded/built the index for 48 target sequence(s)
[M::mm_mapopt_update::122.395*3.21] mid_occ = 1671
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 48
[M::mm_idx_stat::123.104*3.19] distinct minimizers: 101142495 (3.49% are singletons); average occurrences: 11.113; average spacing: 5.338; total length: 5999703092
[M::worker_pipeline::178.461*2.51] mapped 1 sequences
[M::main] Version: 2.27-r1207-dirty
[M::main] CMD: ./minimap2 -ax map-ont -t 40 --MD normal_chr.v1.0.fasta.gz read.fastq.txt
[M::main] Real time: 178.982 sec; CPU: 449.128 sec; Peak RSS: 22.010 GB
lh3 commented 5 months ago

This is probably caused by the same bug #1177 and/or #1181. I will cut a new release later this week.

davidnewman02 commented 5 months ago

I am able to reproduce this bug in earlier mappy versions (tested back to 2.18), so I think this might be a different root cause as those issues appear to be due to more recent code additions.

EDIT: I have just downloaded the latest mappy=2.28 and can confirm the Segfault is still present for this read.