lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2

mappy `.seq` memory leak from FASTA input #705

Open marcus1487 opened 3 years ago

marcus1487 commented 3 years ago

I've experienced a memory leak when using the Python mappy.Aligner initialized with a sequence file (FASTA); this is within Megalodon. The leak does not occur when the Aligner is initialized with a minimap2 index. The problem seems more pronounced when the reference sequence is large (e.g. the human genome) and when the reference is accessed from many aligners (in different threads) at the same time.

I suspect the memory leak may be due to extensive use of the .seq method of the mappy.Aligner within Megalodon. The reference sequence for each mapping is extracted within Megalodon.

Ideally the source of the memory leak could be identified and fixed, but as a stopgap I was hoping to warn users that using a sequence/FASTA reference instead of a minimap2 index could lead to a memory error. When I use the mappy.fastx_read function on a minimap2 index file I get a single "contig" with an empty string for both the contig name and the sequence (list(mappy.fastx_read('ref.fa.mmi')) gives: [('', '', None)]). I could check that this sequence is empty to determine whether the input file was a FASTA or a minimap2 index, but I was wondering if there might be a more robust way to check this?

Thanks for any input and especially for continued development on this project!

lh3 commented 3 years ago

The .seq method is fairly simple. I couldn't identify a leak there.

mappy.fastx_read reads FASTA/FASTQ files. It doesn't read a minimap2 index.
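
Since mappy.fastx_read is not meant for index files, one possible way to tell the two input types apart is to look at the first bytes of the file instead. The following is a minimal sketch, not part of mappy: the function name `sniff_reference_type` is hypothetical, and it assumes a minimap2 index starts with the 4-byte magic string `MMI\2` (the value of `MM_IDX_MAGIC` in minimap2's index.c at the time of writing), which is an implementation detail that could change between index format versions.

```python
import gzip

# Assumption: minimap2 index files begin with this 4-byte magic
# (MM_IDX_MAGIC in minimap2's index.c); verify against your minimap2 version.
MMI_MAGIC = b"MMI\x02"
GZIP_MAGIC = b"\x1f\x8b"


def sniff_reference_type(path):
    """Return 'mmi', 'fastx', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    if head == MMI_MAGIC:
        return "mmi"
    if head[:2] == GZIP_MAGIC:
        # Possibly a gzipped FASTA/FASTQ: peek at the first decompressed byte.
        try:
            with gzip.open(path, "rb") as fh:
                head = fh.read(1)
        except (OSError, EOFError):
            return "unknown"
    # FASTA records start with '>' and FASTQ records with '@'.
    if head[:1] in (b">", b"@"):
        return "fastx"
    return "unknown"
```

A caller could then warn when `sniff_reference_type(ref_path) == "fastx"` before constructing the aligner. This only inspects file headers; it does not validate that the rest of the file is well-formed.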