marbl / Winnowmap

Long read / genome alignment software
Other
235 stars 22 forks source link

Segmentation fault when mapping from index #29

Closed lucblassel closed 2 years ago

lucblassel commented 2 years ago

Hello,

I am trying to map simulated reads to an indexed reference. To do this I take the following steps:

# Kmer counting
meryl count k=15 output merylDB reference.fasta
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

# Index creation
winnowmap -W repetitive_k15.txt -k 15 -d reference.index reference.fasta

# mapping
winnowmap -W repetititve_k15.txt -c reference.index reads.fasta > mapping.paf

The mapping starts and then is killed by a segmentation fault, here is the log:

[M::main::0.119*0.75] loaded/built the index for 1 target sequence(s)
[M::main::0.119*0.75] running winnowmap in SV-aware mode
[M::main::0.119*0.75] stage1-specific parameters minP:1000, incP:4.00, maxP:16000, sample:1000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::0.119*0.75] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.138*0.78] distinct minimizers: 1212310 (87.49% are singletons); average occurrences: 1.474; average spacing: 25.230
zsh: segmentation fault  winnowmap -W repetitive_k15.txt -c reference.index reads.fasta > mapping.paf

I have tried several configurations for the mapping stage:

And I have tried on 2 different references, simulating 1000 reads with nanosim each time.

In all the cases I get a segmentation fault when mapping to the index.

I tried the same thing with minimap2 using the following commands and everything works as expected:

minimap2 -k 15 -d reference.minimap.index reference.fasta
minimap2 -c reference.minimap.index reads.fasta > mapping.minimap.paf

Software versions:

Am I doing something wrong or is this a software issue ? Thanks in advance

cjain7 commented 2 years ago

Hi @lucblassel , Thanks for reporting this issue. Actually, winnowmap doesn't support separate indexing (reason below*). You can combine indexing and mapping steps. Since indexing time is low for a human genome (only about 2-3 minutes), there is not much to gain by saving the index in a file.

meryl count k=15 output merylDB reference.fasta
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
winnowmap -W repetitive_k15.txt -k 15 -c reference.fasta reads.fasta > mapping.paf

*While indexing, winnowmap also builds a data structure to store frequently occurring k-mers in the reference sequence. While dumping index to a file, this information is not saved unfortunately. The seg-fault occurred because it could not find this structure in memory during mapping. Running indexing + mapping in a single step will resolve this.

lucblassel commented 2 years ago

Thank you very much for the quick answer @cjain7!

Since I am running the mapping step in parallel many times to the same reference I figured computing the index once would save time on the long run.

Would it be complicated to dump the frequent k-mer data structure with/in the index somehow ?

cjain7 commented 2 years ago

Would it be complicated to dump the frequent k-mer data structure with/in the index somehow ?

No, I can try to look into this if I find time.