lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.77k stars 405 forks source link

minimap2 large memory consumption inside souporcell pipeline #1054

Open clementhelsens opened 1 year ago

clementhelsens commented 1 year ago

Hello,

following this issue: https://github.com/wheaton5/souporcell/issues/179

minimap is used in souporcell pipeline like: https://github.com/wheaton5/souporcell/blob/master/souporcell_pipeline.py#L255#L257

I am out of ideas and thus reach out in case you have any suggestions. I could provide the problematic input file in case you could have a look. Thanks a lot in advance!

lh3 commented 1 year ago

What are the reference and the query sequences?

clementhelsens commented 1 year ago

The reference is danio_rerio from ensembl primary assembly and for the query sequences In this google drive you have the reference genome, the input fastq file and the minimap error:

I use version 2.17-r941 running command:

minimap2 -ax splice -t 20 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no genome.fa tmp_13_0.fq

and the minimap error is:

[M::mm_idx_gen::24.271*1.62] collected minimizers
[M::mm_idx_gen::26.108*2.47] sorted minimizers
[M::main::26.109*2.47] loaded/built the index for 993 target sequence(s)
[M::mm_mapopt_update::26.109*2.47] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 993
[M::mm_idx_stat::27.537*2.39] distinct minimizers: 155448698 (92.45% are singletons); average occurrences: 1.508; average spacing: 5.859
[M::worker_pipeline::31.378*3.90] mapped 555556 sequences
Killed
lh3 commented 1 year ago

Could you try the latest version?

lh3 commented 1 year ago

Your fastq is coordinate sorted. This may cause various problems with both bwa and minimap2. Please randomize the input bam with samtools collate like this example.

clementhelsens commented 1 year ago

hello @lh3 , thanks for the suggestions! Looks like indeed I was using an old version of minimap2. I just tried with 2.26 and it worked just fine, the memory consumption was reasonable and stable!

lh3 commented 1 year ago

Thanks for confirmation. Minimap2 is not developed for short-read RNA-seq, mostly because it can't handle paired-end reads. How does souporcell use minimap2? Does it work?

clementhelsens commented 1 year ago

hello @lh3, I can not answer in detail your question, thus let me tag the main developer @wheaton5

wheaton5 commented 1 year ago

I have used minimap2 for short read scRNAseq with custom parameters. It does not deal with paired end reads as scRNAseq cDNA reads only come from one read. The other read has the UMI and cell barcode etc. I dont think this should be of concern to you. I should deal with this within my own project. I have an option for no remapping or for hisat2. Still, thanks for responding. Fyi I was advised by durbin for my phd and I very much look up to you. I dont want to take up your valuable time on this issue that I should resolve myself.

wheaton5 commented 1 year ago

@clementhelsens lets work together to either use the skip remapping option or use hisat2 which is optimized for short read rnaseq data

wheaton5 commented 1 year ago

Thanks for your response about randomization. This seems like an easy fix on my end. The only reason it is sorted is that the input is a sorted bam but the mapping for rnaseq is very poorly optimized for variant calling. Initially i used minimap to fix this. Later i found hisat2 is better for short read rnaseq (as you said, this is not the intention for minimap2).

clementhelsens commented 1 year ago

Sure! thanks @wheaton5, but as I mentioned in the thread, following @lh3 suggestion, using a more recent version of minimap2 solved the very large memory consumption issue, but I don't know about the reproducibility. Let's follow up on souporcell issue for alternative options. I'm happy to test on my end with my use case which is similar to Fig 1-A here