marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

memory usage of MashMap in step8-hicpipeline #216

Closed ghost closed 6 months ago

ghost commented 10 months ago

Hi, when my run reached step8-hicpipeline, I found that mashmap was using only 8 threads and was expected to take more than 20 days. So I modified verkko.sh and changed the parameters to fhc_n_cpus=190 fhc_mem_gb=1200. The step was then expected to finish in 2 days, but memory usage exceeded my limit, reaching 1.5T within one day.

How should I set the parameters to balance time and memory consumption?

skoren commented 10 months ago

190 cores is quite a lot and there is some fixed overhead per thread. I wouldn't recommend using more than 32/64 cores for this step.

That said, 20 days seems very long. I don't think I've ever seen a mashmap run take that long on any sample. How large is your genome and what data types/coverage are you using as the inputs to verkko?

ghost commented 10 months ago

The genome is 8 Gb, with 100x HiFi + 20x ONT (N50 100k) + 50x Hi-C. Here is the log file run_mash.err: run_mashmap.err.csv

To be able to upload the file, I added the csv suffix to the file name.

skoren commented 10 months ago

What is the size of the current assembly.fasta in your folder? What are the contig sizes there? Have you been able to run with the cores increased to 32 or 48?

ghost commented 10 months ago

The file asm/8-hicPipeline/unitigs.fasta is 11G and the file asm/8-hicPipeline/unitigs.hpc.fasta is 7.4G. I tried increasing the cores to 32, but the step still takes a long time to finish. I split unitigs.hpc.fasta into 40 parts and ran the mashmap step on each part; I then want to use the output to replace mashmap.out. I am not sure whether this is possible. If it is, what other files do I need to adjust?

skoren commented 10 months ago

It should be OK to run the step yourself and merge. You might have to run with "--snakeopts --touch" to make sure snakemake uses the new file and doesn't try to recompute it. @Dmitry-Antipov can you take a look at this issue?
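The split-and-merge approach discussed above could be sketched roughly as follows. This is only an illustration, not part of verkko: the function names, the `part_XXX.fasta` naming, and the assumption that per-part mashmap outputs can simply be concatenated line by line are all hypothetical, and the mashmap invocation itself is deliberately left out.

```python
import heapq
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, n_parts, out_dir):
    """Write records into n_parts files, keeping total bases roughly balanced."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme for the part files.
    part_paths = [out_dir / f"part_{i:03d}.fasta" for i in range(n_parts)]
    handles = [open(p, "w") for p in part_paths]
    # Min-heap of (bases written so far, part index): each record goes
    # to whichever part currently holds the fewest bases.
    heap = [(0, i) for i in range(n_parts)]
    try:
        for header, seq in read_fasta(path):
            size, idx = heapq.heappop(heap)
            handles[idx].write(f">{header}\n{seq}\n")
            heapq.heappush(heap, (size + len(seq), idx))
    finally:
        for handle in handles:
            handle.close()
    return part_paths


def merge_outputs(part_outputs, merged_path):
    """Concatenate per-part output files into one file, skipping blank lines."""
    with open(merged_path, "w") as out:
        for part in part_outputs:
            with open(part) as handle:
                for line in handle:
                    if line.strip():
                        out.write(line if line.endswith("\n") else line + "\n")
```

A call such as `split_fasta("asm/8-hicPipeline/unitigs.hpc.fasta", 40, "parts")` would produce the 40 parts mentioned above; after running mashmap on each part separately, `merge_outputs` would combine the results into a single file to put in place of mashmap.out.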

Dmitry-Antipov commented 10 months ago

I've checked the mashmap log, and I do not like the (autodetected) sketch size. Also, the minmer window count is unexpectedly huge, about 100 times larger than what I see on human and primate runs.

How many sequences do you have in your unitigs.hpc.fasta? Possibly there are too many really short nodes?

You can merge those separate runs, but I'm not sure that mashmap's results will be reasonable. Maybe this should be reported to the mashmap authors?

ghost commented 10 months ago

> It should be OK to run the step yourself and merge. You might have to run with "--snakeopts --touch" to make sure snakemake uses the new file and doesn't try to recompute it. @Dmitry-Antipov can you take a look at this issue?

Thank you, I will try it.

ghost commented 10 months ago

> I've checked the mashmap log, and I do not like the (autodetected) sketch size. Also, the minmer window count is unexpectedly huge, about 100 times larger than what I see on human and primate runs.
>
> How many sequences do you have in your unitigs.hpc.fasta? Possibly there are too many really short nodes?
>
> You can merge those separate runs, but I'm not sure that mashmap's results will be reasonable. Maybe this should be reported to the mashmap authors?

I don't know the number of sequences in human and primate runs. As you guessed, there are many short sequences in my assembly. I'm not sure whether these short sequences affect the final assembly. Here are the statistics:

file                                 format  type  num_seqs        sum_len  min_len    avg_len      max_len       Q1      Q2      Q3  sum_gap        N50  Q20(%)  Q30(%)  AvgQual  GC(%)
asm/8-hicPipeline/unitigs.hpc.fasta  FASTA   DNA     35,488  7,783,898,150       17  219,338.9  137,786,981  6,461.5  12,316  16,749        0  8,794,335       0       0        0  47.61

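
The quartiles above (Q3 of 16,749 against an N50 of 8,794,335) suggest that most unitigs are short while most bases sit in long ones. A minimal sketch for quantifying that directly from a FASTA file is shown here; the function names and the 10,000 bp `short_cutoff` are assumptions chosen for illustration, not values used by verkko or mashmap.

```python
def sequence_lengths(path):
    """Return the length of every record in a FASTA file."""
    lengths, current = [], None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line.strip())
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that records of length >= L cover at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths, short_cutoff=10_000):
    """Count records and bases below short_cutoff (an arbitrary, illustrative threshold)."""
    short = [x for x in lengths if x < short_cutoff]
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "n50": n50(lengths),
        "num_short": len(short),
        "short_bases": sum(short),
    }
```

Calling `summarize(sequence_lengths("asm/8-hicPipeline/unitigs.hpc.fasta"))` would report how many sequences fall under the cutoff and how few bases they account for, which is the kind of number asked about earlier in the thread.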
Dmitry-Antipov commented 8 months ago

Another problem I missed the first time I looked at the log: the identity threshold is really low (80%). Mashmap performance decreases significantly with a low identity threshold/high haplotype divergence (the --haplo-divergence verkko option). I suggest decreasing --haplo-divergence (i.e., using the default of 0.05). If the problem remains, I'd suggest creating an issue for MashMap (https://github.com/marbl/MashMap/issues).