Closed ghost closed 6 months ago
190 cores is quite a lot and there is some fixed overhead per thread. I wouldn't recommend using more than 32/64 cores for this step.
That said, 20 days seems very long. I don't think I've ever seen a mashmap run take that long on any sample. How large is your genome and what data types/coverage are you using as the inputs to verkko?
The genome is 8 Gb, and the data are 100x HiFi + 20x ONT (N50 100 kb) + 50x Hi-C. Here is the log file run_mash.err
run_mashmap.err.csv
In order to upload the file, I added the .csv suffix to the file name.
What is the size of the current assembly.fasta in your folder? What are the contig sizes there? Have you been able to run with the cores increased to 32 or 48?
The size of asm/8-hicPipeline/unitigs.fasta is 11G, and the size of asm/8-hicPipeline/unitigs.hpc.fasta is 7.4G. I tried increasing the cores to 32, but the step still takes a long time to finish. I split unitigs.hpc.fasta into 40 parts and ran the mashmap step on each part; now I want to merge the outputs to replace mashmap.out. I am not sure if doing this is possible. If it is, what other files do I need to adjust?
It should be OK to run the step yourself and merge; you might have to run with "--snakeopts --touch" to make sure snakemake uses the new file and doesn't try to re-compute it. @Dmitry-Antipov can you take a look at this issue?
I've checked the mashmap log, and I don't like the (autodetected) sketch size. Also, the minmer window count is unexpectedly huge: about 100 times larger than I see on human and primate runs.
How many sequences do you have in your unitigs.hpc.fasta? Possibly there are too many really short nodes?
You can merge those separate runs, but I'm not sure mashmap's results will be reasonable. Maybe this should be reported to the mashmap authors?
Thank you, I will try it.
I don't know the number of sequences on human and primate runs. But as you guessed, there are many short sequences in my assembly. I'm not sure whether these short sequences affect the final assembly. Here are the statistics:
| file | format | type | num_seqs | sum_len | min_len | avg_len | max_len | Q1 | Q2 | Q3 | sum_gap | N50 | Q20(%) | Q30(%) | AvgQual | GC(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| asm/8-hicPipeline/unitigs.hpc.fasta | FASTA | DNA | 35,488 | 7,783,898,150 | 17 | 219,338.9 | 137,786,981 | 6,461.5 | 12,316 | 16,749 | 0 | 8,794,335 | 0 | 0 | 0 | 47.61 |
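As a quick check of how many unitigs are "really short", the lengths can be tallied straight from the FASTA; the 10 kb cutoff below is an arbitrary illustration, not a verkko or mashmap threshold:

```shell
# Count sequences shorter than a cutoff in a (possibly multi-line) FASTA.
# Accumulates each record's sequence, then tallies records below the cutoff.
awk -v cutoff=10000 '
  /^>/ { if (seq != "") { if (length(seq) < cutoff) short++; total++ }; seq = ""; next }
  { seq = seq $0 }
  END { if (seq != "") { if (length(seq) < cutoff) short++; total++ }
        printf "%d of %d sequences are shorter than %d bp\n", short, total, cutoff }
' asm/8-hicPipeline/unitigs.hpc.fasta
```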
Yet another problem I missed the first time I looked at the log: the identity threshold is really low (80%). Mashmap performance decreases significantly for a low identity threshold / high haplotype divergence (the --haplo-divergence verkko option). I suggest decreasing --haplo-divergence (i.e., using the default 0.05). If the problem remains, I'd suggest creating an issue for MashMap (https://github.com/marbl/MashMap/issues).
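To make the connection concrete: if the identity threshold is derived roughly as 100 x (1 - haplo-divergence), which is an assumption consistent with the 80% threshold seen in this log, the default divergence of 0.05 would correspond to about 95%:

```shell
# Rough identity thresholds implied by two --haplo-divergence settings.
# Assumes identity ~ 100 * (1 - divergence); check verkko's docs for the exact mapping.
awk 'BEGIN {
  for (d = 0.05; d <= 0.20001; d += 0.15)
    printf "haplo-divergence %.2f -> ~%.0f%% identity threshold\n", d, 100 * (1 - d)
}'
```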
Hi, when my program reached step 8-hicPipeline, I found that mashmap was only using 8 threads and was expected to take more than 20 days to finish. So I modified verkko.sh and changed the parameters to fhc_n_cpus=190 fhc_mem_gb=1200. After that the step was expected to take 2 days, but memory usage exceeded my limit, reaching 1.5T within one day. How should I set the parameters to balance time and memory consumption?
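A back-of-envelope way to pick a core count, assuming memory grows roughly linearly with threads (an assumption only; mashmap's actual scaling may differ and should be verified on a small run):

```shell
# If 190 threads reached ~1500 GB, estimate per-thread memory and project
# the footprint at smaller thread counts under a linear-scaling assumption.
awk 'BEGIN {
  per = 1500 / 190;                       # ~7.9 GB per thread observed
  for (t = 32; t <= 64; t += 32)
    printf "%d threads -> ~%.0f GB\n", t, per * t
}'
```

Under this rough model, 32 or 64 threads would stay well inside the fhc_mem_gb=1200 budget while still being far faster than the original 8-thread run.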