marbl / verkko

265 stars 27 forks source link

AlignTips and genome size #252

Closed diego-rt closed 1 month ago

diego-rt commented 1 month ago

Hello,

I have a few questions regarding the Verkko and the alignTips step:

  1. I'm trying to assemble a very large genome of ~30 GB. Fortunately, verkko has behaved very well and assembled it without major issues. However, I did notice the alignTips step was taking particularly long, and upon checking the logs I realised that winnowmap was generating a multipart index since the genome size is bigger than 4G. I edited the -I flag locally to -I 40G, but it would be nice if it were set by default to a larger value or at least there was a command line option in Verkko to adjust it.
  2. I don't understand why winnowmap is used without the high frequency K-mers counts file as input. If it's not provided, can it still align to highly repetitive regions? I assume the high frequency kmers are not downweighted but also not dropped if done this way?

Thanks a lot in advance for the clarifications!

skoren commented 1 month ago
  1. yes that is a good idea, the normal diploid human genome usually just under 4g (in HPC space) but I think it makes sense to increase the default. Do you have an estimate of how much memory the runs required with -I 40G vs 4G? I'll do some testing with this on my end.

  2. In this case winnowmap will not suppress/downweight any k-mers, it can still align to repetitive regions. In my testing on human and mammal genomes it didn't make much difference in speed given how few reads we are aligning with it so the cost of finding the k-mers was equal to the extra mapping time or greater. Would you be willing to test this on your genome? You'd want to count the k-mers from the compressed fasta that winnowmap is mapping to with meryl (part of verkko) and pass the high count ones to winnowmap on one partition compared to having no k-mers.

There is a fix in v2.1 to reduce the number of reads recruited for the winnowmap alignment step which speeds it up significantly on genomes with very homozygous or heterozygous genomes (essentially those where the initial graph had large nodes from the 1-buildGraph step). Depending on what your graph looks like in that step, this could also help your runtime.

diego-rt commented 1 month ago

Super, thanks for the quick feedback.

So the alignTipsONT process took ~36 min and used ~66G with -I 40G. With the default 4G, it ran for more than 8 hours before I killed it and used 50G. So not a big difference in memory but a big difference in runtime (plus the corruption of the mapqs values with the multipart index). Anyway I think the flag is described as something like "load UP TO X bases" so I dont think it will be particularly detrimental when used on a normal sized genome.

I dont think the runtime of this step is excessive when the -I is set appropriately. Thanks for the explanation on the high frequency kmers. I just didnt understand how it was supposed to behave and I agree it makes sense that the meryl step would take longer than the few minutes you would save in the alignment.

I just checked and the following meryl command took took 21 min and 3.5G memory on the 28G tips.fasta.

meryl count k=15 output merylDB tips.fasta

meryl print greater-than distinct=0.9998 merylDB > kmers.txt

On the other hand, the following winnowmap command with kmers included took 42 min and 68G memory with 24 threads:

winnowmap -t 24 -I 40G -W kmers.txt -cx map-ont tips.fasta ont005.fasta.gz |sed s/de:f://g |awk -F "\t" '{ if ($12 >= 20 && $4-$3 > 5000 && 1-$21 >= 0.9) { if (match($5, "-")) print $1"\t"$2"\t"$3"\t"$4"\t+\t<"$6"\t"$7"\t"$7-$9"\t"$7-$8"\t"$10"\t"$11"\t"$12"\t"$13"\t"$15"\tdv:f:"$21"\tid:f:"1-$21; else print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t>"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$15"\tdv:f:"$21"\tid:f:"1-$21 }}' > test.gaf
INFO:    Using cached SIF image
[M::mm_idx_gen::0.002*12.83] reading downweighted kmers
[M::mm_idx_gen::0.004*5.02] collected downweighted kmers, no. of kmers read=1914
[M::mm_idx_gen::0.005*4.45] saved the kmers in a bloom filter: hash functions=2 and size=27520 
[M::mm_idx_gen::1478.188*1.15] collected minimizers
[M::mm_idx_gen::1486.245*1.25] sorted minimizers
[M::main::1486.246*1.25] loaded/built the index for 396498 target sequence(s)
[M::main::1486.247*1.25] running winnowmap in SV-aware mode
[M::main::1486.247*1.25] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::1486.247*1.25] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 396498
[M::mm_idx_stat::1486.279*1.25] distinct minimizers: 1764520 (7.98% are singletons); average occurrences: 660.631; average spacing: 25.487
[M::worker_pipeline::2217.095*15.76] mapped 14398 sequences
[M::worker_pipeline::2556.319*19.55] mapped 7820 sequences
[M::main] Version: 2.03, pthreads=24, omp_threads=3
[M::main] CMD: /users/diego.terrones/miniforge3/envs/verkko/bin/winnowmap -t 24 -I 40G -W kmers.txt -cx map-ont tips.fasta ont005.fasta.gz
[M::main] Real time: 2556.760 sec; CPU: 49979.889 sec; Peak RSS: 68.487 GB

Strangely, running it without the kmers was actually much faster?

winnowmap -t 24 -I 40G -cx map-ont tips.fasta ../3-alignTips/split/ont005.fasta.gz |sed s/de:f://g |awk -F "\t" '{ if ($12 >= 20 && $4-$3 > 5000 && 1-$21 >= 0.9) { if (match($5, "-")) print $1"\t"$2"\t"$3"\t"$4"\t+\t<"$6"\t"$7"\t"$7-$9"\t"$7-$8"\t"$10"\t"$11"\t"$12"\t"$13"\t"$15"\tdv:f:"$21"\tid:f:"1-$21; else print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t>"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$15"\tdv:f:"$21"\tid:f:"1-$21 }}' > ../3-alignTips/aligned005.WORKING.gaf && mv -f ../3-alignTips/aligned005.WORKING.gaf ../3-alignTips/aligned005.gaf
[M::mm_idx_gen::0.001*3.74] reading downweighted kmers
[M::mm_idx_gen::0.001*3.21] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::0.001*3.20] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::1113.526*1.22] collected minimizers
[M::mm_idx_gen::1122.723*1.38] sorted minimizers
[M::main::1122.723*1.38] loaded/built the index for 396498 target sequence(s)
[M::main::1122.723*1.38] running winnowmap in SV-aware mode
[M::main::1122.723*1.38] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::1122.723*1.38] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 396498
[M::mm_idx_stat::1122.787*1.38] distinct minimizers: 1707490 (7.60% are singletons); average occurrences: 680.950; average spacing: 25.552
[M::worker_pipeline::1838.285*18.63] mapped 14398 sequences
[M::worker_pipeline::2177.726*22.78] mapped 7820 sequences
[M::main] Version: 2.03, pthreads=24, omp_threads=3
[M::main] CMD: /users/diego.terrones/miniforge3/envs/verkko/bin/winnowmap -t 24 -I 40G -cx map-ont tips.fasta ../3-alignTips/split/ont005.fasta.gz
[M::main] Real time: 2179.091 sec; CPU: 49613.127 sec; Peak RSS: 68.131 GB

So yeah, totally not worth it to do the kmer counting step.

skoren commented 1 month ago

Do you have a custom version of winnowmap? As far as I see from this issue (https://github.com/marbl/Winnowmap/issues/40) it doesn't support the index size parameter and when I provide the -I flag it doesn't run.

diego-rt commented 1 month ago

Yes I was also confused by that issue. In practice, the -I flag has worked for me and i'm using the standard latest version. The only caveat is that precomputing indexes is not working and you need to index on the fly every time.

diego-rt commented 1 month ago

I'm running winnowmap (from verkko conda env) without the -I flag and now its only loading a part of the target sequences.

This is the content of the tips.fasta file:

(verkko) [diego.terrones@clip-login-0 test]$ seqkit stat -a tips.fasta 
file        format  type  num_seqs         sum_len  min_len   avg_len     max_len     Q1      Q2      Q3  sum_gap        N50  Q20(%)  Q30(%)  GC(%)
tips.fasta  FASTA   DNA    396,498  29,709,964,386    1,284  74,930.9  26,840,996  4,707  16,511  30,223        0  1,110,928       0       0  46.87

This is the command, now without the -I:

/users/diego.terrones/miniforge3/envs/verkko/bin/winnowmap -t 24 -cx map-ont tips.fasta ont005.fasta.gz |sed s/de:f://g |awk -F "\t" '{ if ($12 >= 20 && $4-$3 > 5000 && 1-$21 >= 0.9) { if (match($5, "-")) print $1"\t"$2"\t"$3"\t"$4"\t+\t<"$6"\t"$7"\t"$7-$9"\t"$7-$8"\t"$10"\t"$11"\t"$12"\t"$13"\t"$15"\tdv:f:"$21"\tid:f:"1-$21; else print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t>"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$15"\tdv:f:"$21"\tid:f:"1-$21 }}' > test.gaf

This is the output so far. Note the 78310 target sequence(s) part and so forth. It is clearly running a multipart index and aligning the reads to each sequence chunk:

[M::mm_idx_gen::0.003*4.16] reading downweighted kmers
[M::mm_idx_gen::0.004*3.44] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::0.004*3.08] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::149.105*1.19] collected minimizers
[M::mm_idx_gen::150.756*1.33] sorted minimizers
[M::main::150.757*1.33] loaded/built the index for 78310 target sequence(s)
[M::main::150.757*1.33] running winnowmap in SV-aware mode
[M::main::150.758*1.33] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::150.758*1.33] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 78310
[M::mm_idx_stat::150.778*1.33] distinct minimizers: 1338453 (8.14% are singletons); average occurrences: 116.999; average spacing: 25.556
[M::worker_pipeline::2697.418*43.05] mapped 14398 sequences
[M::worker_pipeline::4037.396*43.83] mapped 7820 sequences
[M::mm_idx_gen::4037.593*43.82] reading downweighted kmers
[M::mm_idx_gen::4037.594*43.82] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::4037.594*43.82] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::4190.258*42.27] collected minimizers
[M::mm_idx_gen::4191.336*42.27] sorted minimizers
[M::main::4191.337*42.27] loaded/built the index for 77520 target sequence(s)
[M::main::4191.338*42.27] running winnowmap in SV-aware mode
[M::main::4191.338*42.27] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::4191.338*42.27] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 77520
[M::mm_idx_stat::4191.359*42.27] distinct minimizers: 1342591 (8.29% are singletons); average occurrences: 117.877; average spacing: 25.547
[M::worker_pipeline::6724.057*43.63] mapped 14398 sequences
[M::worker_pipeline::8001.654*43.99] mapped 7820 sequences
[M::mm_idx_gen::8001.919*43.98] reading downweighted kmers
[M::mm_idx_gen::8001.920*43.98] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::8001.920*43.98] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::8150.407*43.20] collected minimizers
[M::mm_idx_gen::8151.428*43.20] sorted minimizers
[M::main::8151.429*43.20] loaded/built the index for 71650 target sequence(s)
[M::main::8151.437*43.20] running winnowmap in SV-aware mode
[M::main::8151.437*43.20] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::8151.438*43.20] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 71650
[M::mm_idx_stat::8151.461*43.20] distinct minimizers: 1356854 (9.64% are singletons); average occurrences: 115.482; average spacing: 25.544
[M::worker_pipeline::10673.517*43.78] mapped 14398 sequences
[M::worker_pipeline::11928.832*43.95] mapped 7820 sequences
[M::mm_idx_gen::11929.426*43.95] reading downweighted kmers
[M::mm_idx_gen::11929.427*43.95] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::11929.427*43.95] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::12078.846*43.42] collected minimizers
[M::mm_idx_gen::12079.831*43.41] sorted minimizers
[M::main::12079.832*43.41] loaded/built the index for 47527 target sequence(s)
[M::main::12079.833*43.41] running winnowmap in SV-aware mode
[M::main::12079.833*43.41] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::12079.834*43.41] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 47527
[M::mm_idx_stat::12079.857*43.41] distinct minimizers: 1367952 (11.23% are singletons); average occurrences: 114.744; average spacing: 25.543
[M::worker_pipeline::14352.954*43.77] mapped 14398 sequences
[M::worker_pipeline::15527.381*43.87] mapped 7820 sequences
[M::mm_idx_gen::15527.867*43.87] reading downweighted kmers
[M::mm_idx_gen::15527.868*43.87] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::15527.868*43.87] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::15676.504*43.47] collected minimizers
[M::mm_idx_gen::15677.467*43.47] sorted minimizers
[M::main::15677.468*43.47] loaded/built the index for 18828 target sequence(s)
[M::main::15677.468*43.47] running winnowmap in SV-aware mode
[M::main::15677.469*43.47] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::15677.470*43.47] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 18828
[M::mm_idx_stat::15677.493*43.47] distinct minimizers: 1390598 (12.32% are singletons); average occurrences: 113.216; average spacing: 25.535
[M::worker_pipeline::18148.887*43.76] mapped 14398 sequences
[M::worker_pipeline::19442.939*43.85] mapped 7820 sequences
[M::mm_idx_gen::19443.521*43.85] reading downweighted kmers
[M::mm_idx_gen::19443.522*43.85] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::19443.522*43.85] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::19591.711*43.53] collected minimizers
[M::mm_idx_gen::19592.845*43.52] sorted minimizers
[M::main::19592.845*43.52] loaded/built the index for 16971 target sequence(s)
[M::main::19592.846*43.52] running winnowmap in SV-aware mode
[M::main::19592.846*43.52] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::19592.847*43.52] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 16971
[M::mm_idx_stat::19592.869*43.52] distinct minimizers: 1382165 (12.22% are singletons); average occurrences: 114.142; average spacing: 25.559
[M::worker_pipeline::22339.824*43.74] mapped 14398 sequences
[M::worker_pipeline::23730.660*43.82] mapped 7820 sequences
[M::mm_idx_gen::23731.160*43.82] reading downweighted kmers
[M::mm_idx_gen::23731.162*43.82] collected downweighted kmers, no. of kmers read=0
[M::mm_idx_gen::23731.162*43.82] saved the kmers in a bloom filter: hash functions=2 and size=14384 
[M::mm_idx_gen::23881.757*43.55] collected minimizers
[M::mm_idx_gen::23882.684*43.55] sorted minimizers
[M::main::23882.685*43.55] loaded/built the index for 42938 target sequence(s)
[M::main::23882.685*43.55] running winnowmap in SV-aware mode
[M::main::23882.686*43.55] stage1-specific parameters minP:2000, incP:2.83, maxP:16000, sample:2000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::23882.686*43.55] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 42938
[M::mm_idx_stat::23882.709*43.55] distinct minimizers: 1360005 (11.51% are singletons); average occurrences: 116.261; average spacing: 25.581

So the confusing things relative to that thread are:

  1. The -I flag does indeed work because all target sequences are loaded at once
  2. Multi part indexing also seems to work (?) or at least the alignment is split into chunks of target sequences
skoren commented 1 month ago

I confirmed that adjusting the -I makes a huge performance difference (something like 10x speedup) on genomes > 4gb. I do see slight differences in the alignments but mostly it seems that duplicates are removed that were present before. It seems the -I flag was removed post the 2.0.3 release so I'm going to confirm that it is expected to work OK without a saved index and then update verkko.

diego-rt commented 1 month ago

Wow, that's bizarre. And yeah, not only is it faster but without it, all the mapQs and supplementary/secondary flags will be totally wrong when aligning to genomes >4G.

skoren commented 1 month ago

I've done tests on a couple of genomes and it not only maps much faster but also improves the assembly, thanks for pointing this out.