Closed jelber2 closed 4 years ago
Hi,
The asm20
option is not configured with winnowmap as of yet (work in progress).
Can you try -ax map-pb
option?
Yes, I tried the -ax map-pb
option, and it is only about twice as long run time
# winnowmap with -ax asm20 setting
[M::mm_idx_gen::0.000*35196.57] reading downweighted kmers
[M::mm_idx_gen::0.011*460.17] collected downweighted kmers, no. of kmers read=2052
[M::mm_idx_gen::0.011*459.31] saved the kmers in a bloom filter: hash functions=13 and size=39344
[M::mm_idx_gen::38.766*1.34] collected minimizers
[M::mm_idx_gen::41.233*1.80] sorted minimizers
[M::main::41.243*1.80] loaded/built the index for 1100 target sequence(s)
[M::mm_mapopt_update::41.243*1.80] mid_occ = 2147483647
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 1100
[M::mm_idx_stat::42.744*1.78] distinct minimizers: 26613163 (71.74% are singletons); average occurrences: 1.956; average spacing: 8.482
[M::worker_pipeline::2989.633*39.61] mapped 41606 sequences
[M::worker_pipeline::5960.186*38.36] mapped 41608 sequences
[M::worker_pipeline::8919.402*38.85] mapped 41605 sequences
[M::worker_pipeline::12131.488*37.96] mapped 41594 sequences
[M::worker_pipeline::15060.295*38.38] mapped 41595 sequences
[M::worker_pipeline::18182.216*37.93] mapped 41612 sequences
[M::worker_pipeline::20653.108*38.66] mapped 41612 sequences
[M::worker_pipeline::23789.089*38.45] mapped 41601 sequences
[M::worker_pipeline::26951.418*38.49] mapped 41605 sequences
[M::worker_pipeline::29914.825*38.49] mapped 41615 sequences
[M::worker_pipeline::32652.492*38.75] mapped 41612 sequences
[M::worker_pipeline::35701.262*38.96] mapped 41603 sequences
[M::worker_pipeline::38502.260*39.01] mapped 41601 sequences
[M::worker_pipeline::41522.480*38.90] mapped 41598 sequences
[M::worker_pipeline::44345.592*38.75] mapped 41611 sequences
[M::worker_pipeline::47242.143*38.83] mapped 41604 sequences
[M::worker_pipeline::49809.499*39.17] mapped 41589 sequences
[M::worker_pipeline::52915.746*39.08] mapped 41618 sequences
[M::worker_pipeline::56579.053*38.57] mapped 41618 sequences
[M::worker_pipeline::59157.913*38.81] mapped 41602 sequences
[M::worker_pipeline::61374.922*39.32] mapped 41601 sequences
[M::worker_pipeline::64334.709*39.29] mapped 41624 sequences
[M::worker_pipeline::67499.641*39.17] mapped 41599 sequences
[M::worker_pipeline::70380.777*39.15] mapped 41604 sequences
[M::worker_pipeline::70488.787*39.11] mapped 1463 sequences
[M::main] Version: 1.0
[M::main] CMD: /genetics/elbers/Winnowmap/winnowmap -t 75 -H -W bad_Hk19_mers_peregrine.fasta.txt -ax asm20 peregrine.fasta random_reads_ccs.fastq.gz
[M::main] Real time: 70489.272 sec; CPU: 2757145.812 sec; Peak RSS: 310.024 GB
real 1174m55.381s
user 45012m15.796s
sys 940m10.956s
# minimap2 with -ax map-pb preset instead of -ax asm20
[M::mm_idx_gen::15.607*1.56] collected minimizers
[M::mm_idx_gen::18.018*2.64] sorted minimizers
[M::main::18.019*2.64] loaded/built the index for 1100 target sequence(s)
[M::mm_mapopt_update::18.952*2.56] mid_occ = 175
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 1100
[M::mm_idx_stat::19.615*2.51] distinct minimizers: 26605237 (71.77% are singletons); average occurrences: 1.995; average spacing: 8.319
[M::worker_pipeline::46.379*16.42] mapped 41606 sequences
[M::worker_pipeline::67.743*21.20] mapped 41608 sequences
[M::worker_pipeline::87.438*24.76] mapped 41605 sequences
[M::worker_pipeline::104.838*28.35] mapped 41594 sequences
[M::worker_pipeline::120.227*30.26] mapped 41595 sequences
[M::worker_pipeline::137.584*32.58] mapped 41612 sequences
[M::worker_pipeline::151.957*35.20] mapped 41612 sequences
[M::worker_pipeline::163.811*36.32] mapped 41601 sequences
[M::worker_pipeline::180.690*36.94] mapped 41605 sequences
[M::worker_pipeline::198.201*38.00] mapped 41615 sequences
[M::worker_pipeline::211.680*38.99] mapped 41612 sequences
[M::worker_pipeline::226.701*40.66] mapped 41603 sequences
[M::worker_pipeline::238.456*40.91] mapped 41601 sequences
[M::worker_pipeline::257.109*42.73] mapped 41598 sequences
[M::worker_pipeline::265.445*42.56] mapped 41611 sequences
[M::worker_pipeline::284.811*43.21] mapped 41604 sequences
[M::worker_pipeline::296.548*43.28] mapped 41589 sequences
[M::worker_pipeline::319.236*44.32] mapped 41618 sequences
[M::worker_pipeline::326.389*43.94] mapped 41618 sequences
[M::worker_pipeline::348.654*44.57] mapped 41602 sequences
[M::worker_pipeline::357.071*44.47] mapped 41601 sequences
[M::worker_pipeline::378.184*45.38] mapped 41624 sequences
[M::worker_pipeline::385.524*45.24] mapped 41599 sequences
[M::worker_pipeline::406.795*44.81] mapped 41604 sequences
[M::worker_pipeline::406.952*44.80] mapped 1463 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -t 75 -H -ax map-pb peregrine.fasta random_reads_ccs.fastq.gz
[M::main] Real time: 407.409 sec; CPU: 18230.212 sec; Peak RSS: 8.474 GB
real 6m48.695s
user 295m54.068s
sys 7m56.748s
# winnowmap with -ax map-pb preset instead of -ax asm20
[M::mm_idx_gen::0.000*44845.67] reading downweighted kmers
[M::mm_idx_gen::0.002*2623.30] collected downweighted kmers, no. of kmers read=2052
[M::mm_idx_gen::0.002*2608.78] saved the kmers in a bloom filter: hash functions=13 and size=39344
[M::mm_idx_gen::35.847*1.29] collected minimizers
[M::mm_idx_gen::36.419*1.36] sorted minimizers
[M::main::36.419*1.36] loaded/built the index for 1100 target sequence(s)
[M::mm_mapopt_update::36.419*1.36] mid_occ = 2147483647
[M::mm_idx_stat] kmer size: 19; skip: 50; is_hpc: 1; #seq: 1100
[M::mm_idx_stat::36.532*1.36] distinct minimizers: 5987044 (72.90% are singletons); average occurrences: 1.882; average spacing: 39.197
[M::worker_pipeline::90.264*22.64] mapped 41606 sequences
[M::worker_pipeline::118.692*28.46] mapped 41608 sequences
[M::worker_pipeline::146.177*32.43] mapped 41605 sequences
[M::worker_pipeline::169.797*35.69] mapped 41594 sequences
[M::worker_pipeline::201.795*36.95] mapped 41595 sequences
[M::worker_pipeline::235.177*39.00] mapped 41612 sequences
[M::worker_pipeline::255.899*39.42] mapped 41612 sequences
[M::worker_pipeline::282.837*40.66] mapped 41601 sequences
[M::worker_pipeline::320.715*41.04] mapped 41605 sequences
[M::worker_pipeline::352.374*41.33] mapped 41615 sequences
[M::worker_pipeline::398.158*41.56] mapped 41612 sequences
[M::worker_pipeline::419.895*41.48] mapped 41603 sequences
[M::worker_pipeline::442.267*41.81] mapped 41601 sequences
[M::worker_pipeline::468.907*42.44] mapped 41598 sequences
[M::worker_pipeline::504.165*41.78] mapped 41611 sequences
[M::worker_pipeline::530.671*42.27] mapped 41604 sequences
[M::worker_pipeline::553.943*43.12] mapped 41589 sequences
[M::worker_pipeline::583.634*43.36] mapped 41618 sequences
[M::worker_pipeline::616.610*43.01] mapped 41618 sequences
[M::worker_pipeline::642.092*43.44] mapped 41602 sequences
[M::worker_pipeline::665.499*44.15] mapped 41601 sequences
[M::worker_pipeline::692.943*44.35] mapped 41624 sequences
[M::worker_pipeline::718.479*44.64] mapped 41599 sequences
[M::worker_pipeline::745.373*44.27] mapped 41604 sequences
[M::worker_pipeline::746.044*44.23] mapped 1463 sequences
[M::main] Version: 1.0
[M::main] CMD: /genetics/elbers/Winnowmap/winnowmap -t 75 -H -W bad_Hk19_mers_peregrine.fasta.txt -ax map-pb peregrine.fasta random_reads_ccs.fastq.gz
[M::main] Real time: 746.350 sec; CPU: 33001.520 sec; Peak RSS: 15.715 GB
real 12m33.958s
user 538m8.556s
sys 11m53.684s
I'll run QUAST-LG and post the differences in another post once that finishes running.
# minimap2 2.17-r941
# winnowmap 1.0 (Github commit d547331 or ee6bca7 - indicated below)
# samtools 1.9
# racon 1.4.11 (Github commit 6ca733a)
# quast 5.0.2 (conda install [Name:quast Version:5.0.2 Build:py36pl526ha92aebf_0 Channel:bioconda])
time minimap2 -t 75 -H -ax asm20 peregrine.fasta random_reads_ccs.fastq.gz > random_reads_ccs.fastq-against-peregrine.fasta.sam
samtools view -@75 -F 1796 -q 20 random_reads_ccs.fastq-against-peregrine.fasta.sam > random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam
racon -u -t 75 random_reads_ccs.fastq.gz random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam peregrine.fasta 2> peregrine.fasta_racon_polished.fasta.log > peregrine.fasta_racon_polished_minimap2_asm20.alignments.fasta
time /genetics/elbers/Winnowmap/winnowmap -t 75 -H -W bad_Hk19_mers_peregrine.fasta.txt -ax asm20 peregrine.fasta random_reads_ccs.fastq.gz > random_reads_ccs.fastq-against-peregrine.fasta.sam
samtools view -@75 -F 1796 -q 20 random_reads_ccs.fastq-against-peregrine.fasta.sam > random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam
racon -u -t 75 random_reads_ccs.fastq.gz random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam peregrine.fasta 2> peregrine.fasta_racon_polished.fasta.log > peregrine.fasta_racon_polished_winnowmap_asm20.alignments.fasta
time minimap2 -t 75 -H -ax map-pb peregrine.fasta random_reads_ccs.fastq.gz > random_reads_ccs.fastq-against-peregrine.fasta.sam
samtools view -@75 -F 1796 -q 20 random_reads_ccs.fastq-against-peregrine.fasta.sam > random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam
racon -u -t 75 random_reads_ccs.fastq.gz random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam peregrine.fasta 2> peregrine.fasta_racon_polished.fasta.log > peregrine.fasta_racon_polished_minimap2_map-pb.alignments.fasta
time /genetics/elbers/Winnowmap/winnowmap -t 75 -H -W bad_Hk19_mers_peregrine.fasta.txt -ax map-pb peregrine.fasta random_reads_ccs.fastq.gz > random_reads_ccs.fastq-against-peregrine.fasta.sam
samtools view -@75 -F 1796 -q 20 random_reads_ccs.fastq-against-peregrine.fasta.sam > random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam
racon -u -t 75 random_reads_ccs.fastq.gz random_reads_ccs.fastq-against-peregrine.fasta.filtered.sam peregrine.fasta 2> peregrine.fasta_racon_polished.fasta.log > peregrine.fasta_racon_polished_winnowmap_map-pb.alignments.fasta
quast.py --fragmented --fast --large --threads 75 -r curated.fasta --eukaryote -o ./quast_ccs peregrine.fasta peregrine.fasta_racon_polished_winnowmap_*.alignments.fasta peregrine.fasta_racon_polished_minimap2_*.alignments.fasta > ./quast_ccs.log 2>&1
#QUAST-LG results
Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) # contigs (>= 5000 bp) # contigs (>= 10000 bp) # contigs (>= 25000 bp) # contigs (>= 50000 bp) Total length (>= 0 bp) Total length (>= 1000 bp) Total length (>= 5000 bp) Total length (>= 10000 bp) Total length (>= 25000 bp) Total length (>= 50000 bp) # contigs Largest contig Total length Reference length Reference GC (%) N50 NG50 N75 NG75 L50 LG50 L75 LG75 # misassemblies # misassembled contigs Misassembled contigs length # local misassemblies # scaffold gap ext. mis. # scaffold gap loc. mis. # possible TEs # unaligned mis. contigs # unaligned contigs Unaligned length Genome fraction (%) Duplication ratio # N's per 100 kbp # mismatches per 100 kbp # indels per 100 kbp Largest alignment Total aligned length NA50 NGA50 NA75 NGA75 LA50 LGA50 LA75 LGA75
peregrine.fasta 1100 1100 1100 1100 970 839 441579675 441579675 441579675 441579675 439306196 434374226 1100 5873009 441579675 445568504 27.74 822917 817504 443020 432130 145 148 324 331 206 164 114872736 104 0 0 0 4 3 + 98 part 284887 98.681 1.004 0.00 8.30 8.60 5462309 440705701 741321 737582 383244 379607 166 169 367 375
peregrine.fasta_racon_polished_minimap2_asm20.alignments.fasta 1100 1100 1100 1100 971 839 441602853 441602853 441602853 441602853 439352792 434395087 1100 5873659 441602853 445568504 27.74 822925 817547 443009 432131 145 148 324 331 205 163 117139196 73 0 0 0 2 2 + 41 part 135249 98.727 1.004 0.00 3.52 5.97 5462789 440969958 740285 734066 383626 379627 167 169 368 376
peregrine.fasta_racon_polished_winnowmap_asm20.alignments.fasta 1100 1100 1100 1100 971 839 441684119 441684119 441684119 441684119 439428392 434468379 1100 5873685 441684119 445568504 27.74 822925 817549 443009 432346 145 148 324 331 201 160 113316841 100 0 0 2 5 4 + 59 part 277256 98.714 1.004 0.00 4.06 6.50 5463573 440896765 740285 734066 383626 379641 167 169 368 376
peregrine.fasta_racon_polished_winnowmap_ee6bca7_asm20.alignments 1100 1100 1100 1100 971 839 441602294 441602294 441602294 441602294 439351947 434393955 1100 5873683 441602294 445568504 27.74 822925 817547 442995 432131 145 148 324 331 203 161 113250940 65 0 0 0 6 2 + 40 part 186994 98.722 1.004 0.00 3.46 5.80 5462763 440925028 740288 734066 383633 379630 167 169 368 376
peregrine.fasta_racon_polished_minimap2_map_pb.alignments.fasta 1100 1100 1100 1100 971 839 441604293 441604293 441604293 441604293 439354082 434396590 1100 5873655 441604293 445568504 27.74 822925 817546 442986 432137 145 148 324 331 204 162 116822205 82 0 0 0 3 2 + 44 part 158859 98.722 1.004 0.00 3.57 6.02 5462788 440923550 740293 734066 383625 379628 167 169 368 376
peregrine.fasta_racon_polished_winnowmap_map_pb.alignments.fasta 1100 1100 1100 1100 971 839 441606529 441606529 441606529 441606529 439355783 434397824 1100 5873669 441606529 445568504 27.74 822925 817546 442986 432137 145 148 324 331 202 161 114327891 80 0 0 0 8 3 + 48 part 243962 98.720 1.003 0.00 3.58 5.84 5462776 440852929 740296 734066 383625 379633 167 169 368 376
peregrine.fasta_racon_polished_winnowmap_ee6bca7_map_pb.alignments 1100 1100 1100 1100 971 839 441606475 441606475 441606475 441606475 439354893 434396979 1100 5873665 441606475 445568504 27.74 822925 817546 442986 432137 145 148 324 331 201 159 113224872 74 0 0 0 5 4 + 47 part 226386 98.719 1.003 0.00 3.53 5.83 5462623 440875557 740296 734066 383625 379633 167 169 368 376
Okay, so the results are actually quite similar. Regardless of the mapper (minimap2
or winnowmap
) or preset (asm20
or map-pb
), each combination of mapper and preset seems to have its pros and cons. It will be interesting to see the results again once the asm20
setting for winnowmap
is configured properly.
EDIT: So with ee6bca7 (table updated above), winnowmap with asm20
option seems to be better in terms of indels and mismatches per 100 kbp.
Thanks for sharing these results.. I expect Winnowmap to deliver better mappings if there are long repetitive regions or low-complexity regions in the reference. Otherwise, both should give similar results.
You are welcome. I get similar results in terms of "quality values" (#mismatches per 100 kbp converted to Phred scores) when I use yak with simulated short-reads on the original reference as the k-mer database against the different assemblies I used with the QUAST-LG
analysis.
Have the asm20
asm10
or asm5
settings been enabled/optimized with the new commits?
I've made a few changes, and enabled settings for asm* presets.
But for CCS reads, I recommend you to stick with map-pb
preset. It uses homopolymer compressed k-mers (k=19). See PacBio / HiFi settings listed in the updated README file.
Please give it a try at your end. Again, I don't expect performance diffs unless the ref has long complex repeats.
I updated https://github.com/marbl/Winnowmap/issues/1#issuecomment-594388270 to include updated results from commit ee6bca7
Results from yak when I use yak
with simulated short-reads from the original reference using to simulate the CCS/HiFi reads
/genetics/elbers/yak/yak count -b37 -t75 -o sr.yak \
<(bgzip -@75 -cd small_insert_trim1.fq.gz) \
<(bgzip -@75 -cd small_insert_trim2.fq.gz) > yak.log 2>&1 &
for ref in `ls -1 *.fasta.gz`;do
/genetics/elbers/yak/yak qv -t75 sr.yak ${ref} > ${ref}-sr.qv.txt &
done
## explanation of values shown in tail command below
==> assembly-sr.qv.txt <==
QV raw_quality_value adjusted_quality_value
tail -n 1 *-sr.qv.txt
==> peregrine.fasta.gz-sr.qv.txt <==
QV 43.759 43.806
==> peregrine.fasta_racon_polished_minimap2_asm20.alignments.fasta.gz-sr.qv.txt <==
QV 43.901 43.936
==> peregrine.fasta_racon_polished_minimap2_map-pb.alignments.fasta.gz-sr.qv.txt <==
QV 43.812 43.848
==> peregrine.fasta_racon_polished_winnowmap_asm20.alignments.fasta.gz-sr.qv.txt <==
QV 43.497 43.543
==> peregrine.fasta_racon_polished_winnowmap-ee6bca7_asm20.alignments.fasta.gz-sr.qv.txt <==
QV 43.827 43.867
==> peregrine.fasta_racon_polished_winnowmap-ee6bca7_map-pb.alignments.fasta.gz-sr.qv.txt <==
QV 43.698 43.740
==> peregrine.fasta_racon_polished_winnowmap_map-pb.alignments.fasta.gz-sr.qv.txt <==
QV 43.701 43.747
Remember that the QUAST
error rates are based on alignment whilst the yak
method is alignment-free/k-mer based.
I must say BRAVO!!!
Hi,
I am using winnowmap commit d547331 and it seems like it is taking a very long time (my guess will be about 24 hours) and a lot of RAM (peak 265 GB) for mapping 1000000 simulated PacBio HiFi reads compared to minimap2-2.17 which took about 7 minutes and maybe ~9 GB RAM.
The assembly
peregrine.fasta
is about 450 Mbp, and I was trying to compare the results of one round of Racon polishing with alignments made byminimap2
andwinnowmap
, butwinnowmap
seems to be taking quite a long time.Is the much longer run time expected (obviously, run times would depend on the k-mers, assembly, and reads being used)?