Implement chromosome-level parallelization for the modcall and phase commands. Overall execution time is reduced by 71% to 88%.
Replace malloc with jemalloc.
Remove and simplify unused parameters to improve memory usage.
Adjust the weighting of low-quality variants in phasing.
The VCF generated by modcall can be directly imported into IGV. Additionally, modcall can output all detected coordinates by using the --all parameter.
phase (-t 24)          v1.5.2 (Time)  v1.5.2 (Memory)  v1.6 (Time)  v1.6 (Memory)
HG002 ONT R10.4.1 10x  153s           7.7G             39s          15.1G
HG002 ONT R10.4.1 20x  444s           8.2G             53s          15.6G
HG002 ONT R10.4.1 30x  355s           8.5G             68s          24.4G
HG002 ONT R10.4.1 40x  908s           8.8G             217s         26.6G
HG002 ONT R10.4.1 50x  1043s          9.2G             262s         22.2G
HG002 ONT R10.4.1 60x  640s           9.5G             113s         33.4G
modcall (-t 24)        v1.5.2 (Time)  v1.5.2 (Memory)  v1.6 (Time)  v1.6 (Memory)
HG002 ONT R10.4.1 10x  322s           11.0G            93s          22.2G
HG002 ONT R10.4.1 20x  635s           14.6G            199s         31.6G
HG002 ONT R10.4.1 30x  746s           18.2G            125s         48.1G
HG002 ONT R10.4.1 40x  1308s          21.5G            292s         55.8G
HG002 ONT R10.4.1 50x  1570s          25.0G            317s         68.8G
HG002 ONT R10.4.1 60x  1454s          28.4G            248s         84.0G
Changes
Makefile Adjustments
Added -fopenmp flag in CPPFLAGS to enable OpenMP support, which allows for efficient multi-threading in the C++ components.
Modifications in ParsingBam.cpp, ParsingBam.h
Introduced an additional parameter int &numThreads in the function direct_detect_alleles. This change allows for dynamic allocation of threads based on the processing requirements, improving the handling of multi-threaded operations.
Updates in Phasing.cpp
Changed the default value of the threads argument to 0. With this default, the program uses all available threads.
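The "0 means all available threads" convention can be sketched as follows. This is a minimal illustration, not the code in Phasing.cpp: the helper name resolveThreadCount is hypothetical, and it assumes the hardware thread count is queried through std::thread::hardware_concurrency().

```cpp
#include <thread>

// Hypothetical helper (not taken from the PR): maps the user-supplied thread
// count to the number of threads actually used.
// A value of 0 is treated as "use every hardware thread".
unsigned resolveThreadCount(unsigned requested) {
    if (requested != 0) {
        return requested;  // an explicit -t value is honoured as given
    }
    // hardware_concurrency() may return 0 when the count cannot be determined,
    // so fall back to a single thread in that case.
    unsigned hw = std::thread::hardware_concurrency();
    return hw == 0 ? 1u : hw;
}
```

With this shape, an explicit value such as -t 24 passes through unchanged, and only the default triggers the hardware query.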
Major Refactoring in PhasingProcess.cpp
Implemented a new function setNumThreads for intelligent distribution of threads between chromosome processing and BAM parsing, enhancing parallel processing efficiency.
Established a ChrPhasingResult map to handle phasing results in a thread-safe manner.
Merged individual chromosome phasing results into a single mergedPhasingResult, streamlining the result aggregation process.
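A rough sketch of the per-chromosome result map and the final merge is shown below. The types and the functions phaseChromosome and phaseAll are hypothetical stand-ins, and chromosome names are assumed to be unique. The point is only the pattern: every map slot is created before the parallel region, each iteration writes to its own slot, and the merge runs afterwards in a fixed order.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one chromosome's phasing output.
using ChrResult = std::vector<std::string>;

// Placeholder for the real per-chromosome phasing work.
ChrResult phaseChromosome(const std::string& chr) {
    return { chr + ":block1", chr + ":block2" };
}

std::vector<std::string> phaseAll(const std::vector<std::string>& chrs) {
    // Create one slot per chromosome up front, so that no thread ever
    // inserts into the map while other threads are using it.
    std::map<std::string, ChrResult> chrPhasingResult;
    for (const auto& chr : chrs) {
        chrPhasingResult[chr];
    }

    // Each iteration assigns only to its own pre-created slot.
    // Without -fopenmp the pragma is skipped and the loop runs serially.
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
#endif
    for (int i = 0; i < static_cast<int>(chrs.size()); ++i) {
        chrPhasingResult.at(chrs[i]) = phaseChromosome(chrs[i]);
    }

    // Merge sequentially in input order, so the output does not depend on
    // which thread finished first.
    std::vector<std::string> mergedPhasingResult;
    for (const auto& chr : chrs) {
        const ChrResult& r = chrPhasingResult.at(chr);
        mergedPhasingResult.insert(mergedPhasingResult.end(), r.begin(), r.end());
    }
    return mergedPhasingResult;
}
```

Pre-creating the slots avoids a lock around the map: lookups on an existing std::map are safe to run concurrently as long as no thread changes the map's structure.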
Adjustments to Modcall Output
Added the --all parameter to output all detected modifications in reads (default: false).
For homozygous variants, only the MD, UD, and DP counts are recorded; the names of the reads covering the variant are not.
High-confidence heterozygous modifications will be recorded as PASS in the FILTER field.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr1 11027 . C N . . RS=P; GT:MD:UD:DP 1/1:5:5:16
chr1 11028 . G N . . RS=N; GT:MD:UD:DP 0/0:4:8:12
chr1 11083 . G N . . RS=N; GT:MD:UD:DP 0/1:4:6:12
chr1 11434 . C N . PASS RS=P;MR=eb459876-8c81-4714-a496-a90ea8be94d2,6ca3a71f-62fd-416e-8c6e-8c4a9c054e1a,0d8b7c68-d98d-4045-a572-82fedac62da5,71b5dfe9-7cd1-4959-b634-b9d162468edb,5db6fcd3-5780-494b-bfdd-4d9d7282d012,6e205f04-560a-4975-932d-dbd60ead695d,ae2238f8-b622-4cb5-8f02-4c3f54ab8ca3,d7ef87d0-bffe-404d-9f10-62eceb0c5121;NR=8249c7c7-fd04-4a4c-985d-2bbcb2030bc4,e3b03dc0-8399-4e1e-bd0d-e5e4e3e2911a,b8dc7af8-9f78-4dac-b8cc-35e157c51621,0b2638c1-8380-48b4-b08c-a6161495ad9d,7b1e5c16-f0a6-47f6-9726-247c762a10ca,ac7ae685-082e-413e-b5a6-b4c12d49a1c2,c0e0c526-e193-4ee2-81d6-1fbe0c970dc1; GT:MD:UD:DP 0/1:8:7:16
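For readers consuming this output downstream, the sketch below pairs the FORMAT keys with the SAMPLE values of one record, as in the lines above. The helper names split and parseSample are hypothetical. Values are kept as strings, since the sketch makes no assumption about what each key means.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits a string on a single-character delimiter.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        out.push_back(item);
    }
    return out;
}

// Pairs each FORMAT key (e.g. "GT:MD:UD:DP") with the value at the same
// position in the SAMPLE column (e.g. "0/1:8:7:16").
std::map<std::string, std::string> parseSample(const std::string& format,
                                               const std::string& sample) {
    const std::vector<std::string> keys = split(format, ':');
    const std::vector<std::string> vals = split(sample, ':');
    std::map<std::string, std::string> fields;
    for (std::size_t i = 0; i < keys.size() && i < vals.size(); ++i) {
        fields[keys[i]] = vals[i];
    }
    return fields;
}
```

For the last record above, parseSample("GT:MD:UD:DP", "0/1:8:7:16") yields GT=0/1, MD=8, UD=7, and DP=16.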
Add a weight to edges that connect to a low-quality base
The weight of an edge that connects to a low-quality base (base-quality>=12) changes from 1 to 0.1. The data structures used to count reads change from int to float.
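The move from int to float follows directly from the fractional weight, as the sketch below illustrates. The struct and function names are hypothetical, and the low-quality decision is passed in as a flag rather than computed, so the sketch takes the threshold rule as given. Only the two weights, 1 and 0.1, come from the description above.

```cpp
// Weights taken from the description: a normal edge counts as 1,
// an edge touching a low-quality base counts as 0.1.
constexpr float kNormalEdgeWeight = 1.0f;
constexpr float kLowQualityEdgeWeight = 0.1f;

// Hypothetical edge record. The counter is a float because a read may now
// contribute a fraction of a count, which an int could not represent.
struct EdgeSupport {
    float readCount = 0.0f;
};

// Adds one read's contribution to an edge. The caller decides whether the
// read touches a low-quality base at either end of the edge.
void addRead(EdgeSupport& edge, bool touchesLowQualityBase) {
    edge.readCount += touchesLowQualityBase ? kLowQualityEdgeWeight
                                            : kNormalEdgeWeight;
}
```

With an int counter, each 0.1 contribution would truncate to 0 and low-quality reads would be ignored entirely instead of being down-weighted.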
MethFastaParser Utilizing New Structure:
Revised the storage structure of the reference FASTA to include chromosome length information, so that chromosomes are processed in numerical order (chr1, chr2, chr3) instead of lexicographical order (chr1, chr11, chr12).
This change not only eliminates the need to recalculate chromosome lengths but also enhances execution efficiency in a multithreaded environment.
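One way to obtain numerical rather than lexicographical order is a comparator over (name, length) records, sketched below. This is an assumption about how such an ordering could be built, not the structure MethFastaParser actually uses. ChrInfo, numericSuffix, and naturalLess are hypothetical names, and placing unnumbered names such as chrX after the numbered ones is a choice made for this sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: chromosome name plus its length, stored once so the
// length does not need to be recomputed later.
struct ChrInfo {
    std::string name;
    std::size_t length;
};

// Returns true and sets `num` when the name is "chr" followed only by digits.
bool numericSuffix(const std::string& name, unsigned long& num) {
    const std::string prefix = "chr";
    if (name.compare(0, prefix.size(), prefix) != 0) return false;
    const std::string rest = name.substr(prefix.size());
    if (rest.empty() || rest.size() > 9) return false;  // keeps stoul in range
    for (unsigned char c : rest) {
        if (!std::isdigit(c)) return false;
    }
    num = std::stoul(rest);
    return true;
}

// Orders chr1 < chr2 < chr11; names without a numeric suffix (chrX, chrY)
// come after all numbered chromosomes, in lexicographical order.
bool naturalLess(const ChrInfo& a, const ChrInfo& b) {
    unsigned long na = 0, nb = 0;
    const bool aNum = numericSuffix(a.name, na);
    const bool bNum = numericSuffix(b.name, nb);
    if (aNum && bNum) return na != nb ? na < nb : a.name < b.name;
    if (aNum != bNum) return aNum;
    return a.name < b.name;
}
```

Sorting a std::vector<ChrInfo> with std::sort and naturalLess then gives the processing order described above.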
Modifications in MethBamParser:
Introduced an additional parameter int numThreads in the function detectMeth. This change allows for dynamic allocation of threads based on the processing requirements, improving the handling of multi-threaded operations.
Thread Safety Measures:
Split the writeResultVCF function into two parts: exportResult and writeResultVCF.
exportResult: Handles the processing results for each chromosome, preparing data for VCF file writing.
writeResultVCF: Tasked with the actual writing of data into the VCF file, ensuring the integrity and sequentiality of output.
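The split can be pictured as a formatting step that is safe to run per chromosome and a single writer that runs afterwards. The sketch below reuses the two function names from the description, but the signatures, the ModSite record, and the reduced column set are assumptions made for illustration.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical minimal record for one detected site.
struct ModSite {
    int pos;
    char ref;
};

// Formats one chromosome's records into a private buffer. It touches no
// shared state, so it can run concurrently for different chromosomes.
std::string exportResult(const std::string& chr, const std::vector<ModSite>& sites) {
    std::ostringstream buf;
    for (const ModSite& s : sites) {
        buf << chr << '\t' << s.pos << "\t.\t" << s.ref << "\tN\n";
    }
    return buf.str();
}

// Single-threaded writer: emits the header, then the per-chromosome buffers
// in the requested order, so lines from different chromosomes never interleave.
void writeResultVCF(std::ostream& out,
                    const std::vector<std::string>& chrOrder,
                    const std::map<std::string, std::string>& buffers) {
    out << "#CHROM\tPOS\tID\tREF\tALT\n";
    for (const std::string& chr : chrOrder) {
        const auto it = buffers.find(chr);
        if (it != buffers.end()) {
            out << it->second;
        }
    }
}
```

Keeping the file handle out of the parallel stage means no lock is needed around output, and the record order is fixed by chrOrder rather than by thread timing.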
Changes in ModCallProcess:
New Function - setModcallNumThreads: Implemented to intelligently allocate threads between chromosome processing and BAM parsing tasks.
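The PR does not spell out the allocation policy, so the following is only one plausible split, under the assumption that chromosome-level workers are filled first and any remaining capacity is divided evenly among them for BAM parsing. ThreadPlan and splitThreads are hypothetical names, not the actual setModcallNumThreads implementation.

```cpp
#include <algorithm>

// Hypothetical result of the split: how many chromosomes are processed at
// once, and how many BAM-parsing threads each of those workers may use.
struct ThreadPlan {
    int chrThreads;
    int bamThreads;
};

// Assumed policy: use at most one worker per chromosome, then share the
// remaining capacity evenly among the workers (at least 1 thread each).
ThreadPlan splitThreads(int totalThreads, int numChromosomes) {
    totalThreads = std::max(1, totalThreads);
    numChromosomes = std::max(1, numChromosomes);
    ThreadPlan plan;
    plan.chrThreads = std::min(totalThreads, numChromosomes);
    plan.bamThreads = std::max(1, totalThreads / plan.chrThreads);
    return plan;
}
```

Under this policy, 24 threads over 25 chromosomes gives 24 chromosome workers with 1 BAM thread each, while 24 threads over 4 chromosomes gives 4 workers with 6 BAM threads each.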
Included jemalloc as a dependency in the build configuration.
The phase command now displays the value of each parameter, and its output messages have been adjusted.
In the phasingProcess, the storeResultPath() step has been removed, and phasing results are now directly recorded in edgeConnectResult().