twolinin / longphase

LongPhase 1.6 Release Notes #46

twolinin commented 9 months ago

Summary

  1. Implement chromosome-level parallelization for the modcall and phase commands. Overall execution time is reduced by 71% to 88%.
  2. Replace malloc with jemalloc.
  3. Remove and simplify unused parameters to improve memory usage.
  4. Adjust the weighting of low-quality variants in phasing.
  5. The VCF generated by modcall can be directly imported into IGV. Additionally, modcall can output all detected coordinates by using the --all parameter.

phase (-t 24)           v1.5.2 (Time)  v1.5.2 (Memory)  v1.6 (Time)  v1.6 (Memory)
HG002 ONT R10.4.1 10x   153s           7.7G             39s          15.1G
HG002 ONT R10.4.1 20x   444s           8.2G             53s          15.6G
HG002 ONT R10.4.1 30x   355s           8.5G             68s          24.4G
HG002 ONT R10.4.1 40x   908s           8.8G             217s         26.6G
HG002 ONT R10.4.1 50x   1043s          9.2G             262s         22.2G
HG002 ONT R10.4.1 60x   640s           9.5G             113s         33.4G

modcall (-t 24)         v1.5.2 (Time)  v1.5.2 (Memory)  v1.6 (Time)  v1.6 (Memory)
HG002 ONT R10.4.1 10x   322s           11.0G            93s          22.2G
HG002 ONT R10.4.1 20x   635s           14.6G            199s         31.6G
HG002 ONT R10.4.1 30x   746s           18.2G            125s         48.1G
HG002 ONT R10.4.1 40x   1308s          21.5G            292s         55.8G
HG002 ONT R10.4.1 50x   1570s          25.0G            317s         68.8G
HG002 ONT R10.4.1 60x   1454s          28.4G            248s         84.0G
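
For readers who want to relate the timings above to the quoted reduction range, the arithmetic can be sketched as follows. The helper name reductionPercent is hypothetical and is not part of LongPhase; it simply divides the time saved by the original time.

```cpp
#include <cassert>

// Percentage by which runtime dropped going from beforeSec to afterSec.
// Hypothetical helper for illustration only; not part of LongPhase.
double reductionPercent(double beforeSec, double afterSec) {
    assert(beforeSec > 0.0);
    return (beforeSec - afterSec) / beforeSec * 100.0;
}
```

With the phase 20x row (444s to 53s) this gives roughly 88%, and with the modcall 10x row (322s to 93s) roughly 71%.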

Changes

  1. Makefile Adjustments

    • Added -fopenmp flag in CPPFLAGS to enable OpenMP support, which allows for efficient multi-threading in the C++ components.
  2. Modifications in ParsingBam.cpp, ParsingBam.h

    • Introduced an additional parameter int &numThreads in the function direct_detect_alleles. This change allows for dynamic allocation of threads based on the processing requirements, improving the handling of multi-threaded operations.
  3. Updates in Phasing.cpp

    • Changed the default value of the threads argument to 0, which means that by default the program uses all available threads.
  4. Major Refactoring in PhasingProcess.cpp

    • Implemented a new function setNumThreads for intelligent distribution of threads between chromosome processing and BAM parsing, enhancing parallel processing efficiency.
    • Established a ChrPhasingResult map to handle phasing results in a thread-safe manner.
    • Merged individual chromosome phasing results into a single mergedPhasingResult, streamlining the result aggregation process.
  5. Adjustments to Modcall Output

    • Added the --all parameter to output all detected modifications in reads. Defaults to false.
    • For homozygous variants, only the MD, UD, and DP counts are recorded; the names of the reads covering the variant are not.
    • High-confidence heterozygous modifications will be recorded as PASS in the FILTER field.
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    11027   .       C       N       .       .       RS=P;   GT:MD:UD:DP     1/1:5:5:16
chr1    11028   .       G       N       .       .       RS=N;   GT:MD:UD:DP     0/0:4:8:12
chr1    11083   .       G       N       .       .       RS=N;   GT:MD:UD:DP     0/1:4:6:12
chr1    11434   .       C       N       .       PASS    RS=P;MR=eb459876-8c81-4714-a496-a90ea8be94d2,6ca3a71f-62fd-416e-8c6e-8c4a9c054e1a,0d8b7c68-d98d-4045-a572-82fedac62da5,71b5dfe9-7cd1-4959-b634-b9d162468edb,5db6fcd3-5780-494b-bfdd-4d9d7282d012,6e205f04-560a-4975-932d-dbd60ead695d,ae2238f8-b622-4cb5-8f02-4c3f54ab8ca3,d7ef87d0-bffe-404d-9f10-62eceb0c5121;NR=8249c7c7-fd04-4a4c-985d-2bbcb2030bc4,e3b03dc0-8399-4e1e-bd0d-e5e4e3e2911a,b8dc7af8-9f78-4dac-b8cc-35e157c51621,0b2638c1-8380-48b4-b08c-a6161495ad9d,7b1e5c16-f0a6-47f6-9726-247c762a10ca,ac7ae685-082e-413e-b5a6-b4c12d49a1c2,c0e0c526-e193-4ee2-81d6-1fbe0c970dc1;  GT:MD:UD:DP     0/1:8:7:16
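
The chromosome-level parallelization described in the changes above (a thread split, a per-chromosome result map, and a final merge) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not LongPhase's code: ThreadPlan, planThreads, phaseChromosome, and phaseAll are hypothetical names, and the split policy shown is only one plausible choice, since the notes do not spell out how setNumThreads divides threads. Build with -fopenmp to run the loop in parallel; without that flag the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical split of a thread budget (an assumption, not LongPhase's policy).
struct ThreadPlan {
    int chrThreads;  // chromosomes processed concurrently
    int bamThreads;  // threads left for each chromosome's BAM parsing
};

ThreadPlan planThreads(int totalThreads, int numChromosomes) {
    assert(totalThreads >= 1 && numChromosomes >= 1);
    ThreadPlan plan;
    plan.chrThreads = std::min(totalThreads, numChromosomes);
    plan.bamThreads = std::max(1, totalThreads / plan.chrThreads);
    return plan;
}

// Stand-in for the real per-chromosome work; returns a placeholder string.
std::string phaseChromosome(const std::string& chr) {
    return chr + ":phased";
}

// Assumes chromosome names are unique and chrs is not empty.
std::vector<std::string> phaseAll(const std::vector<std::string>& chrs,
                                  int totalThreads) {
    const int n = static_cast<int>(chrs.size());
    const ThreadPlan plan = planThreads(totalThreads, n);

    // Create every map entry up front so the parallel loop never inserts.
    std::map<std::string, std::string> chrResult;
    for (const std::string& chr : chrs) chrResult[chr];

    // Each iteration writes only to its own, already existing entry.
    #pragma omp parallel for schedule(dynamic) num_threads(plan.chrThreads)
    for (int i = 0; i < n; ++i) {
        chrResult.at(chrs[i]) = phaseChromosome(chrs[i]);
    }

    // Merge sequentially in input order so the output stays deterministic.
    std::vector<std::string> merged;
    for (const std::string& chr : chrs) merged.push_back(chrResult.at(chr));
    return merged;
}
```

Pre-creating the map entries is what keeps the loop safe here: threads only assign to distinct existing elements and never change the map's structure. Under this hypothetical policy, a budget of 24 threads over 6 chromosomes yields 6 chromosome-level threads with 4 parsing threads each.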
  1. Add a weight to edges that connect to a low-quality base

    • The weight of an edge that connects to a low-quality base (base-quality>=12) changes from 1 to 0.1. The data structures used to count reads change from int to float.
  2. MethFastaParser Utilizing New Structure:

    • Revised the storage structure of the reference FASTA to include chromosome length information, so that chromosomes are processed in numerical order (chr1, chr2, chr3) instead of lexicographical order (chr1, chr11, chr12).
    • This change not only eliminates the need to recalculate chromosome lengths but also enhances execution efficiency in a multithreaded environment.
  3. Modifications in MethBamParser:

    • Introduced an additional parameter int numThreads in the function detectMeth. This change allows for dynamic allocation of threads based on the processing requirements, improving the handling of multi-threaded operations.
  4. Thread Safety Measures:

    • Split the writeResultVCF function into two parts: exportResult and writeResultVCF.
    • exportResult: Handles the processing results for each chromosome, preparing data for VCF file writing.
    • writeResultVCF: Tasked with the actual writing of data into the VCF file, ensuring the integrity and sequentiality of output.
  5. Changes in ModCallProcess:

    • New function setModcallNumThreads: implemented to allocate threads between chromosome processing and BAM parsing tasks.
  6. Included jemalloc as a dependency in the build configuration.

  7. The phase command now displays the value of each parameter, and its output messages have been adjusted.

  8. In the phasingProcess, the storeResultPath() step has been removed, and phasing results are now directly recorded in edgeConnectResult().
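
Returning to the edge-weight change listed above (weight 1 becoming 0.1, counters moving from int to float), the following sketch shows why fractional weights require floating-point counters. It is an illustration under assumptions rather than LongPhase's implementation: the function names are hypothetical, and which bases count as low-quality is left to the caller instead of being derived from a threshold here.

```cpp
#include <cassert>

// Edge weights quoted in the release notes.
constexpr float kNormalWeight = 1.0f;
constexpr float kLowQualityWeight = 0.1f;

// Weighted support for one edge, given how many reads back it through
// normal bases and how many through low-quality bases (hypothetical helper).
float weightedSupport(int normalReads, int lowQualityReads) {
    assert(normalReads >= 0 && lowQualityReads >= 0);
    return normalReads * kNormalWeight + lowQualityReads * kLowQualityWeight;
}

// What an integer counter would keep: each 0.1 contribution truncates to 0.
int truncatedSupport(int normalReads, int lowQualityReads) {
    assert(normalReads >= 0 && lowQualityReads >= 0);
    return normalReads * static_cast<int>(kNormalWeight)
         + lowQualityReads * static_cast<int>(kLowQualityWeight);
}
```

For an edge backed by 3 normal reads and 5 low-quality reads, the float counter holds about 3.5, while an int counter would hold 3 and lose the low-quality evidence entirely.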