twolinin / longphase

GNU General Public License v3.0

Enhance haplotagging with region option in v1.7 #68

Closed zhengzhenxian closed 2 months ago

zhengzhenxian commented 2 months ago

Hi team,

After the LongPhase haplotag v1.7 update discussed in an earlier issue, we tried to speed up haplotagging by using the "--regions" option to specify contig names, so that the BAM file can be processed in parallel. However, we observed that even with a region specified, the output is still the entire BAM file rather than a chromosome-level haplotagged BAM. As a result, the reads outside the region, which remain unhaplotagged, consume a large amount of disk space. It would be greatly appreciated if you could consider adding a feature to output a smaller BAM file.

Our workflow is listed here; please let me know if I have misunderstood the option. Thanks!

sloth-eat-pudding commented 2 months ago

I used HCC1395.bam (75x, 223G) and ran the workflow with LongPhase in conda, Clair3 version 1.0.7, and ClairS commit f2606f66. The haplotag output file tumor_chr20.bam is 6.2G. Could you provide your output log or data?

> samtools idxstats ONT_HCC1395/alignment-sort-hcc1395.bam

chr1    248956422       2384760 0
chr2    242193529       1537375 0
chr3    198295559       1386062 0
chr4    190214555       1449514 0
chr5    181538259       1171618 0
chr6    170805979       1121054 0
chr7    159345973       1720479 0
chr8    145138636       1113230 0
chr9    138394717       1025072 0
chr10   133797422       955545  0
chr11   135086622       798248  0
chr12   133275309       816144  0
chr13   114364328       646980  0
chr14   107043718       733506  0
chr15   101991189       675494  0
chr16   90338345        959480  0
chr17   83257441        647000  0
chr18   80373285        744354  0
chr19   58617616        379711  0
chr20   64444167        710333  0
chr21   46709983        410616  0
chr22   50818468        482982  0
chrX    156040895       620177  0
chrY    57227415        77634   0
chrM    16569   9241    0
... (remaining contigs omitted)
chrEBV  171823  28      0
*       0       0       1037967
> samtools idxstats clairs-output/tmp/clair3_output/phased_output/tumor_chr20.bam

chr1    248956422       0       0
chr2    242193529       0       0
chr3    198295559       0       0
chr4    190214555       0       0
chr5    181538259       0       0
chr6    170805979       0       0
chr7    159345973       0       0
chr8    145138636       0       0
chr9    138394717       0       0
chr10   133797422       0       0
chr11   135086622       0       0
chr12   133275309       0       0
chr13   114364328       0       0
chr14   107043718       0       0
chr15   101991189       0       0
chr16   90338345        0       0
chr17   83257441        0       0
chr18   80373285        0       0
chr19   58617616        0       0
chr20   64444167        710333  0
chr21   46709983        0       0
chr22   50818468        0       0
chrX    156040895       0       0
chrY    57227415        0       0
chrM    16569   0       0
... (remaining contigs omitted)
chrEBV  171823  0       0
*       0       0       0
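To confirm that a region-restricted output holds reads only on the requested contig, you can scan the `samtools idxstats` table for contigs with a non-zero mapped count. Below is a minimal sketch, assuming the idxstats text has already been captured into a string; the function name `contigs_with_reads` is hypothetical and not part of samtools or LongPhase.

```python
def contigs_with_reads(idxstats_text):
    """Return {contig: mapped_count} for contigs with at least one mapped read.

    Expects the four whitespace-separated columns that `samtools idxstats`
    prints: name, sequence length, mapped count, unmapped count.
    """
    counts = {}
    for line in idxstats_text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a four-column record.
        if len(fields) != 4 or not fields[2].isdigit():
            continue
        name, mapped = fields[0], int(fields[2])
        if mapped > 0:
            counts[name] = mapped
    return counts


# A few rows copied from the tumor_chr20.bam table above.
sample = (
    "chr19\t58617616\t0\t0\n"
    "chr20\t64444167\t710333\t0\n"
    "chr21\t46709983\t0\t0\n"
    "*\t0\t0\t0\n"
)
print(contigs_with_reads(sample))  # {'chr20': 710333}
```

For the table above, only chr20 is reported, which matches the 6.2G output size.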
> longphase output log

phased SNP file:   test-ont-clairs-output/tmp/clair3_output/phased_output/tumor_phased_chr20.vcf.gz
phased SV file:    
phased MOD file:   
input bam file:    ONT_HCC1395/alignment-sort-hcc1395.bam
input ref file:    GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
output bam file:   test-ont-clairs-output/tmp/clair3_output/phased_output/tumor_chr20.bam
number of threads: 100
write log file:    false
log file:          
-------------------------------------------
tag region:                    chr20
filter mapping quality below:  1
percentage threshold:          0.6
tag supplementary:             false
-------------------------------------------
parsing SNP VCF ... 0s
tag read start ...
chr: chr20 ... 97s
tag read 100s
-------------------------------------------
total process time:  100s
total alignment:     710333
total supplementary: 26197
total secondary:     0
total unmapped:      0
total tag alignment: 332186
total untagged:      378147
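The counts in this log can be cross-checked with simple arithmetic. The sketch below uses only the numbers printed above; that tagged plus untagged equals the total is an observation about this particular log, not a documented guarantee of LongPhase.

```python
# Values copied from the longphase log above.
total_alignment = 710333
tagged = 332186
untagged = 378147

# In this log, tagged and untagged alignments add up to the total,
# which also equals the chr20 mapped count reported by samtools idxstats.
assert tagged + untagged == total_alignment

tagged_fraction = tagged / total_alignment
print(f"tagged fraction: {tagged_fraction:.1%}")
```

About 46.8% of the chr20 alignments received a haplotype tag in this run.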
zhengzhenxian commented 2 months ago

@sloth-eat-pudding

Thanks for the quick reply. Sorry, I used an outdated workflow for the evaluation. I tested with the latest code and the function works properly.