vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License
207 stars 36 forks source link

Indels phasing issue #138

Open KonstantinTyurin opened 1 week ago

KonstantinTyurin commented 1 week ago

There is some doubt about the phasing quality of the indels. In the image below, three different deletions are shown, and they overlap. When running HapCUT, it appears that these variants are phased together, even though it is clearly visible here that they are on different reads. Is it supposed to be like this?

image

Block from output txt file

BLOCK: offset: 15094 len: 3 phased: 3 SPAN: 295 fragments 90 15094 0 1 chr16 28902022 TACA T 0/1:336:336:252,254:51,52:39,38:319.85:1.26:0:0 0 . 100.00 4 15095 0 1 chr16 28902316 CCG C 0/1:164:164:149,150:8,8:9,9:165.64:2.31:0:0.01 0 . 100.00 90 15096 0 1 chr16 28902317 CG C 0/1:154:154:149,150:8,8:7,7:165.12:2.35:0:0.01 0 . 100.00 90

Code Example

extractHAIRS --VCF input_vcf --bam input_bam --out fragment_file --ref ref_fasta --indels 1 HAPCUT2 --fragments fragment_file --VCF input_vcf --output output_hapcut

KonstantinTyurin commented 1 week ago

Here is also examples of blocks which have variants with the same start position, which are not supposed to be phased.

BLOCK: offset: 1468 len: 2 phased: 2 SPAN: 0 fragments 123 1468 0 1 chr1 185965968 T TA 0/1:236:236:179,184:23,23:31,30:234.56:3.36:0:0.01 0 . 100.00 123 1469 0 1 chr1 185965968 TA T 0/1:236:236:177,179:25,28:31,30:234.56:3.36:0:0.01 0 . 100.00 123

BLOCK: offset: 1520 len: 2 phased: 2 SPAN: 0 fragments 38 1520 0 1 chr1 200554390 CA C 0/1:54:54:43,43:6,6:7,7:55.88:0.7:0:0 0 . 100.00 38 1521 0 1 chr1 200554390 CAA C 0/1:54:54:44,44:7,7:5,5:55.88:0.7:0:0 0 . 100.00 38

BLOCK: offset: 12510 len: 2 phased: 2 SPAN: 0 fragments 40 12510 0 1 chr12 71774044 TA T 0/1:80:80:66,66:6,7:8,7:91.97:0.09:0:0 0 . 100.00 40 12511 0 1 chr12 71774044 T TA 0/1:80:80:65,66:7,7:8,7:91.97:0.09:0:0 0 . 100.00 40

BLOCK: offset: 15757 len: 2 phased: 2 SPAN: 0 fragments 27 15757 0 1 chr17 18308670 CTTT C 0/1:43:43:33,33:8,8:2,2:48.76:0.24:0:0 0 . 100.00 27 15758 0 1 chr17 18308670 CTT C 0/1:43:43:34,34:6,6:3,3:48.76:0.24:0:0 0 . 100.00 27 BLOCK: offset: 18311 len: 2 phased: 2 SPAN: 0 fragments 9 18311 0 1 chr19 57077958 CA C 0/1:27:27:21,21:4,4:2,2:22.85:0.42:0:0 0 . 100.00 9 18312 0 1 chr19 57077958 CAA C 0/1:27:27:21,21:4,4:2,2:22.85:0.42:0:0 0 . 100.00 9

BLOCK: offset: 2339 len: 2 phased: 2 SPAN: 0 fragments 43 2339 0 1 chr2 79087440 T TA 0/1:54:54:45,48:5,5:4,3:64.18:0.56:0:0 0 . 100.00 43 2340 0 1 chr2 79087440 TA T 0/1:54:54:45,48:5,5:4,3:64.18:0.56:0:0 0 . 100.00 43

BLOCK: offset: 1037 len: 2 phased: 2 SPAN: 0 fragments 48 1037 0 1 chr1 111760455 AT A 0/1:77:77:61,63:13,13:3,4:89.79:0.67:0:0 0 . 100.00 48 1038 0 1 chr1 111760455 ATT A 0/1:77:77:65,67:12,12:0,1:89.79:0.67:0:0 0 . 100.00 48

vibansal commented 4 days ago

HapCUT2 can give incorrect phasing results when the input variants/genotypes are incorrect. This happens more often with false indels in low-complexity regions. This issue is related to allele-parsing (extractHAIRS) rather than the haplotype phasing itself.