For below data example with two haplotypes, each with a different overlapping indel variant, that are represented in the unphased input vcf file as three variants (a one 1 bp deletion immediately followed by a 8 bp deletion for haplotype "earlyIndel" and a 5 bp deletion for haplotype "lateDeletion") HapCUT2 (with --indel 1) leaves the the second variant unphased. Is this expected?
Fully reproducible example (using current master v1.3.1 0eb7075424afbb40259ced6b5b1cbd95d0cedf55 and current develop of htslib: 1.18 99415e2a)
Extracting haplotype informative reads from bamfiles reads.bam minQV 13 minMQ 20 maxIS 1000
VCF file unphased.vcf has 3 variants
adding chrom chr19 to index
vcffile unphased.vcf chromosomes 1 hetvariants 3 variants 3
detected 0 variants with two non-reference alleles, these variants will not be phased
reading fasta index file chr19.fa.fai ... fasta file chr19.fa has 1 chromosomes/contigs
found match for reference contig chr19 in VCF file index
contig chr19 length 61431566
reading reference sequence file chr19.fa with 1 contigs
read reference sequence file in 0.37 sec
reading sorted bamfile reads.bam
processing reads mapped to chrom "chr19"
final cleanup of fragment list: 8 current chrom 0 prev 0
[2023:10:26 11:04:40] input fragment file: fragment_file
[2023:10:26 11:04:40] input variantfile (VCF format):unphased.vcf
[2023:10:26 11:04:40] haplotypes will be output to file: haplotype_output_file
[2023:10:26 11:04:40] solution convergence cutoff: 5
[2023:10:26 11:04:40] read 3 variants from unphased.vcf file
[2023:10:26 11:04:40] read fragment file and variant file: fragments 7 variants 3
mean number of variants per read is 2.00
[2023:10:26 11:04:40] building read-variant graph for phasing
[2023:10:26 11:04:40] no of non-trivial connected components 1 max-Degree 7 connected variants 2 coverage-per-variant 7.000000
[2023:10:26 11:04:40] comp 0 first 0 last 2 phased 2 fragments 7
[2023:10:26 11:04:40] fragments 7 snps 3 component(blocks) 1
[2023:10:26 11:04:40] starting Max-Likelihood-Cut based haplotype assembly algorithm
[2023:10:26 11:04:40] HIC ITER 0
[2023:10:26 11:04:40] PHASING ITER 0
[2023:10:26 11:04:40] component 0 length 3 phased 2 0...2
[2023:10:26 11:04:40] component 0 offset 0 length 3 phased 2 LL -0.0 cutvalue 13.909938 bestLL -0.03
[2023:10:26 11:04:40] PHASING ITER 1
[2023:10:26 11:04:40] component 0 offset 0 length 3 phased 2 LL -0.0 cutvalue 13.909938 bestLL -0.03
[2023:10:26 11:04:40] PHASING ITER 2
[2023:10:26 11:04:40] component 0 offset 0 length 3 phased 2 LL -0.0 cutvalue 13.909938 bestLL -0.03
[2023:10:26 11:04:40] PHASING ITER 3
[2023:10:26 11:04:40] component 0 offset 0 length 3 phased 2 LL -0.0 cutvalue 13.909938 bestLL -0.03
[2023:10:26 11:04:40] PHASING ITER 4
[2023:10:26 11:04:40] component 0 offset 0 length 3 phased 2 LL -0.0 cutvalue 13.909938 bestLL -0.03
[2023:10:26 11:04:40] PHASING ITER 5
[2023:10:26 11:04:40] component 0 offset 0 length 3 phased 2 LL -0.0 cutvalue 13.909938 bestLL -0.03
[2023:10:26 11:04:40] PHASING ITER 6
[2023:10:26 11:04:40] starting to post-process phased haplotypes to further improve accuracy
[2023:10:26 11:04:40] starting to output phased haplotypes
[2023:10:26 11:04:40] OUTPUTTING PRUNED HAPLOTYPE ASSEMBLY TO FILE haplotype_output_file
[2023:10:26 11:04:40] N50 haplotype length is 0.01 kilobases
[2023:10:26 11:04:40] OUTPUTTING PHASED VCF TO FILE haplotype_output_file.phased.VCF
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##INFO=<ID=hapcut2,Number=1,Type=Integer,Description="phased by HapCUT2 or not">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##contig=<ID=chr19,length=61431566>
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="ID of Phase Set for Variant">
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phred QV indicating probability that this variant is incorrectly phased relative to the haplotype">
##FORMAT=<ID=PD,Number=1,Type=Integer,Description="phased Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CGAAGAGGTAGGTGCGAG-1
chr19 55910646 . AC A 7447.85 PASS AC=1 GT:AD:DP:PQ:PD:PS 1|0:47,35:82:100:7:55910646
chr19 55910648 . AAATCCCCC A 12412.8 PASS AC=1 GT:AD:DP:PS 0/1:47,35:82:.
chr19 55910653 . CCCCAT C 13372.2 PASS AC=1 GT:AD:DP:PQ:PD:PS 0|1:0,47:82:100:7:55910646
If I add a second * allele to the third variant to denote the overlapping deletion (as suggested by VCFv4.3) , and add --triallelic 1 I get the same results. Nor do --new_format 1 and --realign_variants 1 change the results.
For below data example with two haplotypes, each with a different overlapping indel variant, that are represented in the unphased input vcf file as three variants (a one 1 bp deletion immediately followed by a 8 bp deletion for haplotype "earlyIndel" and a 5 bp deletion for haplotype "lateDeletion") HapCUT2 (with --indel 1) leaves the the second variant unphased. Is this expected?
Fully reproducible example (using current master v1.3.1 0eb7075424afbb40259ced6b5b1cbd95d0cedf55 and current develop of htslib: 1.18
99415e2a
)If I add a second
*
allele to the third variant to denote the overlapping deletion (as suggested by VCFv4.3) , and add--triallelic 1
I get the same results. Nor do--new_format 1
and--realign_variants 1
change the results.