twolinin / longphase

GNU General Public License v3.0
99 stars 9 forks source link

SHAPEIT4 output for haptag #85

Closed Yijun-Tian closed 3 weeks ago

Yijun-Tian commented 3 weeks ago

Hi, To further improve the read level phasing performance, we use the population based (genotype based) haplotype reference to adjusting the longphase phase vcf in SHAPEIT4. Here is the output VCF.

##  longphase phase VCF
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample
chr22   10522327        chr22_10522327_T_C      T       C       16      PASS    .       GT:GQ:DP:AD:VAF:C:PS    1/1:4:5:5:1:DV:.
chr22   10522392        chr22_10522392_C_T      C       T       16      PASS    .       GT:GQ:DP:AD:VAF:C:PS    1/1:4:5:5:1:DV:.
chr22   10522452        chr22_10522452_C_T      C       T       18.8    PASS    .       GT:GQ:DP:AD:VAF:C:PS    1/1:6:5:5:1:DV:.

##  SHAPEIT4 output VCF
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample
chr22   10526217        chr22_10526217_A_G      A       G       .       .       AC=2;AF=1;CM=0  GT      1|1
chr22   10526550        chr22_10526550_G_A      G       A       .       .       AC=2;AF=1;CM=0.000681428        GT      1|1
chr22   10572965        chr22_10572965_A_C      A       C       .       .       AC=2;AF=1;CM=0.0956619  GT      1|1
chr22   10573343        chr22_10573343_C_G      C       G       .       .       AC=2;AF=1;CM=0.0964354  GT      1|1

However, when I use the VCF from SHAPEIT4 for haplotagging, error happens like below:

$  ~/longphase haplotag -s Shannon/shapeit4/s246.chr22.snv.vcf.gz -b s246.sorted.bam -r hg38.alignmt.fa -o s246.chr22
phased SNP file:   Shannon/shapeit4/s246.chr22.snv.vcf.gz
phased SV file:
input bam file:    s246.sorted.bam
input ref file:    hg38.alignmt.fa
output bam file:   s246.chr22.bam
number of threads: 1
write log file:    false
log file:
-------------------------------------------
filter mapping quality below:  20
percentage threshold:          0.6
tag supplementary:             false
-------------------------------------------
parsing SNP VCF ... terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
Aborted (core dumped)

Since the SHAPEIT4 output only 1 phase set ideally, could you clarify what INFO/FORMAT fields are needed for the haplotag step? I was thinking to define these field by myself so that can use haplotag to label each read for downstream analysis.

Thanks,

twolinin commented 3 weeks ago

Hi, @Yijun-Tian

I performed some tests and did not encounter this error message. Could you please provide the version of LongPhase you are using, and the SHAPEIT4 VCF header you used?

Thanks

Yijun-Tian commented 3 weeks ago

Thanks @twolinin ! I am using longphase 1.5.1 for the haplotagging:

shapeit4]$ longphase --version
Version: 1.5.1
Usage: main <command> [options]
               phase      run phasing algorithm.
               haplotag   tag reads by haplotype.
               modcall    convert bam file to modification vcf file.

Here is the header of SHAPEIT4 VCF:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=16/08/2024 - 15:36:24
##source=shapeit4.1.3
##contig=<ID=chr22>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Allele count">
##INFO=<ID=CM,Number=A,Type=Float,Description="Interpolated cM position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##bcftools_viewVersion=1.17+htslib-1.17
##bcftools_viewCommand=view -h s246.chr22.snv.vcf.gz; Date=Fri Aug 23 09:13:21 2024
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample
twolinin commented 3 weeks ago

Hi @Yijun-Tian

Please modify your header before ##contig=<ID=chr22> after ##contig=<ID=chr22,length=242193529>

Thanks

Yijun-Tian commented 3 weeks ago

Thanks, @twolinin. Haptag now works with the modification you noted.