sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

Optional first phasing symbol introduced in VCF 4.4 #263

Open timothymillar opened 2 weeks ago

timothymillar commented 2 weeks ago

The VCF 4.4 spec now allows an initial symbol indicating the phasing of the first allele. For example, /0/1 is a valid genotype. At present, vcf2zarr raises an error on this input: Couldn't read GT data: value not a number or '.' ....
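To make the syntax concrete, here is a minimal sketch of splitting a GT string that may carry the optional leading phasing symbol. The function name `parse_gt` and its return shape are hypothetical, chosen only for illustration; this is not how vcf2zarr, cyvcf2, or htslib read genotypes, and it does not try to infer a phasing state when the leading symbol is absent.

```python
import re

# One allele token: a non-negative integer index, or '.' for a missing call.
_ALLELE = r"(?:\d+|\.)"

# Optional leading phasing symbol, then alleles joined by '/' or '|'.
_GT_PATTERN = re.compile(rf"([/|])?({_ALLELE}(?:[/|]{_ALLELE})*)")


def parse_gt(gt):
    """Split a GT string into (leading_symbol, alleles, separators).

    Hypothetical helper for illustration only.
    leading_symbol is '/', '|', or None when the GT has no prefix.
    alleles is a list of ints, with None for a missing ('.') allele.
    separators lists the symbols found between consecutive alleles.
    """
    match = _GT_PATTERN.fullmatch(gt)
    if match is None:
        raise ValueError(f"Malformed GT value: {gt!r}")
    leading, body = match.groups()
    # Splitting on a capturing group keeps the separators in the result,
    # so alleles sit at even positions and separators at odd positions.
    tokens = re.split(r"([/|])", body)
    alleles = [None if token == "." else int(token) for token in tokens[0::2]]
    separators = tokens[1::2]
    return leading, alleles, separators


if __name__ == "__main__":
    for gt in ["/0/1", "0/1", "|0|1", "/1/1"]:
        print(gt, parse_gt(gt))
```

With this sketch, /0/1 and 0/1 yield the same alleles and separators and differ only in the leading symbol, which is the extra piece of information a pre-4.4 reader has nowhere to put.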

timothymillar commented 2 weeks ago

Related issue around supporting partial phasing in the VCF-Zarr spec: https://github.com/sgkit-dev/vcf-zarr-spec/issues/24

jeromekelleher commented 2 weeks ago

I think we'll need to wait on htslib and cyvcf2 support for this - presumably it'll be a while coming through the pipeline. I had a quick scan of the htslib issue tracker but didn't find anything.

What does bcftools view give for this VCF @timothymillar?

timothymillar commented 2 weeks ago

Good point, I don't think we can do anything for now. With the VCF:

##fileformat=VCFv4.4
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20201009
##source=.
##reference=./simple.fasta
##contig=<ID=CHR1,length=60>
##contig=<ID=CHR2,length=60>
##contig=<ID=CHR3,length=60>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3
CHR1    2       .       A       T       60      PASS    NS=3;AC=3       GT      /1/1    /0/0    /0/0
CHR1    7       .       A       C       60      PASS    NS=3;AC=4       GT      /0/0    /0/1    /0/1

bcftools view (version 1.20) omits all of the records (nothing after #CHROM ...).

bcftools view (version 1.10.2) inserts an additional reference allele:

...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3
CHR1    2       .       A       T       60      PASS    NS=3;AC=3       GT      0/1/1   0/0/0   0/0/0
CHR1    7       .       A       C       60      PASS    NS=3;AC=4       GT      0/0/0   0/0/1   0/0/1

jeromekelleher commented 2 weeks ago

Hmm - that's not a great sign. I don't think this feature is going to get used much for a while.