sgkit-dev / vcztools

Partial reimplementation of bcftools for VCF Zarr
Apache License 2.0

Add provenance header #69

Closed Will-Tyler closed 2 months ago

Will-Tyler commented 3 months ago

Overview

This pull request makes progress on #46: it adds the command-line provenance to VCF output headers.

Testing

I added a unit test that checks the output for the header.
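
A minimal sketch of the kind of check such a test could make, assuming the provenance line has the `##vcztools_viewCommand=...; Date=...` shape shown in the example output. The names `PROVENANCE_RE` and `find_provenance` are hypothetical and are not taken from the vcztools code base; the actual test in this PR may work differently.

```python
import re

# Hypothetical pattern, modelled on the header line in the example output.
PROVENANCE_RE = re.compile(
    r"^##vcztools_viewCommand=(?P<command>.+); Date=(?P<date>.+)$"
)


def find_provenance(vcf_text):
    """Return (command, date) from the first provenance header line, or None."""
    for line in vcf_text.splitlines():
        if not line.startswith("##"):
            # Meta-information lines end at the #CHROM column header.
            break
        match = PROVENANCE_RE.match(line)
        if match:
            return match.group("command"), match.group("date")
    return None
```

A test could then assert that the returned command equals the arguments passed on the command line and that the date field is non-empty.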

Example

vcztools view vcz_test_cache/sample.vcf.vcz 
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=X>
##vcztools_viewCommand=view vcz_test_cache/sample.vcf.vcz; Date=2024-08-31 16:14:10.986683
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
19      111     .       A       C       9.6     .       .       GT:HQ   0|0:10,15       0|0:10,10       0/1:3,3
19      112     .       A       G       10      .       .       GT:HQ   0|0:10,10       0|0:10,10       0/1:3,3
20      14370   rs6054257       G       A       29      PASS    AF=0.5;DB;DP=14;H2;NS=3 GT:DP:GQ:HQ     0|0:1:48:51,51  1|0:8:48:51,51  1/1:5:43:.,.
20      17330   .       T       A       3       q10     AF=0.017;DP=11;NS=3     GT:DP:GQ:HQ     0|0:3:49:58,50  0|1:5:3:65,3    0/0:3:41:.,.
20      1110696 rs6040355       A       G,T     67      PASS    AA=T;AF=0.333,0.667;DB;DP=10;NS=2       GT:DP:GQ:HQ     1|2:6:21:23,27  2|1:0:2:18,2    2/2:4:35:.,.
20      1230237 .       T       .       47      PASS    AA=T;DP=13;NS=3 GT:DP:GQ:HQ     0|0:.:54:56,60  0|0:4:48:51,51  0/0:2:61:.,.
20      1234567 microsat1       G       GA,GAC  50      PASS    AA=G;AC=3,1;AN=6;DP=9;NS=3      GT:DP:GQ        0/1:4:. 0/2:2:17        ./.:3:40
20      1235237 .       T       .       .       .       .       GT      0/0     0|0     ./.
X       10      rsTest  AC      A,ATG,C 10      PASS    .       GT      0       0/1     0|2

Discussion

I wasn't sure whether the vcztools version is defined yet, so I have not added the vcztools version header.

I use the default date format in the vcztools header (as in 2024-08-31 16:14:10.986683 above), which differs from the date format in the corresponding header in bcftools' output. Let me know if I should change the format to match bcftools.
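
To make the two options concrete, here is a rough sketch of how the header line could be built with either date format. The function `provenance_line` and its parameters are hypothetical, not the code in this PR. It also assumes that bcftools writes a C `ctime`-style date (for example `Sat Aug 31 16:14:10 2024`), which I have not checked against the bcftools source.

```python
from datetime import datetime


def provenance_line(args, now=None, bcftools_style=False):
    """Build a provenance header line for the given command-line arguments."""
    now = now or datetime.now()
    if bcftools_style:
        # Assumed bcftools format: ctime-style, e.g. "Sat Aug 31 16:14:10 2024".
        date = now.ctime()
    else:
        # Default string form of a datetime, as in the example output above.
        date = str(now)
    return f"##vcztools_viewCommand={' '.join(args)}; Date={date}"


when = datetime(2024, 8, 31, 16, 14, 10, 986683)
print(provenance_line(["view", "vcz_test_cache/sample.vcf.vcz"], now=when))
print(provenance_line(["view", "vcz_test_cache/sample.vcf.vcz"], now=when, bcftools_style=True))
```

Passing `now` explicitly keeps the output deterministic, which also makes the header easier to unit test.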

jeromekelleher commented 2 months ago

Needs a rebase here please @Will-Tyler

Will-Tyler commented 2 months ago

Should be rebased now!