samtools / htslib

C library for high-throughput sequencing data formats
Other
813 stars 447 forks source link

VCF added sample has undefined format fields #282

Open noporpoise opened 9 years ago

noporpoise commented 9 years ago

When adding a new sample to a VCF, format fields for the new sample seem to take on undefined values rather than the "missing" value. What is the best way to set all format fields to missing for a new sample?

I'm adding a sample and format field to a VCF with the following code:

// open input / output
htsFile *vin = hts_open(vcf_path, "r");
if(vin == NULL) die("Cannot read %s: %s", vcf_path, strerror(errno));
htsFile *vout = hts_open(out_path, "w");
if(vout == NULL) die("Cannot write %s: %s", out_path, strerror(errno));

// read/write headers
bcf_hdr_t *hdrin = bcf_hdr_read(vin);
bcf_hdr_t *hdrout = bcf_hdr_dup(hdrin);
bcf_hdr_append(hdrout, "##FORMAT=<ID=COV,Number=G,Type=Float,Description=\"Genotype?\">\n");
bcf_hdr_add_sample(hdrout, "NewSample");

if(bcf_hdr_write(vout, hdrout) != 0) die("Cannot write header");

// read vcf entries
bcf1_t *v = bcf_init();

size_t i, nsamples = bcf_hdr_nsamples(hdrout);
float *cov = calloc(2 * nsamples, sizeof(float));

while(bcf_read(vin, hdrin, v) >= 0)
{
  // Unpack ref,alt,filter,info info
  bcf_unpack(v, BCF_UN_ALL);

  // Set cov to something
  for(i = 0; i < 2*nsamples; i++) cov[i] = i;

  // Update cov
  bcf_update_format_float(hdrout, v, "COV", cov, 2*nsamples);

  // Print
  if(bcf_write(vout, hdrout, v) != 0) die("Cannot write record");
}

free(cov);

bcf_destroy(v);
bcf_hdr_destroy(hdrin);
bcf_hdr_destroy(hdrout);
hts_close(vin);
hts_close(vout);

Input VCF:

##fileformat=VCFv4.1
##fileDate=20150920
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=K29,Number=0,Type=Flag,Description="Found at k=29">
##FORMAT=<ID=HI,Number=G,Type=Float,Description="field">
##reference=ref/ref.fa
##contig=<ID=ref,length=599>
##contig=<ID=2,length=11>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ben jerry
ref 77  id0 AC  A   .   PASS    K29 HI  1.1,2.2 .,.
ref 125 id1 C   CA  .   PASS    K29 HI  2.2,3.3 4.4,5.5

Output VCF which seems to have weird values in the pre-existing format field HI for NewSample:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150920
##INFO=<ID=K29,Number=0,Type=Flag,Description="Found at k=29">
##FORMAT=<ID=HI,Number=G,Type=Float,Description="field">
##reference=ref/ref.fa
##contig=<ID=ref,length=599>
##contig=<ID=2,length=11>
##FORMAT=<ID=COV,Number=G,Type=Float,Description="Genotype?">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ben jerry   NewSample
ref 77  id0 AC  A   .   PASS    K29 HI:COV  1.1,2.2:0,1 .,.:2,3 0,3.58732e-43:4,5
ref 125 id1 C   CA  .   PASS    K29 HI:COV  2.2,3.3:0,1 4.4,5.5:2,3 0,3.58732e-43:4,5

I can't see any easy way to set all format fields to missing value for a new sample. Is there a simple way? It seems the fastest way would be from within htslib. A function in vcf.h would be useful, I can have a go at writing one if it's needed.

mp15 commented 8 years ago

I think the problem here is that:

Therefore there's nothing that tells the API that record v, is setup to correspond to the new changed header. I am considering writing a function bcf_translate_hdr with arguments old_hdr, new_hdr, and record to do the translation from old header to new header. Will that do for you?

noporpoise commented 8 years ago

That sounds like it would be very useful. What is the correct way to add a sample to a VCF and print it out currently? Should I be editing the input header and not using an output header?