jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html
Other
402 stars 200 forks source link

50 percent inflation #196

Open iceback opened 9 years ago

iceback commented 9 years ago

Hoping to use PyVCF but a quick test suggested that file size would increase 50%. Given the gross size of these files nowadays (1T not unheard of), I would have to consider this a bug - even a show-stopper for me.

import vcf vcf_reader = vcf.Reader(open('/export/home/rob/work/data/discord/testDiscord.vcf', 'r')) vcf_writer = vcf.Writer(open('/export/home/rob/work/data/discord/rewrite.vcf', 'w'), vcf_reader) ln = 1 for record in vcf_reader: vcf_writer.write_record(record)

ls -ltr *.vcf -rw-rw-r--. 1 rob analysis 38225240 Mar 16 17:10 testDiscord.vcf -rw-rw-r--. 1 rob analysis 58232928 Mar 17 18:30 rewrite.vcf

Given that the testDiscord.vcf is only the first 5000 lines of the real file of 18,000,000 lines. Naturally I would rather not inflate that un-necessarily. Looking at the the diffs of the (tails of) files it appears to be that the PyVCF writes out full genotype values for no-calls: ./.:.:.:.:. vs the simple ./. in the original.

iceback commented 9 years ago

This helps, shaves off half the bloat. Sorry not set up for pulls... -rw-rw-r--. 1 rob analysis 58232928 Mar 17 18:30 firstRewrite.vcf -rw-rw-r--. 1 rob analysis 49602392 Mar 18 09:55 rewrite.vcf -rw-rw-r--. 1 rob analysis 38225240 Mar 16 17:10 testDiscord.vcf

def _format_sample(self, fmt, sample):
    if hasattr(sample.data, 'GT'):
        gt = sample.data.GT
    else:
        gt = './.' if 'GT' in fmt else ''

    if not gt:
        return ':'.join([self._stringify(x) for x in sample.data])
    # Following the VCF spec, GT is always the first item whenever it is present.
    else:
        if (gt == './.' or gt == '.|.') and self._allnull(sample.data[1:]):
            return gt
        else:
            return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]])

def _allnull(self, datavec):
    for d in datavec:
        if d:
            return False
    return True
iceback commented 9 years ago

oops return gt NOT return [gt]

iceback commented 9 years ago

There is an inconsistency reading in "./.". Usually the rest of the genotype vector is null (prints out "./."), but sometimes the genotype vector gets initialized with zeroed values (./.:0,0:0:.:.). The format def for this example is "GT:AD:DP:GQ:PL" but behaviour is inconsistent across one line of input. But I'm now down to about 7% inflation and I might live with that!

-rw-rw-r--. 1 rob analysis 58232928 Mar 17 18:30 firstRewrite.vcf -rw-rw-r--. 1 rob analysis 41115632 Mar 18 10:04 rewrite.vcf -rw-rw-r--. 1 rob analysis 38225240 Mar 16 17:10 testDiscord.vcf

The parser inconsistency in reading in "./." should be a separate issue.