Open iceback opened 9 years ago
This helps, shaves off half the bloat. Sorry not set up for pulls... -rw-rw-r--. 1 rob analysis 58232928 Mar 17 18:30 firstRewrite.vcf -rw-rw-r--. 1 rob analysis 49602392 Mar 18 09:55 rewrite.vcf -rw-rw-r--. 1 rob analysis 38225240 Mar 16 17:10 testDiscord.vcf
def _format_sample(self, fmt, sample):
if hasattr(sample.data, 'GT'):
gt = sample.data.GT
else:
gt = './.' if 'GT' in fmt else ''
if not gt:
return ':'.join([self._stringify(x) for x in sample.data])
# Following the VCF spec, GT is always the first item whenever it is present.
else:
if (gt == './.' or gt == '.|.') and self._allnull(sample.data[1:]):
return gt
else:
return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]])
def _allnull(self, datavec):
for d in datavec:
if d:
return False
return True
oops return gt NOT return [gt]
There is an inconsistency reading in "./.". Usually the rest of the genotype vector is null (prints out "./."), but sometimes the genotype vector gets initialized with zeroed values (./.:0,0:0:.:.). The format def for this example is "GT:AD:DP:GQ:PL" but behaviour is inconsistent across one line of input. But I'm now down to about 7% inflation and I might live with that!
-rw-rw-r--. 1 rob analysis 58232928 Mar 17 18:30 firstRewrite.vcf -rw-rw-r--. 1 rob analysis 41115632 Mar 18 10:04 rewrite.vcf -rw-rw-r--. 1 rob analysis 38225240 Mar 16 17:10 testDiscord.vcf
The parser inconsistency in reading in "./." should be a separate issue.
Hoping to use PyVCF but a quick test suggested that file size would increase 50%. Given the gross size of these files nowadays (1T not unheard of), I would have to consider this a bug - even a show-stopper for me.
import vcf vcf_reader = vcf.Reader(open('/export/home/rob/work/data/discord/testDiscord.vcf', 'r')) vcf_writer = vcf.Writer(open('/export/home/rob/work/data/discord/rewrite.vcf', 'w'), vcf_reader) ln = 1 for record in vcf_reader: vcf_writer.write_record(record)
ls -ltr *.vcf -rw-rw-r--. 1 rob analysis 38225240 Mar 16 17:10 testDiscord.vcf -rw-rw-r--. 1 rob analysis 58232928 Mar 17 18:30 rewrite.vcf
Given that the testDiscord.vcf is only the first 5000 lines of the real file of 18,000,000 lines. Naturally I would rather not inflate that un-necessarily. Looking at the the diffs of the (tails of) files it appears to be that the PyVCF writes out full genotype values for no-calls: ./.:.:.:.:. vs the simple ./. in the original.