Open alimanfoo opened 11 years ago
It is a bit of a hack, but currently I use something like this (this is for INFO
, but I assume you could do something similar for FORMAT
):
from random import random
from vcf.parser import _Info as VcfInfo, field_counts as vcf_field_counts
import vcf
reader = vcf.Reader(open('original.vcf'))
reader.infos['EXAMPLE_INFO'] = VcfInfo(
'EXAMPLE_INFO', vcf_field_counts['A'], 'Float',
'Example info value for each allele')
writer = vcf.Writer(open('with_example_info.vcf', 'w'), reader,
lineterminator='\n')
for record in reader:
record.add_info('EXAMPLE_INFO', [random() for _ in record.ALT])
writer.write_record(record)
It's not pretty, but I think in general modifying and writing VCF files is not covered well (yet) in PyVCF.
Its not easy at the minute. For performance reasons, we store the data from each cell in a namedtuple (see https://github.com/jamescasbon/PyVCF/blob/master/vcf/model.py#L529 ). These are immutable, and hence its hard to add data to a CallData.
There are two ways I can imagine doing this off the top of my head:
Any opinions, patches on these approaches welcome ;)
Hi Martijn,
Thanks for the reply, I've been using that pattern to add INFO fields to the header and INFO values to variants and it's working well. The problem I have is that I now want to modify the actual call data, not just add a new FORMAT field to the header, and as James says the call data is currently immutable because it's stored in a named tuple.
Checkout this branch... https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data
In [1]: import vcf
In [2]: r = vcf.Reader(open('vcf/test/miseq.vcf'))
In [3]: c = r.next().samples[0]
In [4]: c.add_field('FOO', 'BAR')
In [5]: c
Out[5]: Call(sample=Sample1, CallData(GT=0/1, AD=[244, 6], DP=250, GQ=99.0, PL=[374, 0, 3216], VF=0.024, FOO=BAR))
So it works, but no effort is made to maintain ordering or consistency between the fields expected from the record and in the call data. Thoughts on a suitable API?
@alimanfoo ?? Any feedback?
Hi James, apologies, I've had my head in another problem. Thanks for scuba quick response, I'll try out your patch ASAP and get back to you re the API.
On 20 Jan 2013, at 20:26, James Casbon notifications@github.com wrote:
@alimanfoo https://github.com/alimanfoo ?? Any feedback?
— Reply to this email directly or view it on GitHubhttps://github.com/jamescasbon/PyVCF/issues/82#issuecomment-12476793.
Suggested branch changes in https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data are not going to be included in the master branch and future releases?? Being able to add unexistent FORMAT fields is a quite interesting fact that can easily be performed with the mentioned changes in model.py. At least I'm using them in this way, for including allele frequencies in the format fields:
c = record.samples[index]
try:
c.add_field('FREQ', 'bar')
except ValueError:
pass
The issue is still open, any plans to close it? Meanwhile, the "easiest" way to do it is to create a new record copying all information from the parsed one, create a new namedtuple (named CallData), create new values for that and add modified values to the new tuple. In the end one sample.data = newcalldata. I had to use it to modify some values, like this:
# vcf_parser is a VCFReader
# vcf_writer is a VCFWriter
import collections
import copy
f_keys = vcf_parser.formats.keys() #it's an ordered dict
record = vcf_parser.next()
new_record = copy.deepcopy(record)
for sx in range(len(record.samples)):
new_record.samples[sx].data = collections.namedtuple('CallData', f_keys)
f_vals = [record.samples[sx].data[vx] for vx in range(len(f_keys))]
handy_dict = dict(zip(f_keys, f_vals))
for f in ['EQ', 'SQ', 'NQ', 'LQ', 'RQ']: #some fields i wanted to modify
handy_dict[f] = [record.samples[sx][f][0]]
if handy_dict['GT'] == '2': # I also wanted to modify GT string here
handy_dict['GT'] = '0'
new_vals = [handy_dict[x] for x in f_keys]
# finally set CallData
new_record.samples[sx].data = new_record.samples[sx].data._make(new_vals)
vcf_writer.write_record(del_record)
It's ugly, but at least it works. I'm not sure if the deepcopy is appropriate here
@dawe : You can check the open pull request in #136 . I'm currently dealing with most of the common vcf edit tasks (at least those affecting my daily work) in the fork related to the pull request (https://github.com/gitanoqevaporelmundoentero/PyVCF). I'm actively using this fork until it gets merged with the official project.
The issue hasn't been resolved yet, so let's keep this one open. It contains some useful links and context and can still be used for general discussion.
The specific implementation in #136 can be discussed there.
I think this is easily solvable by using recordtype as a drop-in replacement for namedtuple: https://pypi.python.org/pypi/recordtype
I've read the discussion above and in #136 but am still unclear -- is there currently a way to add/edit genotype fields?
Since I didn't get any feedback on the open pull request in https://github.com/jamescasbon/PyVCF/pull/136, I continued developing the add/edit functions on a separate and private repository, which is currently working. In next weeks I'll open a new pull request with my changes, and I hope that this time we will get this PR merged...
Hi, I am deciding if I should use PyVCF to convert my proprietary tab separated text file to a .vcf file. I was planning on using PyVCF to make VCF record objects from my data then use the VCF writer object to write the VCF records to a file. Can I do this with the current version of PyVCF? Thank you for your help.
@martijnvermaat @jamescasbon @alimanfoo @gitanoqevaporelmundoentero
Not, sure why but I am still having a problem:
c = record.samples[2]
c.add_field('foo', 'bar')
Traceback (most recent call last):
File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 111, in
I also tried another method:
In [3]: c = next(r).samples[0]
In [4]: c.add_field('FOO', 'BAR')
and still getting the same kind of AtrributeError
@jamescasbon @martijnvermaat
I have to add that there is no add_field
method in the file module.py
. https://github.com/jamescasbon/PyVCF/blob/master/vcf/model.py
Any suggestions on how I can use add_field
method???
I do it like this, you can play with it a little bit.
from collections import namedtuple
for record in reader:
new_gt = '2/2'
new_CallData = namedtuple('CallData', samp_fmt._fields)
calldata = [new_gt] + list(record.samples[0].data[1:])
record.samples[0].data = new_CallData(*calldata)
@tomwitz this works for me. just make sure to add samp_fmt = sample.data._fields
I'd really like a way to add and edit format fields to the sample call data when writing a VCF file. Specifically, what I'm looking to do is add the 'FT' format field, then fill it in for each call. Any suggestions for a workaround much appreciated.