jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html
Other
404 stars 200 forks source link

add/edit call data #82

Open alimanfoo opened 11 years ago

alimanfoo commented 11 years ago

I'd really like a way to add and edit format fields to the sample call data when writing a VCF file. Specifically, what I'm looking to do is add the 'FT' format field, then fill it in for each call. Any suggestions for a workaround much appreciated.

martijnvermaat commented 11 years ago

It is a bit of a hack, but currently I use something like this (this is for INFO, but I assume you could do something similar for FORMAT):

from random import random
from vcf.parser import _Info as VcfInfo, field_counts as vcf_field_counts
import vcf

reader = vcf.Reader(open('original.vcf'))

reader.infos['EXAMPLE_INFO'] = VcfInfo(
    'EXAMPLE_INFO', vcf_field_counts['A'], 'Float',
    'Example info value for each allele')

writer = vcf.Writer(open('with_example_info.vcf', 'w'), reader,
                    lineterminator='\n')

for record in reader:
    record.add_info('EXAMPLE_INFO', [random() for _ in record.ALT])
    writer.write_record(record)

It's not pretty, but I think in general modifying and writing VCF files is not covered well (yet) in PyVCF.

jamescasbon commented 11 years ago

Its not easy at the minute. For performance reasons, we store the data from each cell in a namedtuple (see https://github.com/jamescasbon/PyVCF/blob/master/vcf/model.py#L529 ). These are immutable, and hence its hard to add data to a CallData.

There are two ways I can imagine doing this off the top of my head:

  1. add a mode that allows you to return mutable CallData objects (using dict or something).
  2. add a method to CallData to add a field

Any opinions, patches on these approaches welcome ;)

alimanfoo commented 11 years ago

Hi Martijn,

Thanks for the reply, I've been using that pattern to add INFO fields to the header and INFO values to variants and it's working well. The problem I have is that I now want to modify the actual call data, not just add a new FORMAT field to the header, and as James says the call data is currently immutable because it's stored in a named tuple.

jamescasbon commented 11 years ago

Checkout this branch... https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data

In [1]: import vcf
In [2]: r = vcf.Reader(open('vcf/test/miseq.vcf'))
In [3]: c = r.next().samples[0]
In [4]: c.add_field('FOO', 'BAR')
In [5]: c
Out[5]: Call(sample=Sample1, CallData(GT=0/1, AD=[244, 6], DP=250, GQ=99.0, PL=[374, 0, 3216], VF=0.024, FOO=BAR))

So it works, but no effort is made to maintain ordering or consistency between the fields expected from the record and in the call data. Thoughts on a suitable API?

jamescasbon commented 11 years ago

@alimanfoo ?? Any feedback?

alimanfoo commented 11 years ago

Hi James, apologies, I've had my head in another problem. Thanks for scuba quick response, I'll try out your patch ASAP and get back to you re the API.

On 20 Jan 2013, at 20:26, James Casbon notifications@github.com wrote:

@alimanfoo https://github.com/alimanfoo ?? Any feedback?

— Reply to this email directly or view it on GitHubhttps://github.com/jamescasbon/PyVCF/issues/82#issuecomment-12476793.

gitanoqevaporelmundoentero commented 11 years ago

Suggested branch changes in https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data are not going to be included in the master branch and future releases?? Being able to add unexistent FORMAT fields is a quite interesting fact that can easily be performed with the mentioned changes in model.py. At least I'm using them in this way, for including allele frequencies in the format fields:

            c = record.samples[index]
            try:
                c.add_field('FREQ', 'bar')
            except ValueError:
                pass
dawe commented 10 years ago

The issue is still open, any plans to close it? Meanwhile, the "easiest" way to do it is to create a new record copying all information from the parsed one, create a new namedtuple (named CallData), create new values for that and add modified values to the new tuple. In the end one sample.data = newcalldata. I had to use it to modify some values, like this:

# vcf_parser is a VCFReader
# vcf_writer is a VCFWriter
import collections
import copy

f_keys = vcf_parser.formats.keys() #it's an ordered dict
record = vcf_parser.next()
new_record = copy.deepcopy(record)
for sx in range(len(record.samples)):
  new_record.samples[sx].data = collections.namedtuple('CallData', f_keys)
  f_vals = [record.samples[sx].data[vx] for vx in range(len(f_keys))]
  handy_dict = dict(zip(f_keys, f_vals))
  for f in ['EQ', 'SQ', 'NQ', 'LQ', 'RQ']: #some fields i wanted to modify
    handy_dict[f] = [record.samples[sx][f][0]]
  if handy_dict['GT'] == '2': # I also wanted to modify GT string here
    handy_dict['GT'] = '0'
  new_vals = [handy_dict[x] for x in f_keys]  
  # finally set CallData
  new_record.samples[sx].data = new_record.samples[sx].data._make(new_vals)
vcf_writer.write_record(del_record)

It's ugly, but at least it works. I'm not sure if the deepcopy is appropriate here

gitanoqevaporelmundoentero commented 10 years ago

@dawe : You can check the open pull request in #136 . I'm currently dealing with most of the common vcf edit tasks (at least those affecting my daily work) in the fork related to the pull request (https://github.com/gitanoqevaporelmundoentero/PyVCF). I'm actively using this fork until it gets merged with the official project.

martijnvermaat commented 10 years ago

The issue hasn't been resolved yet, so let's keep this one open. It contains some useful links and context and can still be used for general discussion.

The specific implementation in #136 can be discussed there.

jdidion commented 10 years ago

I think this is easily solvable by using recordtype as a drop-in replacement for namedtuple: https://pypi.python.org/pypi/recordtype

daniellieber-claritas commented 9 years ago

I've read the discussion above and in #136 but am still unclear -- is there currently a way to add/edit genotype fields?

gitanoqevaporelmundoentero commented 9 years ago

Since I didn't get any feedback on the open pull request in https://github.com/jamescasbon/PyVCF/pull/136, I continued developing the add/edit functions on a separate and private repository, which is currently working. In next weeks I'll open a new pull request with my changes, and I hope that this time we will get this PR merged...

jaebird123 commented 7 years ago

Hi, I am deciding if I should use PyVCF to convert my proprietary tab separated text file to a .vcf file. I was planning on using PyVCF to make VCF record objects from my data then use the VCF writer object to write the VCF records to a file. Can I do this with the current version of PyVCF? Thank you for your help.

everestial commented 7 years ago

@martijnvermaat @jamescasbon @alimanfoo @gitanoqevaporelmundoentero

Not, sure why but I am still having a problem:

c = record.samples[2] 
c.add_field('foo', 'bar')

Traceback (most recent call last): File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 111, in c.add_field('foo', 'bar') AttributeError: '_Call' object has no attribute 'add_field'

I also tried another method:

In [3]: c = next(r).samples[0]
In [4]: c.add_field('FOO', 'BAR')

and still getting the same kind of AtrributeError

everestial commented 7 years ago

@jamescasbon @martijnvermaat

I have to add that there is no add_field method in the file module.py. https://github.com/jamescasbon/PyVCF/blob/master/vcf/model.py

Any suggestions on how I can use add_field method???

tomwitz commented 3 years ago

I do it like this, you can play with it a little bit.

from collections import namedtuple

for record in reader:
        new_gt = '2/2'
        new_CallData = namedtuple('CallData', samp_fmt._fields)
        calldata = [new_gt] + list(record.samples[0].data[1:])
        record.samples[0].data = new_CallData(*calldata)
healthcare-mikeli commented 2 years ago

@tomwitz this works for me. just make sure to add samp_fmt = sample.data._fields