jamescasbon / PyVCF

A Variant Call Format reader for Python.
http://pyvcf.readthedocs.org/en/latest/index.html
Other
402 stars 200 forks source link

Issue with fetching whole chromosome #180

Closed ckeown closed 9 years ago

ckeown commented 10 years ago

The nugget of code below works beautifully:

import vcf
vcf_reader = vcf.Reader(open('../SNPs/snps_spretus_chr19.recode.vcf.gz', 'r'))
snps_vcf = vcf_reader.fetch('19', 0, 200000000)                                                                                                                                                         snps = {}
alt_snps = {}
for snp in snps_vcf:
    snps[int(snp.POS)] = snp.REF.upper()
    alt_snps[int(snp.POS)] = [str(alt_snp).upper() for alt_snp in snp.ALT]

I want to change it, however, so that I can read the whole chromosome in without having to specify a range on the positions (because ultimately I want to run this for all chromosomes). I have tried snps_vcf = vcf_reader.fetch('19', 0). However, I get an error:

TypeError: 'NoneType' object is not iterable

I have similarly tried leaving the starting position out as well, but that gives me this error:

TypeError: fetch() takes at least 3 arguments (2 given)

Any suggestions? The documentation states that I should be able to run both of the commands above. I would include the version number, but I can't seem to find it because vcf.__version__ doesn't exist. It was installed within the last week, however, using pip, so it's probably the most recent.

Thanks in advance for your help! chris

martijnvermaat commented 10 years ago

I'm guessing fetch is not the problem here, but snp.ALT is None.

martijnvermaat commented 10 years ago

By the way, if your VCF file is grouped by chromosome, you can also get records one chromosome at a time using itertools.groupby instead of fetch. This would be a bit more lightweight since fetch only works on tabix-indexed files and requires pysam.

For example:

import itertools
import operator
import vcf

reader = vcf.reader(filename='bla.vcf')

for chrom, records in itertools.groupby(reader, operator.attrgetter('CHROM')):
    print 'Chromosome:', chrom
    for record in records:
        print record
ckeown commented 10 years ago

Thanks for the response, Martijn. I'm pretty sure it's not snp.ALT because the below code gives me the same error:

import vcf
vcf_reader = vcf.Reader(open('../SNPs/snps_spretus_chr19.recode.vcf.gz', 'r'))
snps_vcf = vcf_reader.fetch('19', 0)
for snp in snps_vcf:
     print "Hello."

I will try the other method, but it's not ideal, because I can't guarantee the file is sorted. Thanks!

martijnvermaat commented 10 years ago

Could you paste the complete stack trace for the error?

shashidhar22 commented 9 years ago

Hi I am trying to read the cosmic mutations vcf file one chromosome at a time, and I am having the same issue as stated above, is there any solution to this? Below is my code:

cosmic_vcf = vcf.Reader(open('CosmicMutantExport.vcf.gz','r'))
for lines in cosmic_vcf.fetch(chrom):
    print(records.CHROM,records.POS, records.INFO['GENE'],records.REF,records.ALT)

where chrom is an iterator over the chromosomes.

martijnvermaat commented 9 years ago

The first argument to fetch should be a string, not an iterator.

Could you paste some minimal but complete code that fails? Perhaps as a gist with a small VCF file included?

What versions of Python/PyVCF/pysam are you using?

shashidhar22 commented 9 years ago

I am using python 2.7.6, pysam version 0.8.0 and pyvcf version 0.6.7

I tried the following pieces of code import pysam import vcf cosmic_vcf = vcf.Reader(open('CosmicMutantExport.vcf.gz','r')) for lines in cosmic_vcf.fetch('1'): print(records.CHROM,records.POS, records.INFO['GENE'],records.REF,records.ALT)

and got the following error

Traceback (most recent call last): File "test", line 4, in for lines in cosmic_vcf.fetch('1'): TypeError: fetch() takes at least 3 arguments (2 given)

Here is a sample from my vcf file

fileformat=VCFv4.1

source=COSMICv70

reference=GRCh37

fileDate=20140805

comment="Missing nucleotide details indicate ambiguity during curation process"

comment="URL stub for COSM ID field (use numeric portion of ID)='http://cancer.sanger.ac.uk/cosmic/mutation/overview?id='"

comment="REF and ALT sequences are both forward strand

INFO=

INFO=

INFO=

INFO=

INFO=

CHROM POS ID REF ALT QUAL FILTER INFO

1 69345 COSM911918 C A . . GENE=OR4F5;STRAND=+;CDS=c.255C>A;AA=p.I85I;CNT=1 1 69523 COSM426644 G T . . GENE=OR4F5;STRAND=+;CDS=c.433G>T;AA=p.G145C;CNT=1 1 69538 COSM75742 G A . . GENE=OR4F5;STRAND=+;CDS=c.448G>A;AA=p.V150M;CNT=1 1 69539 COSM1343690 T C . . GENE=OR4F5;STRAND=+;CDS=c.449T>C;AA=p.V150A;CNT=1 1 69540 COSM1560546 G T . . GENE=OR4F5;STRAND=+;CDS=c.450G>T;AA=p.V150V;CNT=1 1 69569 COSM1599955 T C . . GENE=OR4F5;STRAND=+;CDS=c.479T>C;AA=p.L160P;CNT=2 1 69591 COSM3419425 C T . . GENE=OR4F5;STRAND=+;CDS=c.501C>T;AA=p.V167V;CNT=1 1 861390 COSM460103 G C . . GENE=SAMD11;STRAND=+;CDS=c.69G>C;AA=p.P23P;CNT=1 1 865609 COSM336143 C T . . GENE=SAMD11;STRAND=+;CDS=c.147C>T;AA=p.P49P;CNT=1 1 865617 COSM3790304 C G . . GENE=SAMD11;STRAND=+;CDS=c.155C>G;AA=p.S52C;CNT=1

martijnvermaat commented 9 years ago

Ah indeed, omitting the start position is only possible with the current master branch, not with the released 0.6.7. (The commit is 4fba62c.)

You can try the current master branch with:

pip install -e git+https://github.com/jamescasbon/PyVCF.git#egg=PyVCF

As an alternative, you could use the code I suggested in an earlier comment.

martijnvermaat commented 9 years ago

I think this should also answer @ckeown's question (omitting the end position is possible in 0.6.7, but only returns one call at that exact position, or None if there is none).

shashidhar22 commented 9 years ago

Thank you that resolved the issue