quinlan-lab / vcf2db

create a gemini-compatible database from a VCF
MIT License
55 stars 13 forks source link

vcf2db without samples #38

Open parkesK opened 6 years ago

parkesK commented 6 years ago

Hi there,

I annotated a VCF of Clinvar variants using vcfanno and would like to create a database using vcf2db. However, there are no samples in the VCF. I tried adding a fake sample in the header and then making a ped file to go along with this, but am getting an error with this:

File "cyvcf2/cyvcf2.pyx", line 932, in cyvcf2.cyvcf2.Variant.gt_bases.get (cyvcf2/cyvcf2.c:22059) ValueError: range() step argument must not be zero

I am guessing this is because there are no genotypes? Is there anyway to upload this vcf without samples? I have attached 2 VCFs that are each 1000 lines. The one that has _Copy in the title is the original; the other one has the fake sample added. I also attached my fake ped file.

Thanks, Parkes

clinvar_noncoding_pathogenic_headed_NoAnno_Split_VEP_annotated_HeaderFixed_Copy_1000.vcf.gz

fake.ped.gz

clinvar_noncoding_pathogenic_headed_NoAnno_Split_VEP_annotated_HeaderFixed_1000.vcf.gz

Phillip-a-richmond commented 6 years ago

+1 on this request

davemcg commented 6 years ago

Your faux vcf is wrong - you need to add, bare minimum, GT to the FORMAT column and a fake genotype (0/1) to the fake sample 20 column. You can't just leave them blank.

Also, you need to add a FORMAT header field to define what GT is (https://samtools.github.io/hts-specs/VCFv4.2.pdf).

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

davemcg commented 6 years ago

Here's a super barebones python script that will add a fake FORMAT and sample field to a vcf. You have to give a gzipped vcf as the first argument and a fake sample name (no spaces or special characters!) as the second argument to the script.

python3 this_script.py clinvar.vcf.gz FAKESAMPLENAME | bgzip > new_clinvar.vcf.gz

#!/usr/local/bin/python3

import sys
import gzip

vcf = sys.argv[1]
sample_name = sys.argv[2]

for line in gzip.open(vcf,'r'):
    line = line.decode('utf-8')
    if line[0:2] == '##':
        print(line[:-1])
    elif line[0:2] == '#C':
        print('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
        print('##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">')
        print('##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">')
        print(line[:-1] + '\tFORMAT\t' + sample_name)
    else:
        line = line.split()
        line = line[:-1]
        print('\t'.join(line) + '\t.\tGT:GQ:DP\t0/1:100:100')