pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks source link

TypeError: values expected to be 3-tuple, given len=2 even for haploid GT values #1225

Open karinkumar opened 12 months ago

karinkumar commented 12 months ago

Hello I am trying to write out GT and PL fields for haploid samples. In the header PL must have Number = G. However, even though my GT is either (0,) or (1,) I get this error: TypeError: values expected to be 3-tuple, given len=2 I'm not sure why because clearly my GT is haploid. Furthermore, if I enter the sample['GT] as a 2-tuple (0,0). It expects the PL to be a 6-tuple. However, clearly a 2 tuple GT is diploid rather than triploid. How do I resolve this issue and tell the vcf that I want haploid PL.

My code and full error message is below:


vcf_reader = VariantFile(input_vcf, 'r')

#create header
vcfh = pysam.VariantHeader()
#add samples
for s in list((vcf_reader.header.samples)): 
    vcfh.add_sample(s)

#vcfh = vcf_reader.header
vcfh.add_meta('contig', items=[('ID','chr20')])
vcfh.add_meta('FORMAT', items=[('ID',"PL"), ('Number', 'G'), ('Type','Integer'),
    ('Description','Phred Scaled Likelihood')])
vcfh.add_meta('FORMAT', items=[('ID',"GT"), ('Number', '1'), ('Type','String'),
    ('Description','True Genotype')])

#add samples

vcf_out = VariantFile(output_vcf, 'w', header=vcfh)

N = 3202

for rec in vcf_reader: 
    #print(rec) 
    #create new record
    r=vcf_out.new_record(contig=rec.chrom, start=rec.pos - 1 , stop=rec.pos,
            alleles=[str(rec.ref),str(rec.alts)])
    #get AVG original depth per SITE
    dp_orig = rec.info['DP']/N
    #print("dp orig", dp_orig)
    #get original GT
    GT_dict = [sample['GT'] for sample in rec.samples.values()]   
    #downsample the depth for the SITE: 
    DP = np.random.poisson(dp_orig*coverage_frac, 1)[0]
    if DP!=0: #don't print the variant if it has no coverage, easy to count sites this way
        for n, samp in enumerate(r.samples.values()):
            #print(n, samp)
            samp['GT'] = (GT_dict[n][0],) #add true genotype
            print(len(samp['GT']))
            samp['PL']  = simulate_DP_PL(GT_dict[n], DP, error)
    # Write the processed variant to the output VCF file
        vcf_out.write(r)

vcf_reader.close()
vcf_out.close()
TypeError                                 Traceback (most recent call last)
Cell In[64], line 43
     41         samp['GT'] = (GT_dict[n][0],) #add true genotype
     42         print(len(samp['GT']))
---> 43         samp['PL']  = simulate_DP_PL(GT_dict[n], DP, error)
     44 # Write the processed variant to the output VCF file
     45     vcf_out.write(r)

File pysam/libcbcf.pyx:3546, in pysam.libcbcf.VariantRecordSample.__setitem__()

File pysam/libcbcf.pyx:879, in pysam.libcbcf.bcf_format_set_value()

File pysam/libcbcf.pyx:603, in pysam.libcbcf.bcf_check_values()

TypeError: values expected to be 3-tuple, given len=2
astaric commented 4 months ago

@karinkumar, please make sure that rec.alts in alleles=[str(rec.ref),str(rec.alts)]) contains a single allele.

I rewrote your example to skip the input file and just create one variant with three samples and it seems to accept 2-tuple for PL just fine.

import pysam
from pysam import VariantFile

vcfh = pysam.VariantHeader()
for s in ["A", "B", "C"]:
    vcfh.add_sample(s)
vcfh.add_meta("contig", items=[("ID", "chr20")])
vcfh.add_meta(
    "FORMAT",
    items=[
        ("ID", "PL"),
        ("Number", "G"),
        ("Type", "Integer"),
        ("Description", "Phred Scaled Likelihood"),
    ],
)
vcfh.add_meta(
    "FORMAT",
    items=[("ID", "GT"), ("Number", "1"), ("Type", "String"), ("Description", "True Genotype")],
)
vcf_out = VariantFile("-", "w", header=vcfh)
r = vcf_out.new_record(contig="chr20", start=1, stop=1, alleles=["C", "T"])
for n, samp in enumerate(r.samples.values()):
    samp["GT"] = (0,)
    samp["PL"] = [0, 1]
    vcf_out.write(r)
vcf_out.close()

I get the same error as you do if I change the alleles to alleles=["C", str(("T", "G"))]), which would happen if a record in your input vcf would contain more than one alt allele.