pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

I want to use pysam to get the PS number in the vcf file #1251

Open ZON-ZONG-MIN opened 8 months ago

ZON-ZONG-MIN commented 8 months ago

My vcf file is [ ONT R10.4.1 provided by EPI2ME ] aws s3 ls --no-sign-request s3://ont-open-data/giab_2023.05/analysis/variant_calling/hg002_hac_60x/hg002.wf_snp.vcf.gz

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  hg002
chr1    43042   .       G       A       6.98    PASS    F       GT:GQ:DP:AF:PS  0|1:6:78:0.7692:43042
chr1    43205   .       C       T       6.69    PASS    F       GT:GQ:DP:AF:PS  1/1:6:77:0.7662:.
chr1    43374   .       C       T       7.70    PASS    F       GT:GQ:DP:AF:PS  1/1:7:77:0.7792:.
chr1    43586   .       C       A       8.71    PASS    F       GT:GQ:DP:AF:PS  0|1:8:76:0.7763:43042
chr1    43666   .       C       A       8.69    PASS    F       GT:GQ:DP:AF:PS  0|1:8:76:0.7368:43042
chr1    43796   .       C       CAAAA,CAAA      2.18    PASS    F       GT:GQ:DP:AF:PS  1/2:2:74:0.2568:.
....

I tried the following sample code

from pysam import VariantFile
bcf_in = VariantFile("hg002.wf_snp.vcf.gz")  # auto-detect input format
bcf_out = VariantFile('-', 'w', header=bcf_in.header)

for rec in bcf_in.fetch():
     print(rec.format["PS"])

but I get these outputs

<pysam.libcbcf.VariantMetadata object at 0x7f017a3ef130>
<pysam.libcbcf.VariantMetadata object at 0x7f017a3ef100>
<pysam.libcbcf.VariantMetadata object at 0x7f017a3e6160>
<pysam.libcbcf.VariantMetadata object at 0x7f017a3e6160>
<pysam.libcbcf.VariantMetadata object at 0x7f017a3ef220>
....

How should I get the correct PS numbers? Thank you!

awgymer commented 7 months ago

I believe you need to access the value from samples not format. format is actually a mapping of the formats from the header.

# access by sample name
rec.samples['hg002']['PS']

# access by sample index
rec.samples[0]['PS']