pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
776 stars 275 forks source link

How to get actual GT data? #564

Closed BioComSoftware closed 6 years ago

BioComSoftware commented 6 years ago

Perhaps I'm missing something obvious, but I've been all over the documentation and I can't find this.

I have BCF variant file, which is the amalgamated result of 70 samples. So, there's 70 GT:PL references.

I'm opening a BCF file as follows...

bcf_in = VariantFile(self.bcffile)  # auto-detect input format        
for rec in bcf_in.fetch():
    print(type(rec),rec)

When I just print the rec (as above: print(type(rec),rec)) I get...

<class 'pysam.libcbcf.VariantRecord'> 1 895427  .   G   C   228 PASS    VDB=0.2049;SGB=-0.693136;MQSB=1;MQ0F=0;MQ=60;RPB=1;MQB=1;BQB=1;ICB=1;HOB=0.5;DP=3155;DP4=19,87,1634,1313;AN=140;AC=139  GT:PL   1/1:255,105,0   1/1:255,163,0   1/1:255,12,0    1/1:255,105,0   1/1:255,82,0    1/1:255,103,01/1:255,25,0   1/1:255,59,0    1/1:255,35,0    1/1:60,6,0  1/1:255,63,0    1/1:255,70,0    1/1:255,75,0    1/1:255,159,0   1/1:255,42,0    1/1:255,47,0    1/1:255,30,0    1/1:255,33,0    1/1:255,104,0   1/1:255,98,0    1/1:255,103,0   1/1:255,60,0    1/1:255,75,0    1/1:255,137,0   1/1:255,77,0    1/1:255,41,0    1/1:255,144,0   1/1:255,101,0   1/1:255,81,0    1/1:255,40,0    1/1:255,148,0   1/1:255,163,0   1/1:255,134,0   1/1:255,92,0    1/1:255,24,0    1/1:255,66,0    1/1:255,148,0   1/1:255,59,0    1/1:255,102,0   1/1:255,105,0   1/1:255,78,0    1/1:255,112,0   1/1:255,85,0    1/1:255,70,0    1/1:255,100,0   1/1:255,33,0    1/1:255,105,0   1/1:255,157,0   1/1:255,77,0    1/1:255,111,0   1/1:255,92,0    1/1:255,134,0   1/1:255,41,0    1/1:255,178,0   1/1:255,114,0   1/1:255,78,0    0/1:255,0,19    1/1:255,40,0    1/1:255,0,6 1/1:255,211,0   1/1:255,172,0   1/1:122,15,0    1/1:255,39,0    1/1:255,76,0    1/1:255,163,0   1/1:255,76,0    1/1:255,96,0    1/1:255,69,0    1/1:255,90,0    1/1:255,42,0

Now, I can get nearly every peice of info out of the rec (including rec.header, rec.id, rec.info, rec.pos, rec.qual, rec.ref, rec.rid, rec.rlen, and rec.samples) But the one thing I CANT get is the actual GT:PL data itself. (the 1/1:255,69,0 part).

Is there an attribute/method for extracting this? Or do I just need to convert rec to a string and split() it?

P.s.

When I print(rec.samples) I get the sample header names...not the data :(

BioComSoftware commented 6 years ago

Bump

bioinformed commented 6 years ago

You want: vcf.samples[name_or_index][‘GT’]

Sent from my tiny remote device.

On Nov 16, 2017, at 3:15 AM, BioComSoftware notifications@github.com wrote:

Bump

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

BioComSoftware commented 6 years ago

Awesome! Thanks!

Kevin Jacobs mailto:notifications@github.com 16 November 2017 at 09:35 You want: vcf.samples[name_or_index][‘GT’]

Sent from my tiny remote device.

On Nov 16, 2017, at 3:15 AM, BioComSoftware notifications@github.com wrote:

Bump

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pysam-developers/pysam/issues/564#issuecomment-344851811, or mute the thread https://github.com/notifications/unsubscribe-auth/AGX53sJYooX36cOU5a-gzaOfMSrRLvcZks5s2_PVgaJpZM4QSKtT.

mehaffeym commented 3 weeks ago

This is exactly what I was looking for that wasn't in the "Working with VCF/BCF formatted files" section. It would be super helpful to add a short example of this command there as well.