pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
776 stars 274 forks source link

`GT` field not set when creating second instance of record. #1308

Open awgymer opened 2 weeks ago

awgymer commented 2 weeks ago

When generating some rows for tests I discovered that the GT field seems to be unset on the creation of a second identical record. Please see the below minimal example. (pysam v.0.21.0)

vcfh = pysam.VariantHeader()
vcfh.contigs.add("chr1")
vcfh.add_sample("sample1")
vcfh.formats.add("GT", "1", "String", "Genotype")

vd = {
    "id": "INS_1",
    "contig": "chr1",
    "start": 10,
    "stop": 15,
    "alleles": ["A", "TCGA"],
    "samples": [
        {
            "GT": (0, 0)
        }
    ],
}
rec = vcfh.new_record(**vd)

rec2 = vcfh.new_record(**vd)

rec3 = vcfh.copy().new_record(**vd)

rec.samples[0]['GT']
# (0, 0)

rec2.samples[0]['GT']
# Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "pysam/libcbcf.pyx", line 3541, in pysam.libcbcf.VariantRecordSample.__getitem__
#  File "pysam/libcbcf.pyx", line 813, in pysam.libcbcf.bcf_format_get_value
# KeyError: 'invalid FORMAT: GT'

rec3.samples[0]['GT']
# Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "pysam/libcbcf.pyx", line 3541, in pysam.libcbcf.VariantRecordSample.__getitem__
#  File "pysam/libcbcf.pyx", line 813, in pysam.libcbcf.bcf_format_get_value
# KeyError: 'invalid FORMAT: GT'

# This is the most baffling part
# A completely fresh start in the same interpreter/script gives the error

vcfh2 = pysam.VariantHeader()
vcfh2.contigs.add("chr1")
vcfh2.add_sample("sample1")
vcfh2.formats.add("GT", "1", "String", "Genotype")
rec4 = vcfh2.new_record(**vd)

rec4.samples[0]['GT']
# Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "pysam/libcbcf.pyx", line 3541, in pysam.libcbcf.VariantRecordSample.__getitem__
#  File "pysam/libcbcf.pyx", line 813, in pysam.libcbcf.bcf_format_get_value
# KeyError: 'invalid FORMAT: GT'

I'm not entirely sure why this happens. The header object still seems to have the GT field under formats in all the failing cases.

jmarshall commented 2 weeks ago

Thanks for the clear example code.

If you print out vd after each of the assignments, you'll see that your code is falling victim to good ol' Python pass-by-reference. So something under new_record() needs to be fixed to not alter its arguments.

awgymer commented 2 weeks ago

Oh ooft! I didn't even think to check that! It makes sense though as I did see a reference to pop once when I passed a dict instead of list of dicts to new_record.

Looks like this is the culprit: https://github.com/pysam-developers/pysam/blob/0787ca9da997b5911c00fd12584dad9741c82fb4/pysam/libcbcf.pyx#L2113-L2122

.pop used to get GT if it's in a sample (also it's popped from the kwargs but not sure that would matter since it wouldn't be a dict then).

My guess is that this is done rather than just using update on everything so that GT is always first into the dict (since dict insert order now matters).

I think simply setting the initial assignment line to not use pop will work fine, since the update won't mess up the order.

        if samples:
            for i, sample in enumerate(samples):
                if 'GT' in sample:
                    rec.samples[i]['GT'] = sample['GT']
                rec.samples[i].update(sample)