pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
782 stars 274 forks source link

[VCFs] Modifying END position works for insertions but not deletions #1200

Open rickymagner opened 1 year ago

rickymagner commented 1 year ago

Hi,

I have a minimal example VCF ("example.vcf") modified from the example in the VCF spec:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.14+htslib-1.14
##bcftools_viewCommand=view example.vcf; Date=Wed Jun 28 16:24:26 2023
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
1   10000   .   GA  G   30  PASS    DP=14   GT  0|0 1|0 1/1
1   20000   .   T   TA  30  PASS    DP=11   GT  0|0 0|1 0/0

The first variant is a deletion whereas the second is an insertion. I copy this file and run bcftools view -o example.vcf.gz example.vcf and bcftools index -t example.vcf.gz. The follow script meant to add INFO/END to each variant has bizarre behavior:

import pysam

with pysam.VariantFile('example.vcf.gz') as vcf:
    header = vcf.header
    header.add_line(f'##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate in reference for SV">')
    with pysam.VariantFile('out-example.vcf.gz', 'w', header=header) as out_vcf:
        for record in vcf:
            record.stop = record.pos + 1
            out_vcf.write(record)

The result is:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1,length=10000000>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##bcftools_viewVersion=1.14+htslib-1.14
##bcftools_viewCommand=view -o example.vcf.gz example.vcf; Date=Wed Jun 28 16:24:56 2023
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate in reference for SV">
##bcftools_viewCommand=view out-example.vcf.gz; Date=Wed Jun 28 16:26:52 2023
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
1       10000   .       GA      G       30      PASS    DP=14   GT      0|0     1|0     1/1
1       20000   .       T       TA      30      PASS    DP=11;END=20001 GT      0|0     0|1     0/0

In other words, the END attribute is added as expected only for insertions, but not deletions. Why is this happening? Is there any way to force the insertion behavior onto the deletions?

As an aside bit of feedback: I know that the END info field is for some reason distinguished in pysam in ways that aren't super well-documented, and that there have been a lot of issues over the years about how it can be a bit tricky and counterintuitive to work with (including the header line adding hack that's necessary to get it into the INFO fields). For example, see any of these issues: here, here, or here. I wonder if the developers could revisit the choice to treat the END field as separate from any other INFO field, especially since it's crucial for many types of SV data analysis.

rickymagner commented 1 year ago

Here is the issue in the code. In libcbcf here there is the bcf_sync_end method that gets called on a VariantRecord every time before writing it to the output file. This contains the following block:

    cdef int end_id = bcf_header_get_info_id(record.header.ptr, b'END')

    # …

    # Delete INFO/END if no alleles are present or if rlen is equal to len(ref)
    # Always keep END for symbolic alleles
    if not has_symbolic_allele(record) and (not record.ptr.n_allele or record.ptr.rlen == ref_len):
        # If INFO/END is not defined in the header, it doesn't exist in the record
        if end_id >= 0:
            info = bcf_get_info(hdr, record.ptr, b'END')
            if info and info.vptr:
                if bcf_update_info(hdr, record.ptr, b'END', NULL, 0, info.type) < 0:
                    raise ValueError('Unable to delete END')

    else:
        # Create END header, if not present
        if end_id < 0:
            record.header.info.add('END', number=1, type='Integer', description='Stop position of the interval')

        # Update to reflect stop position
        bcf_info_set_value(record, b'END', record.ptr.pos + record.ptr.rlen)

You can see the condition record.ptr.rlen == ref_len should occur exactly when there is a deletion, but not when there's an insertion, which means it overwrites the END field to remove it in this case. I will submit a PR soon attempting to fix this.