Open rickymagner opened 1 year ago
Here is the issue in the code. In libcbcf
here there is the bcf_sync_end
method that gets called on a VariantRecord every time before writing it to the output file. This contains the following block:
cdef int end_id = bcf_header_get_info_id(record.header.ptr, b'END')
# …
# Delete INFO/END if no alleles are present or if rlen is equal to len(ref)
# Always keep END for symbolic alleles
if not has_symbolic_allele(record) and (not record.ptr.n_allele or record.ptr.rlen == ref_len):
# If INFO/END is not defined in the header, it doesn't exist in the record
if end_id >= 0:
info = bcf_get_info(hdr, record.ptr, b'END')
if info and info.vptr:
if bcf_update_info(hdr, record.ptr, b'END', NULL, 0, info.type) < 0:
raise ValueError('Unable to delete END')
else:
# Create END header, if not present
if end_id < 0:
record.header.info.add('END', number=1, type='Integer', description='Stop position of the interval')
# Update to reflect stop position
bcf_info_set_value(record, b'END', record.ptr.pos + record.ptr.rlen)
You can see the condition record.ptr.rlen == ref_len
should occur exactly when there is a deletion, but not when there's an insertion, which means it overwrites the END
field to remove it in this case. I will submit a PR soon attempting to fix this.
Hi,
I have a minimal example VCF ("example.vcf") modified from the example in the VCF spec:
The first variant is a deletion whereas the second is an insertion. I copy this file and run
bcftools view -o example.vcf.gz example.vcf
andbcftools index -t example.vcf.gz
. The follow script meant to addINFO/END
to each variant has bizarre behavior:The result is:
In other words, the
END
attribute is added as expected only for insertions, but not deletions. Why is this happening? Is there any way to force the insertion behavior onto the deletions?As an aside bit of feedback: I know that the
END
info field is for some reason distinguished in pysam in ways that aren't super well-documented, and that there have been a lot of issues over the years about how it can be a bit tricky and counterintuitive to work with (including the header line adding hack that's necessary to get it into the INFO fields). For example, see any of these issues: here, here, or here. I wonder if the developers could revisit the choice to treat theEND
field as separate from any other INFO field, especially since it's crucial for many types of SV data analysis.