pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

Allow accessing "END" as an info field to enable parsing of inter-chr SVs #1258

Closed jrobinso closed 8 months ago

jrobinso commented 8 months ago

I am getting an error with inter-chromosome variants where "END" might be < position, because it is on another chromosome. For example (VCF 4.2)

1   564466  26582   N   <TRA>   .   PASS    PRECISE;SVMETHOD=Snifflesv1.0.2;CHR2=MT;END=3916;STD_quant_start=198.695999;STD_quant_stop=234.736235;Kurtosis_quant_start=0.913054;Kurtosis_quant_stop=-0.183504;SVTYPE=TRA;SUPTYPE=SR;SVLEN=-1199826434;STRANDS=-+;STRANDS2=2,9,2,9;RE=11 GT:DR:DV    ./.:.:11

I get a warning that END is < start, and the "stop" field is set to start + 1. Fine as far as it goes, but when I try to access "END" as an info field I get this error

KeyError: END is a reserved attribute; access is via record.stop'

Which would be o.k., but record.stop does not contain the correct value.

In VCF 4.4 "END" is defined explicitly as the end position on CHROM, however this was not the case in previous versions, and there are many files in existence that use "CHR2" and "END" to describe inter-chr svs. My request would be to allow accessing "END" as an info record so these files can be used with pysam. A further suggestion would be to not interpret "END" as a position on CHROM if the info field "CHR2" is present, at least for VCF versions < 4.4.

Test file attached.

intra_tra.vcf.zip

jmarshall commented 8 months ago

Allowing access to the raw record.info["END"] field as well as the lightly processed record.stop/.rlen values is reasonable IMHO. This is tracked as #973, but could be tracked here also as another reminder.

As for the abuse of INFO/END in recording interchromosomal variants: this has always been invalid and has always caused problems. END has been explicitly described as the end position on CHROM in the v4.3 and v4.4 specifications for over four years. Prior to that, and today in the v4.1 and v4.2 specifications, it was more implicit; however it could be inferred from the spec text that END refers to the same contig as POS, and certainly the interchromosomal (mis)usage was never described in the spec. See samtools/hts-specs#425 and samtools/hts-specs#436 for details.

(The v4.1 and v4.2 specifications are largely in maintenance mode, but you might want to suggest that this be made explicit in those older documents too. This would be something for the VCF specification maintainers to consider.)

So these variant callers that use CHR2/END as the other side of an interchromosomal variant with END referring to a coordinate on CHR2 have always been outwith the intention of the VCF specification. The INFO/END field has long been used by htslib and htsjdk in constructing indexes, and such CHR2/END values lead to invalid indexes. In the past, this led to interval queries silently failing to return variant records within the queried interval (see https://github.com/samtools/hts-specs/issues/425#issuecomment-505752650), which is about the worst possible failure mode in this area.

By now, most SV callers appear to have been updated and no longer use this CHR2/END notation. However it appears that sniffles, which the variants in your test file are marked as being called by, may not have been completely updated — either in the version your test file used or in the current version.

Your proposal about interpreting END differently in the presence of CHR2 is not uninteresting. That is a VCF spec issue rather than a pysam issue, so you might consider raising it as an hts-specs issue. However considering that four years has passed since this clarification was dealt with and that the representation causes indexing failures with current tools, I suspect that there will not be much appetite for further change here.

jrobinso commented 8 months ago

@jmarshall Thanks for the background, as always, and considering this. I actually implemented a workaround this afternoon (igv-reports), since the string representation of the variant record is available I just parse END out if needed.

If there's a CHR2 info field I would naturally expect a POS2. So I myself would not propose treating END differently in the spec, I was just noting that if CHR2 is present its very likely END does not refer to the the "chrom" sequence, so the "stop" value computed might be incorrect. So my suggestion would be that, maybe, pysam ignore END in that case.

The files I'm dealing with now are from an important cancer dataset in VCF 4.2, and produced several years ago. Although the 4.2 spec might be obsolete the files are still out there, so I have to deal with them, this has always been an issue for me in the bam / vcf world.

I'm fine with closing this as I have a workaround, and the same issue is tracked elsewhere.