samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Unexpected handling of invalid (non-integer) `POS` values #1570

Closed colin-nolan closed 1 year ago

colin-nolan commented 1 year ago

According to the VCF specification, POS must be an integer:

POS — position: The reference position, with the 1st base having position 1. ... Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)

If a variant's POS value contains a string that doesn't encode an integer:

$ cat example.vcf
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,assembly=b37,length=51304566>
##SAMPLE=<ID=SAMPLE_1>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE_1
1       stuff   ID1     G       T       .       PASS    .       GT      1/0
1       1.5   ID2    G       T       .       PASS    .       GT      1/0

BCFtools silently ignores the issue and parses these invalid positions into valid ones:

$ ./bcftools view -H example.vcf
1       0       ID1     G       T       .       PASS    .       GT      1/0
1       1       ID2     G       T       .       PASS    .       GT      1/0
$ echo $?
0

This happens because hts_str2int doesn't have any checks for non-int values: https://github.com/samtools/htslib/blob/008eabd/textutils_internal.h#L218 (which I think mirrors strtol's behaviour).

It would be a nice to have for BCFtools to fail upon detection of invalid POS values. For example, column swapping would get detected, and not potentially result in a valid VCF full of 0positions! A valid argument against doing anything about this is GIGO. However, BCFtools certainly validates some things - and after doing an upgrade, I noticed the number of checks are increasing.

colin-nolan commented 1 year ago

Thanks @pd3 for moving (I'm never sure which repo bcftools/htslib I should report things on).