Add VCF/BCF support for POS=0 coordinate

pd3 commented 1 year ago

Resolves #1571

daviesrob commented 1 year ago

This accidentally reverts the last htscodecs submodule update. Please could you update your local copy with git submodule update, and then fix up your bt-1871 branch? I'm not entirely sure, but I think you'd need to do something like:

( cd htscodecs && git pull && git checkout v1.4.0 )
git add htscodecs
git commit --amend

Also, if you rebase the branch on to the current develop the AppVeyor tests should now work again.

jkbonfield commented 1 year ago

git pull on develop just did it for me. You only need to do this sort of thing by hand if you've been editing the submodule I think.

pd3 commented 1 year ago

This gives me headaches, always manage to mess it up. Did what you suggested and now I've got

On branch bt-1871
Your branch and 'origin/bt-1871' have diverged,
and have 1 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

Should I merge as suggested and push the result?

EDIT: push -f fixed it. Hopefully

daviesrob commented 1 year ago

Hmm, this is a bit of a rabbit-hole. I don't think it's a good idea to change reg2bins(), as it would become non-compliant with the index specification. (BCF defers to SAM for this, and SAM specifically defines how the (-1, 0] interval should be handled.

Now for htslib it turns out that this is moot, because we don't index unmapped reads; we reject SAM positions < 0; and we've actually been pushing this interval into the (0, 1] bin in hts_idx_push() since version 1.4. However, it turns out the Picard puts this interval into bin 4860, as defined by the specification. I think that may be by accident though because htsjdk also prevents you from searching for it in the reg2bins() equivalent so presumably has the same problem you're trying to fix here...

My test file:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,assembly=b37,length=51304566>
##SAMPLE=<ID=SAMPLE_1>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE_1
1   0   ID1 C   G   .   PASS    .   GT  1/0
1   1   ID2 G   T   .   PASS    .   GT  1/0

Picard index dump (obtained by compiling htslib with -DDEBUG_INDEX:

$ ./tabix /tmp/example_p.vcf.gz 1
format='tbi', min_shift=14, n_lvls=5, n_bins=37449, l_meta=30 n=1, m=1, n_no_coor=0
======== BIN Index - tid=0, n_buckets=4, size=2
    bin=4680, level=4, parent=584, n_chunks=1, loff=0, interval=[536739841 - 536870912]
        chunk=0, u=434, v=462
    bin=4681, level=5, parent=585, n_chunks=1, loff=434, interval=[1 - 16384]
        chunk=0, u=462, v=23527424
======== LINEAR Index - tid=0, n_values=1
        entry=0, offset=434, interval=[1 - 16384]
[W::tbx_parse1] Coordinate <= 0 detected. Did you forget to use the -0 option?
1   0   ID1 C   G   .   PASS    .   GT  1/0
1   1   ID2 G   T   .   PASS    .   GT  1/0

Note that the whole-chromosome search works because it asks for an interval wide enough to include bin 4680's official range, which means we also pick up the (zero-based) position -1 stuff in there. It wouldn't work if you asked for a smaller interval including that position though.

For maximum compatibility we probably want to make reg2bins() include bin 4680 (or its csi equivalent) when you ask for a range starting at -1. I don't think it needs to return the full list in the specification though because as far as I can make out intervals starting at -1 can only ever end up in bins 4680 or 0.

This should probably be done as a special case, so we can avoid issues with shifting negative numbers. We may also want to discuss if we should be indexing this position in a currently spec-compliant way, or updating the specification so it follows the practice we've been doing for a while (which would break htsjdk, but as I think it can't do the look-up, probably not too badly compared with the current state).

daviesrob commented 1 year ago

I've pushed up my progress so far on this:

Reverted the check that was added in insert_to_l() as a similar one is already done before calling the function.
Fixed a signed overflow in hts_itr_query() which caused it to check for numerous negative bin numbers when passed an end value of HTS_POS_MAX.
Work around the discrepancy about where the range [-1, 0) is stored in the binning index by looking for both options.
Allows range specifications for regions in VCF/BCF to start at base 0, and ensures that ranges like chr:-0 work correctly, both in hts_parse_region() and in the synced reader.
Adds some tabix tests for VCF/BCF files with variants at position 0.

The region parsing part adds a new flag HTS_PARSE_REF_START_MINUS_1 to make hts_parse_region() return a start position of -1 instead of zero for the beginning of the chromosome. I don't really like that name, better suggestions are welcome. Actually setting the flag so the revised start only applies to VCF and BCF files is a bit of a kludge at the moment. hts_itr_querys() doesn't get many clues at the moment as to what sort of file it's making an iterator for. Currently I take the somewhat hacky approach of testing its readrec argument to try to find out what soft of file it's reading, and also the hdr if it looks like something for tabix. This does work, but it might be better to pass the information in a more explicit way - which would require adding a new interface to the API.

samtools / htslib

Add VCF/BCF support for POS=0 coordinate #1573