samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Add VCF/BCF support for POS=0 coordinate #1573

Open pd3 opened 1 year ago

pd3 commented 1 year ago

Resolves #1571

See also https://github.com/samtools/bcftools/pull/1875

daviesrob commented 1 year ago

This accidentally reverts the last htscodecs submodule update. Please could you update your local copy with git submodule update, and then fix up your bt-1871 branch? I'm not entirely sure, but I think you'd need to do something like:

( cd htscodecs && git pull && git checkout v1.4.0 )
git add htscodecs
git commit --amend

Also, if you rebase the branch on to the current develop the AppVeyor tests should now work again.

jkbonfield commented 1 year ago

git pull on develop just did it for me. You only need to do this sort of thing by hand if you've been editing the submodule I think.

pd3 commented 1 year ago

This gives me headaches, always manage to mess it up. Did what you suggested and now I've got

On branch bt-1871
Your branch and 'origin/bt-1871' have diverged,
and have 1 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

Should I merge as suggested and push the result?

EDIT: push -f fixed it. Hopefully

daviesrob commented 1 year ago

Hmm, this is a bit of a rabbit-hole. I don't think it's a good idea to change reg2bins(), as it would become non-compliant with the index specification. (BCF defers to SAM for this, and SAM specifically defines how the (-1, 0] interval should be handled.

Now for htslib it turns out that this is moot, because we don't index unmapped reads; we reject SAM positions < 0; and we've actually been pushing this interval into the (0, 1] bin in hts_idx_push() since version 1.4. However, it turns out the Picard puts this interval into bin 4860, as defined by the specification. I think that may be by accident though because htsjdk also prevents you from searching for it in the reg2bins() equivalent so presumably has the same problem you're trying to fix here...

My test file:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,assembly=b37,length=51304566>
##SAMPLE=<ID=SAMPLE_1>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE_1
1   0   ID1 C   G   .   PASS    .   GT  1/0
1   1   ID2 G   T   .   PASS    .   GT  1/0

Picard index dump (obtained by compiling htslib with -DDEBUG_INDEX:

$ ./tabix /tmp/example_p.vcf.gz 1
format='tbi', min_shift=14, n_lvls=5, n_bins=37449, l_meta=30 n=1, m=1, n_no_coor=0
======== BIN Index - tid=0, n_buckets=4, size=2
    bin=4680, level=4, parent=584, n_chunks=1, loff=0, interval=[536739841 - 536870912]
        chunk=0, u=434, v=462
    bin=4681, level=5, parent=585, n_chunks=1, loff=434, interval=[1 - 16384]
        chunk=0, u=462, v=23527424
======== LINEAR Index - tid=0, n_values=1
        entry=0, offset=434, interval=[1 - 16384]
[W::tbx_parse1] Coordinate <= 0 detected. Did you forget to use the -0 option?
1   0   ID1 C   G   .   PASS    .   GT  1/0
1   1   ID2 G   T   .   PASS    .   GT  1/0

Note that the whole-chromosome search works because it asks for an interval wide enough to include bin 4680's official range, which means we also pick up the (zero-based) position -1 stuff in there. It wouldn't work if you asked for a smaller interval including that position though.

For maximum compatibility we probably want to make reg2bins() include bin 4680 (or its csi equivalent) when you ask for a range starting at -1. I don't think it needs to return the full list in the specification though because as far as I can make out intervals starting at -1 can only ever end up in bins 4680 or 0.

This should probably be done as a special case, so we can avoid issues with shifting negative numbers. We may also want to discuss if we should be indexing this position in a currently spec-compliant way, or updating the specification so it follows the practice we've been doing for a while (which would break htsjdk, but as I think it can't do the look-up, probably not too badly compared with the current state).

daviesrob commented 1 year ago

I've pushed up my progress so far on this:

The region parsing part adds a new flag HTS_PARSE_REF_START_MINUS_1 to make hts_parse_region() return a start position of -1 instead of zero for the beginning of the chromosome. I don't really like that name, better suggestions are welcome. Actually setting the flag so the revised start only applies to VCF and BCF files is a bit of a kludge at the moment. hts_itr_querys() doesn't get many clues at the moment as to what sort of file it's making an iterator for. Currently I take the somewhat hacky approach of testing its readrec argument to try to find out what soft of file it's reading, and also the hdr if it looks like something for tabix. This does work, but it might be better to pass the information in a more explicit way - which would require adding a new interface to the API.