samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
633 stars 241 forks source link

Colons in VCF #CHROM columns are not parsed and throw error #2179

Closed groodri closed 1 month ago

groodri commented 2 months ago

Note: Specific regions were substituted by <gene>, <chrom> and <pos> to hide sensitive information. Real usage did not contain brackets.

After combining region of interest locations with bedtools to build a new reference FASTA:

bedtools getfasta -fi <genome>.fasta -bed <regions_of_interest>.bed -name+

And then doing variant calling with this new reference, we obtain a chromosome (VCF column #CHROM) name structure like so: <gene>::<chrom>:<pos>

Upon using bcftools concat to merge VCF files:

bcftools concat -a <file>.snp.vcf.gz <indel>.indel.vcf.gz -O v -o <file>.merged.vcf

This structure throws an error:

Checking the headers and starting positions of 2 files
[E::_regions_init_string] Could not parse the region(s): <gene>::<chrom>:<pos>

However, as of VCFv4.3 and later this structure should be legal. See the official VCFv4.3 changelog:

7.1 Changes to VCFv4.3 (...) The VCF specification previously disallowed colons (‘:’) in contig names to avoid confusion when parsing breakends, but this was unnecessary. Even with contig names containing colons, the breakend mate position notation can be unambiguously parsed because the “:pos” part is always present.

This was tested on bcftools=1.20, installed from source code, and previously bcftools=1.15.1, installed on conda.

I tried to confirm which HTSlib version was being compiled, but I couldn't figure it out. Does the current release of bcftools support VCFv4.3 and later?

jkbonfield commented 1 month ago

While colons within chromosome names are legal within the VCF specification, this was a huge mistake which sadly we cannot undo. They create ambiguities in parsing.

We've seen this before, where people create a contig name "chr1" and then another "chr1:1000-2000" to indicate a portion of that chromosome. Only now how do you specify the range of chr1 that happens to cover 1000-2000? Arguably it's a tooling issue, as the specifications don't define language like "chr1:1000-2000" to mean a sub-range, but the only reason people make chromosomes named like that is specifically because that's the encoding pattern used by tooling. If tools had used "=" instead of ":" then almost certainly they'd get used in chromosome names instead.

There are some mitigations for this in htslib, where we added the ability to do {chr1}:1000-2000 as a method for explicitly stating the chromosome name is the "chr1" portion, but it's not honoured everywhere and it's specific to command line arguments. We obviously cannot change the specification of things like BED.

In short, yes this is probably a bug, but please consider choosing chromosome names that aren't ambiguous in meaning to downstream tools.

jmarshall commented 1 month ago

While colons within chromosome names are legal within the VCF specification, this was a huge mistake which sadly we cannot undo. They create ambiguities in parsing.

I do not see this as a mistake — once communities such as HLA had standardised on allele names containing colons and with users likely to use them as contigs, it became a requirement for formats such as VCF to enable this usage. I am unaware of any ambiguities in parsing VCF due to colons in chromosome names. It may create difficulties for tools, but if you have a VCF parsing ambiguity in mind please show an example.

It would be a lot easier to investigate this if @groodri provided an actual example of the problem. I attempted to reproduce the problem using a couple of trivial VCF files:

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:1000>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
JAK2::chr1:1000 100 a   A   T   .   .   .
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:2000>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
JAK2::chr1:2000 500 b   G   C   .   .   .

Compressing and indexing them and running the command shown produces:

$ bcftools concat -a ib2179a.vcf.gz ib2179b.vcf.gz 
Checking the headers and starting positions of 2 files
[E::_regions_init_string] Could not parse the region(s): JAK2::chr1:1000
[E::_regions_init_string] Could not parse the region(s): JAK2::chr1:2000
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:1000>
##contig=<ID=JAK2::chr1:2000>
##bcftools_concatVersion=1.20-2-gad818b15+htslib-1.20-2-g18af5c1
##bcftools_concatCommand=concat -a ib2179a.vcf.gz ib2179b.vcf.gz; Date=Tue May  7 20:28:45 2024
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
$ 

Thus the Could not parse the region(s) message has been produced as a warning and bcftools has output a VCF file containing no records. As the in the ##contig=<ID=…> lines unambiguously denotes a contig name (and there are no region strings anywhere near ID in the grammar, such as it is), passing this to _regions_init_string() as is is clearly a bcftools (or htslib) bug. Whatever the code is trying to do, it needs to do it by calling a dumber function that accepts only a contig name or by adding {…} to its call, e.g., so that it calls _regions_init_string("{JAK2::chr1:1000}") for the ##contig line from the first file.

In the meantime this can be worked around by adding an appropriate -r option to the command so that it doesn't go running _regions_init_string() on raw contig headers. For this simple example that looks like

$ bcftools concat -r '{JAK2::chr1:1000},{JAK2::chr1:2000}' -a ib2179a.vcf.gz ib2179b.vcf.gz 
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:1000>
##contig=<ID=JAK2::chr1:2000>
##bcftools_concatVersion=1.20-2-gad818b15+htslib-1.20-2-g18af5c1
##bcftools_concatCommand=concat -r {JAK2::chr1:1000},{JAK2::chr1:2000} -a ib2179a.vcf.gz ib2179b.vcf.gz; Date=Tue May  7 20:46:53 2024
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
JAK2::chr1:1000 100 a   A   T   .   .   .
JAK2::chr1:2000 500 b   G   C   .   .   .
$

which I assume is the sort of output the OP was expecting.

jkbonfield commented 1 month ago

My point was one of command line parsing. Bcftools for example permits bcftools view foo.vcf.gz chr1:1000-2000 as a valid syntax. If we have chromosomes named both "chr1" and "chr1:1000-2000" then we're in trouble, and have to start disambiguating based on the presences of contig lines, which aren't even a requirement sadly. (Although they presumably are for indexing of vcf.)

You're right that our hands were forced by HLA, but I still think it was a bad decision right from day one to use colon as a meaning for sub-regions and to not ban it in chromosome names. Life could have been so much simpler, but that's water under the bridge.

groodri commented 1 month ago

To put into context, my use case is amplicon sequencing. Because we are only interested in specific amplicon sequences, it makes more sense computationally to use the amplicon sequences as reference for variant calling (e.g. JAK2::chr1:1000-2000 could be an amplicon in my reference FASTA) then, say, the entire human genome. However, these regions of interest need to stay interpretable in order to be correctly annotated, so that found variants can be evaluated for their importance -- that is, they need to be translated into their locations in the human genome.

Thus, I have VCF files that look like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
JAK2::chr1:1000-2000    100 a   A   T   .   .   .

Where called variants are positioned on the amplicon, which itself is named after the gene and region in the human genome.

I understand there are multiple ways around this (and thank you for your suggestion @jmarshall -- that is indeed the output I was expecting). I was however not sure that this is the intended behavior for bcftools. Particularly, because of the fact that bcftools throws this error as a warning, so exit code is 0. Because exit code is 0, this makes it difficult to integrate this tooling into analysis pipelines with e.g. Nextflow, as we cannot directly test if a VCF file has appropriate contig names for bcftools. If allowing colons in contig names is not suitable, then would it not be better to throw an error, as opposed to a warning?

jkbonfield commented 1 month ago

My point was one of command line parsing. Bcftools for example permits bcftools view foo.vcf.gz chr1:1000-2000 as a valid syntax. If we have chromosomes named both "chr1" and "chr1:1000-2000" then we're in trouble, and have to start disambiguating based on the presences of contig lines, which aren't even a requirement sadly. (Although they presumably are for indexing of vcf.)

Infact a proof of this being a poor choice is in the code that implements it:

https://github.com/samtools/htslib/blob/develop/synced_bcf_reader.c#L1034-L1038

// File name or a list of genomic locations. If file name, NULL is returned. // Recognises regions in the form chr, chr:pos, chr:beg-end, chr:beg-, {weird-chr-name}:pos. // Cannot use hts_parse_region() as that requires the header and if header is not present, // wouldn't learn the chromosome name. static bcf_sr_regions_t _regions_init_string(const char str)

Basically it's stuffed. It accepts chr, chr:pos and chr:beg-end, but cannot disambiguate "chr:beg-end" being a poorly chosen chromosome name from a region of "chr". Basically it needs a code rewrite to include the header so it can then sanitize against the contig lines and figure out which bits of the string represents the chromosome and which may represent a region.

I think our previous discussions on this were purely from a VCF specification viewpoint on whether or not it was technically parseable within the format, but we never discussed whether it was sensible.

It can be solved in code, but I wonder how many other places the same bugs occur?

diff --git a/synced_bcf_reader.c b/synced_bcf_reader.c
index a43ab15a..de88024a 100644
--- a/synced_bcf_reader.c
+++ b/synced_bcf_reader.c
@@ -72,7 +72,7 @@ typedef struct
 aux_t;

 static int _regions_add(bcf_sr_regions_t *reg, const char *chr, hts_pos_t start, hts_pos_t end);
-static bcf_sr_regions_t *_regions_init_string(const char *str);
+static bcf_sr_regions_t *_regions_init_string(const char *str, bcf_hdr_t *h);
 static int _regions_match_alleles(bcf_sr_regions_t *reg, int als_idx, bcf1_t *rec);
 static void _regions_sort_and_merge(bcf_sr_regions_t *reg);
 static int _bcf_sr_regions_overlap(bcf_sr_regions_t *reg, const char *seq, hts_pos_t start, hts_pos_t end, int missed_reg_handler);
@@ -372,7 +372,7 @@ int bcf_sr_add_reader(bcf_srs_t *files, const char *fname)
         for (i=0; i<n; i++)
         {
             if ( !files->regions )
-                files->regions = _regions_init_string(names[i]);
+                files->regions = _regions_init_string(names[i], reader->header);
             else
                 _regions_add(files->regions, names[i], -1, -1);
         }
@@ -1035,7 +1035,7 @@ void _regions_sort_and_merge(bcf_sr_regions_t *reg)
 // Recognises regions in the form chr, chr:pos, chr:beg-end, chr:beg-, {weird-chr-name}:pos.
 // Cannot use hts_parse_region() as that requires the header and if header is not present,
 // wouldn't learn the chromosome name.
-static bcf_sr_regions_t *_regions_init_string(const char *str)
+static bcf_sr_regions_t *_regions_init_string(const char *str, bcf_hdr_t *h)
 {
     bcf_sr_regions_t *reg = (bcf_sr_regions_t *) calloc(1, sizeof(bcf_sr_regions_t));
     reg->start = reg->end = -1;
@@ -1060,8 +1060,28 @@ static bcf_sr_regions_t *_regions_init_string(const char *str)
         }
         else
         {
-            while ( *ep && *ep!=',' && *ep!=':' ) ep++;
-            kputsn(sp,ep-sp,&tmp);
+            do {
+                while ( *ep && *ep!=',' && *ep!=':' ) ep++;
+
+                if (!h)
+                    break;
+
+                // Incase we have contig names such as "foo:bar" then
+                // we don't just stop at the first colon.  We keep going until
+                // the prefix matches a known contig.  If we have ambiguity,
+                // eg "chr1" and "chr1:1000-2000" are both contig names, then
+                // we bail out early but my fail the subsequent parse.
+                // We could attempt a reverse search (right to left), but
+                // nothing will remove ambiguity.  Instead we require
+                // explicit "{chr1}:1000-2000" vs "{chr1:1000-2000}" instead.
+                kstring_t ks = KS_INITIALIZE;
+                if (kputsn(sp, ep-sp, &ks) < 0)
+                    goto exit_nicely;
+                int found = bcf_hdr_name2id(h, ks.s) >= 0;
+                free(ks.s);
+                if (found)
+                    break;
+            } while (*++ep);
         }
         if ( *ep==':' )
         {
@@ -1184,7 +1204,7 @@ bcf_sr_regions_t *bcf_sr_regions_init(const char *regions, int is_file, int ichr
     bcf_sr_regions_t *reg;
     if ( !is_file )
     {
-        reg = _regions_init_string(regions);
+        reg = _regions_init_string(regions, NULL);
         _regions_sort_and_merge(reg);
         return reg;
     }

Edit: can possibly use hts_parse_region with a suitably concocted query function, but htslib's vcf nor bcftools ever appears to use this so I haven't verified if it's viable. I assume we created it with the aim of bcftools doing such validity checking, but it was never deemed useful there?

jkbonfield commented 1 month ago

Also I just can't get bcftools concat to work full stop with -a. I think it may not work. An example:

@ seq4d[/tmp]; zcat _a.vcf.gz
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
foo 100 a   A   T   .   .   .

@ seq4d[/tmp]; zcat _b.vcf.gz
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
bar 500 b   G   C   .   .   .

@ seq4d[/tmp]; bcftools concat -a _[ab].vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
##bcftools_concatVersion=1.20-12-g18439d1d+htslib-1.20-8-g7576aca1-dirty
##bcftools_concatCommand=concat -a _a.vcf.gz _b.vcf.gz; Date=Tue May  7 15:08:27 2024
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
bar 500 b   G   C   .   .   .

Despite identical headers and data for two separate chromosomes (the exact case in the usage statement it claims to be written for), I only get the 2nd chromosome output. This doesn't happen if I remove -a though.

@ seq4d[/tmp]; bcftools concat _[ab].vcf.gz
Checking the headers and starting positions of 2 files
Concatenating _a.vcf.gz 0.000361 seconds
Concatenating _b.vcf.gz 0.000355 seconds
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
##bcftools_concatVersion=1.20-12-g18439d1d+htslib-1.20-8-g7576aca1-dirty
##bcftools_concatCommand=concat _a.vcf.gz _b.vcf.gz; Date=Tue May  7 15:08:31 2024
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
foo 100 a   A   T   .   .   .
bar 500 b   G   C   .   .   .

@pd3 can you confirm this is a bug, or is it just confusing documentation / usage? (Ie user error) If it's a bug I can move this comment to its own issue.

jkbonfield commented 1 month ago

Ignore that last comment. It's some strangeness of pos 0 vs pos -1 when doing all-contig ranges. I don't understand the code well enough to know why it matters.

Anyway, I have a PR incoming that will use hts_parse_regions to resolve this, with some shenanigans around positions.

jmarshall commented 1 month ago

I was however not sure that this is the intended behavior for bcftools.

Like I said, the behaviour is clearly a bug. The particular bug is fixed by samtools/htslib#1781. James says he has fixed it a different way by improving the code to use the headers where available ; probably both PRs should be applied.

ETA :— As James's commit only affects the bad _regions_init_string() call inside bcf_sr_add_reader(), it's probably subsumed by my PR.

All this code exists because the synced reader code does not assume the presence of contig-defining headers so cannot use hts_parse_region(), and the bcftools maintainers did not prioritise refactoring this for the benefit of colon-using contigs. It's been discussed before in the bcftools context, e.g. in samtools/htslib#1620 and #1797. James's change to make _regions_init_string() optionally use hts_parse_region() if provided with headers would be useful for the call within bcf_sr_regions_init() (which is the only call remaining after my PR) — it would then essentially be implementing the last paragraph of https://github.com/samtools/htslib/issues/1620#issuecomment-1556999904, which would be :tada:.

jkbonfield commented 1 month ago

Correct. My change modified bcf_sr_regions_init so it could take a header and when given a header it used hts_parse_region to parse it correctly instead of breaking on colons. As bcf_sr_regions_init is static I'm permitted to change the API this way.

It had two calls, one was in bcf_sr_add_reader and you correctly identified that the call to bcf_sr_regions_init was a bug and you removed it. The other is in bcf_sr_regions_init. However only the former (now defunct) had access to the header, meaning the change I made, while correct, can never have any benefit. We can't fix bcf_sr_regions_init as it's an external API, so it'll just have to remain broken unless we create a non-buggy bcf_sr_regions_init2 function that has the header as an argument.

My code is in https://github.com/jkbonfield/htslib/commit/34d6a6dea01337348b43d294d1dfbef107f9e56a for posterity, incase we ever find a use for this internal function elsewhere, but I don't think it's worth fixing it now.

For reference, it looks like only vcfstats and vcfannotate (of our code) calls bcf_sr_regions_init anyway, so the colon buggyness is limited in scope.