Closed groodri closed 1 month ago
While colons within chromosome names are legal within the VCF specification, this was a huge mistake which sadly we cannot undo. They create ambiguities in parsing.
We've seen this before, where people create a contig name "chr1" and then another "chr1:1000-2000" to indicate a portion of that chromosome. Only now how do you specify the range of chr1 that happens to cover 1000-2000? Arguably it's a tooling issue, as the specifications don't define language like "chr1:1000-2000" to mean a sub-range, but the only reason people make chromosomes named like that is specifically because that's the encoding pattern used by tooling. If tools had used "=" instead of ":" then almost certainly they'd get used in chromosome names instead.
There are some mitigations for this in htslib, where we added the ability to do {chr1}:1000-2000 as a method for explicitly stating the chromosome name is the "chr1" portion, but it's not honoured everywhere and it's specific to command line arguments. We obviously cannot change the specification of things like BED.
In short, yes this is probably a bug, but please consider choosing chromosome names that aren't ambiguous in meaning to downstream tools.
While colons within chromosome names are legal within the VCF specification, this was a huge mistake which sadly we cannot undo. They create ambiguities in parsing.
I do not see this as a mistake — once communities such as HLA had standardised on allele names containing colons and with users likely to use them as contigs, it became a requirement for formats such as VCF to enable this usage. I am unaware of any ambiguities in parsing VCF due to colons in chromosome names. It may create difficulties for tools, but if you have a VCF parsing ambiguity in mind please show an example.
It would be a lot easier to investigate this if @groodri provided an actual example of the problem. I attempted to reproduce the problem using a couple of trivial VCF files:
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:1000>
#CHROM POS ID REF ALT QUAL FILTER INFO
JAK2::chr1:1000 100 a A T . . .
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:2000>
#CHROM POS ID REF ALT QUAL FILTER INFO
JAK2::chr1:2000 500 b G C . . .
Compressing and indexing them and running the command shown produces:
$ bcftools concat -a ib2179a.vcf.gz ib2179b.vcf.gz
Checking the headers and starting positions of 2 files
[E::_regions_init_string] Could not parse the region(s): JAK2::chr1:1000
[E::_regions_init_string] Could not parse the region(s): JAK2::chr1:2000
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:1000>
##contig=<ID=JAK2::chr1:2000>
##bcftools_concatVersion=1.20-2-gad818b15+htslib-1.20-2-g18af5c1
##bcftools_concatCommand=concat -a ib2179a.vcf.gz ib2179b.vcf.gz; Date=Tue May 7 20:28:45 2024
#CHROM POS ID REF ALT QUAL FILTER INFO
$
Thus the Could not parse the region(s)
message has been produced as a warning and bcftools has output a VCF file containing no records. As the …
in the ##contig=<ID=…>
lines unambiguously denotes a contig name (and there are no region strings anywhere near ID in the grammar, such as it is), passing this to _regions_init_string()
as is is clearly a bcftools (or htslib) bug. Whatever the code is trying to do, it needs to do it by calling a dumber function that accepts only a contig name or by adding {…}
to its call, e.g., so that it calls _regions_init_string("{JAK2::chr1:1000}")
for the ##contig
line from the first file.
In the meantime this can be worked around by adding an appropriate -r
option to the command so that it doesn't go running _regions_init_string()
on raw contig headers. For this simple example that looks like
$ bcftools concat -r '{JAK2::chr1:1000},{JAK2::chr1:2000}' -a ib2179a.vcf.gz ib2179b.vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=JAK2::chr1:1000>
##contig=<ID=JAK2::chr1:2000>
##bcftools_concatVersion=1.20-2-gad818b15+htslib-1.20-2-g18af5c1
##bcftools_concatCommand=concat -r {JAK2::chr1:1000},{JAK2::chr1:2000} -a ib2179a.vcf.gz ib2179b.vcf.gz; Date=Tue May 7 20:46:53 2024
#CHROM POS ID REF ALT QUAL FILTER INFO
JAK2::chr1:1000 100 a A T . . .
JAK2::chr1:2000 500 b G C . . .
$
which I assume is the sort of output the OP was expecting.
My point was one of command line parsing. Bcftools for example permits bcftools view foo.vcf.gz chr1:1000-2000
as a valid syntax. If we have chromosomes named both "chr1" and "chr1:1000-2000" then we're in trouble, and have to start disambiguating based on the presences of contig lines, which aren't even a requirement sadly. (Although they presumably are for indexing of vcf.)
You're right that our hands were forced by HLA, but I still think it was a bad decision right from day one to use colon as a meaning for sub-regions and to not ban it in chromosome names. Life could have been so much simpler, but that's water under the bridge.
To put into context, my use case is amplicon sequencing. Because we are only interested in specific amplicon sequences, it makes more sense computationally to use the amplicon sequences as reference for variant calling (e.g. JAK2::chr1:1000-2000
could be an amplicon in my reference FASTA) then, say, the entire human genome. However, these regions of interest need to stay interpretable in order to be correctly annotated, so that found variants can be evaluated for their importance -- that is, they need to be translated into their locations in the human genome.
Thus, I have VCF files that look like this:
#CHROM POS ID REF ALT QUAL FILTER INFO
JAK2::chr1:1000-2000 100 a A T . . .
Where called variants are positioned on the amplicon, which itself is named after the gene and region in the human genome.
I understand there are multiple ways around this (and thank you for your suggestion @jmarshall -- that is indeed the output I was expecting). I was however not sure that this is the intended behavior for bcftools. Particularly, because of the fact that bcftools throws this error as a warning, so exit code is 0. Because exit code is 0, this makes it difficult to integrate this tooling into analysis pipelines with e.g. Nextflow, as we cannot directly test if a VCF file has appropriate contig names for bcftools. If allowing colons in contig names is not suitable, then would it not be better to throw an error, as opposed to a warning?
My point was one of command line parsing. Bcftools for example permits
bcftools view foo.vcf.gz chr1:1000-2000
as a valid syntax. If we have chromosomes named both "chr1" and "chr1:1000-2000" then we're in trouble, and have to start disambiguating based on the presences of contig lines, which aren't even a requirement sadly. (Although they presumably are for indexing of vcf.)
Infact a proof of this being a poor choice is in the code that implements it:
https://github.com/samtools/htslib/blob/develop/synced_bcf_reader.c#L1034-L1038
// File name or a list of genomic locations. If file name, NULL is returned. // Recognises regions in the form chr, chr:pos, chr:beg-end, chr:beg-, {weird-chr-name}:pos. // Cannot use hts_parse_region() as that requires the header and if header is not present, // wouldn't learn the chromosome name. static bcf_sr_regions_t _regions_init_string(const char str)
Basically it's stuffed. It accepts chr, chr:pos and chr:beg-end, but cannot disambiguate "chr:beg-end" being a poorly chosen chromosome name from a region of "chr". Basically it needs a code rewrite to include the header so it can then sanitize against the contig lines and figure out which bits of the string represents the chromosome and which may represent a region.
I think our previous discussions on this were purely from a VCF specification viewpoint on whether or not it was technically parseable within the format, but we never discussed whether it was sensible.
It can be solved in code, but I wonder how many other places the same bugs occur?
diff --git a/synced_bcf_reader.c b/synced_bcf_reader.c
index a43ab15a..de88024a 100644
--- a/synced_bcf_reader.c
+++ b/synced_bcf_reader.c
@@ -72,7 +72,7 @@ typedef struct
aux_t;
static int _regions_add(bcf_sr_regions_t *reg, const char *chr, hts_pos_t start, hts_pos_t end);
-static bcf_sr_regions_t *_regions_init_string(const char *str);
+static bcf_sr_regions_t *_regions_init_string(const char *str, bcf_hdr_t *h);
static int _regions_match_alleles(bcf_sr_regions_t *reg, int als_idx, bcf1_t *rec);
static void _regions_sort_and_merge(bcf_sr_regions_t *reg);
static int _bcf_sr_regions_overlap(bcf_sr_regions_t *reg, const char *seq, hts_pos_t start, hts_pos_t end, int missed_reg_handler);
@@ -372,7 +372,7 @@ int bcf_sr_add_reader(bcf_srs_t *files, const char *fname)
for (i=0; i<n; i++)
{
if ( !files->regions )
- files->regions = _regions_init_string(names[i]);
+ files->regions = _regions_init_string(names[i], reader->header);
else
_regions_add(files->regions, names[i], -1, -1);
}
@@ -1035,7 +1035,7 @@ void _regions_sort_and_merge(bcf_sr_regions_t *reg)
// Recognises regions in the form chr, chr:pos, chr:beg-end, chr:beg-, {weird-chr-name}:pos.
// Cannot use hts_parse_region() as that requires the header and if header is not present,
// wouldn't learn the chromosome name.
-static bcf_sr_regions_t *_regions_init_string(const char *str)
+static bcf_sr_regions_t *_regions_init_string(const char *str, bcf_hdr_t *h)
{
bcf_sr_regions_t *reg = (bcf_sr_regions_t *) calloc(1, sizeof(bcf_sr_regions_t));
reg->start = reg->end = -1;
@@ -1060,8 +1060,28 @@ static bcf_sr_regions_t *_regions_init_string(const char *str)
}
else
{
- while ( *ep && *ep!=',' && *ep!=':' ) ep++;
- kputsn(sp,ep-sp,&tmp);
+ do {
+ while ( *ep && *ep!=',' && *ep!=':' ) ep++;
+
+ if (!h)
+ break;
+
+ // Incase we have contig names such as "foo:bar" then
+ // we don't just stop at the first colon. We keep going until
+ // the prefix matches a known contig. If we have ambiguity,
+ // eg "chr1" and "chr1:1000-2000" are both contig names, then
+ // we bail out early but my fail the subsequent parse.
+ // We could attempt a reverse search (right to left), but
+ // nothing will remove ambiguity. Instead we require
+ // explicit "{chr1}:1000-2000" vs "{chr1:1000-2000}" instead.
+ kstring_t ks = KS_INITIALIZE;
+ if (kputsn(sp, ep-sp, &ks) < 0)
+ goto exit_nicely;
+ int found = bcf_hdr_name2id(h, ks.s) >= 0;
+ free(ks.s);
+ if (found)
+ break;
+ } while (*++ep);
}
if ( *ep==':' )
{
@@ -1184,7 +1204,7 @@ bcf_sr_regions_t *bcf_sr_regions_init(const char *regions, int is_file, int ichr
bcf_sr_regions_t *reg;
if ( !is_file )
{
- reg = _regions_init_string(regions);
+ reg = _regions_init_string(regions, NULL);
_regions_sort_and_merge(reg);
return reg;
}
Edit: can possibly use hts_parse_region
with a suitably concocted query function, but htslib's vcf nor bcftools ever appears to use this so I haven't verified if it's viable. I assume we created it with the aim of bcftools doing such validity checking, but it was never deemed useful there?
Also I just can't get bcftools concat to work full stop with -a. I think it may not work. An example:
@ seq4d[/tmp]; zcat _a.vcf.gz
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
#CHROM POS ID REF ALT QUAL FILTER INFO
foo 100 a A T . . .
@ seq4d[/tmp]; zcat _b.vcf.gz
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
#CHROM POS ID REF ALT QUAL FILTER INFO
bar 500 b G C . . .
@ seq4d[/tmp]; bcftools concat -a _[ab].vcf.gz
Checking the headers and starting positions of 2 files
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
##bcftools_concatVersion=1.20-12-g18439d1d+htslib-1.20-8-g7576aca1-dirty
##bcftools_concatCommand=concat -a _a.vcf.gz _b.vcf.gz; Date=Tue May 7 15:08:27 2024
#CHROM POS ID REF ALT QUAL FILTER INFO
bar 500 b G C . . .
Despite identical headers and data for two separate chromosomes (the exact case in the usage statement it claims to be written for), I only get the 2nd chromosome output. This doesn't happen if I remove -a
though.
@ seq4d[/tmp]; bcftools concat _[ab].vcf.gz
Checking the headers and starting positions of 2 files
Concatenating _a.vcf.gz 0.000361 seconds
Concatenating _b.vcf.gz 0.000355 seconds
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=foo>
##contig=<ID=bar>
##bcftools_concatVersion=1.20-12-g18439d1d+htslib-1.20-8-g7576aca1-dirty
##bcftools_concatCommand=concat _a.vcf.gz _b.vcf.gz; Date=Tue May 7 15:08:31 2024
#CHROM POS ID REF ALT QUAL FILTER INFO
foo 100 a A T . . .
bar 500 b G C . . .
@pd3 can you confirm this is a bug, or is it just confusing documentation / usage? (Ie user error) If it's a bug I can move this comment to its own issue.
Ignore that last comment. It's some strangeness of pos 0 vs pos -1 when doing all-contig ranges. I don't understand the code well enough to know why it matters.
Anyway, I have a PR incoming that will use hts_parse_regions
to resolve this, with some shenanigans around positions.
I was however not sure that this is the intended behavior for bcftools.
Like I said, the behaviour is clearly a bug. The particular bug is fixed by samtools/htslib#1781. James says he has fixed it a different way by improving the code to use the headers where available ; probably both PRs should be applied.
ETA :— As James's commit only affects the bad _regions_init_string()
call inside bcf_sr_add_reader()
, it's probably subsumed by my PR.
All this code exists because the synced reader code does not assume the presence of contig-defining headers so cannot use hts_parse_region()
, and the bcftools maintainers did not prioritise refactoring this for the benefit of colon-using contigs. It's been discussed before in the bcftools context, e.g. in samtools/htslib#1620 and #1797. James's change to make _regions_init_string()
optionally use hts_parse_region()
if provided with headers would be useful for the call within bcf_sr_regions_init()
(which is the only call remaining after my PR) — it would then essentially be implementing the last paragraph of https://github.com/samtools/htslib/issues/1620#issuecomment-1556999904, which would be :tada:.
Correct. My change modified bcf_sr_regions_init
so it could take a header and when given a header it used hts_parse_region
to parse it correctly instead of breaking on colons. As bcf_sr_regions_init
is static I'm permitted to change the API this way.
It had two calls, one was in bcf_sr_add_reader
and you correctly identified that the call to bcf_sr_regions_init
was a bug and you removed it. The other is in bcf_sr_regions_init
. However only the former (now defunct) had access to the header, meaning the change I made, while correct, can never have any benefit. We can't fix bcf_sr_regions_init
as it's an external API, so it'll just have to remain broken unless we create a non-buggy bcf_sr_regions_init2
function that has the header as an argument.
My code is in https://github.com/jkbonfield/htslib/commit/34d6a6dea01337348b43d294d1dfbef107f9e56a for posterity, incase we ever find a use for this internal function elsewhere, but I don't think it's worth fixing it now.
For reference, it looks like only vcfstats and vcfannotate (of our code) calls bcf_sr_regions_init
anyway, so the colon buggyness is limited in scope.
Note: Specific regions were substituted by
<gene>
,<chrom>
and<pos>
to hide sensitive information. Real usage did not contain brackets.After combining region of interest locations with
bedtools
to build a new reference FASTA:And then doing variant calling with this new reference, we obtain a chromosome (VCF column
#CHROM
) name structure like so:<gene>::<chrom>:<pos>
Upon using
bcftools concat
to merge VCF files:This structure throws an error:
However, as of VCFv4.3 and later this structure should be legal. See the official VCFv4.3 changelog:
This was tested on
bcftools=1.20
, installed from source code, and previouslybcftools=1.15.1
, installed on conda.I tried to confirm which
HTSlib
version was being compiled, but I couldn't figure it out. Does the current release ofbcftools
support VCFv4.3 and later?