samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Add support for non-standard chromosome names containing [:-] characters #1630

Closed pd3 closed 1 year ago

pd3 commented 1 year ago

Note hts_parse_region() cannot be used because it requires the header and without the header the caller does not learn the contig name.

Resolves #1620

jkbonfield commented 1 year ago

Apparently I had misremembered the public hts_parse_region API as having the header as an optional field. Pity :/

I'm inclined to think that if we have a requirement to parse regions without a header being present, then perhaps this API should be modified to cope with it. My preference would be to avoid code duplication where possible, but I haven't really looked in detail to know how much work is involved and whether we just replace duplication in one place with it in another (eg a separate code-path for headerless parsing).

jkbonfield commented 1 year ago

Can you give me an example of how to trigger this? I'm unable to do so, so cannot test the fix.

I tried modifying test/test-bcf-sr.pl as follows:

ubuntu@jkb-dev-pmu:~/htslib$ git diff
diff --git a/htscodecs b/htscodecs
--- a/htscodecs
+++ b/htscodecs
@@ -1 +1 @@
-Subproject commit d4aed585929e2dab9dd8e6a2b74484dfc347c0f2
+Subproject commit d4aed585929e2dab9dd8e6a2b74484dfc347c0f2-dirty
diff --git a/test/test-bcf-sr.pl b/test/test-bcf-sr.pl
index cd9859c..f65d120 100755
--- a/test/test-bcf-sr.pl
+++ b/test/test-bcf-sr.pl
@@ -119,7 +119,7 @@ sub save_vcf
     open(my $fh,"| $FindBin::Bin/../bgzip -c > $fname") or error("$FindBin::Bin/../bgzip -c > $fname: !");
     print $fh qq[##fileformat=VCFv4.3\n];
     print $fh qq[##FILTER=<ID=PASS,Description="All filters passed">\n];
-    print $fh qq[##contig=<ID=1>\n];
+    print $fh qq[##contig=<ID=1:2>\n];
     print $fh qq[##contig=<ID=2>\n];
     print $fh '#'. join("\t", qw(CHROM POS ID  REF ALT QUAL    FILTER  INFO))."\n";
     for my $var (@$vars)
@@ -133,7 +133,7 @@ sub save_vcf
             $ref = $xref;
             push @alts,$alt;
         }
-        print $fh join("\t", (1,100,'.',$ref,join(',',@alts),'.','.','.'))."\n";
+        print $fh join("\t", ("1:2",100,'.',$ref,join(',',@alts),'.','.','.'))."\n";
     }
     for my $var (@$vars)
     {
@@ -146,7 +146,7 @@ sub save_vcf
             $ref = $xref;
             push @alts,$alt;
         }
-        print $fh join("\t", (1,300,'.',$ref,join(',',@alts),'.','.','.'))."\n";
+        print $fh join("\t", ("1:2",300,'.',$ref,join(',',@alts),'.','.','.'))."\n";
     }
     for my $var (@$vars)
     {

So my files look like this:

ubuntu@jkb-dev-pmu:~/htslib$ zcat /tmp/a/1.vcf.gz
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=1:2>
##contig=<ID=2>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1:2 100 .   A   T   .   .   .
1:2 100 .   A   G,T .   .   .
1:2 300 .   A   T   .   .   .
1:2 300 .   A   G,T .   .   .
2   100 .   A   T   .   .   .
2   100 .   A   G,T .   .   .

Running ./test/test-bcf-sr.pl -t /tmp/a -s 0 works still. Although I see lots of errors in the output files, but it returns 0 exit status. The original also produced the same errors and also succeeded. I don't understand what's happening here in the tests. Are the failures "expected failures" and hence passes? I'm assuming so, as most checks don't redirect stderr, so the ones that do are presumably expecting an error and deliberately silencing it.

[Edit: yes, I see it now in "Failed to detect $badness: $cmd". It's specifically checking on duff input to make sure it doesn't pass. That's good. :-) ]

Is it simply that test-bcf-sr.c doesn't test the region query part of the API, so colons in names never crops up as an issue? I don't particularly like having the test for this function in bcftools, as it means we can't do CI for this code.

pd3 commented 1 year ago

I just checked the code and can confirm that test-bcf-sr does not test indexes at all. That's done properly only through the bcftools test suite, and there are hundreds of them. If you want to test it natively in htslib, best might be to write a simplified version of vcfview.c. Tests for this pull requests are https://github.com/samtools/bcftools/pull/1938/files.

daviesrob commented 1 year ago

Merge commit removed...