samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
634 stars 241 forks source link

Make --write-index=tbi work. #2136

Closed jkbonfield closed 3 months ago

jkbonfield commented 3 months ago

This adds a new function init_index2 with an additional index format field, which is set by another new function write_index_parse. I could have modified the existing init_index function, but I was unsure if this is meant to be a public API or not. Eg do people write their own plugins and use this? If not, we could just rename it back.

The PR makes --write-index take an optional --write-index[=FMT] syntax, eg --write-index=tbi. As this is a long option, the optional argument is easy to define. (They don't work well on short opts, being space concious.) The default of not specified =FMT is whatever the tool did before. This is almost always CSI, except for isec which is TBI if VCF. I don't know why that single command has a different default, but I didn't want to second guess the logic so I left it unchanged.

I tested all subcommands capable of writing indices, including plugins (although just fill-tags for the generic writing in plugin.c as it's shared by many). I also fixed a number of bugs, not all index related.

Fixes #2008

pd3 commented 3 months ago

Thank you for this! Can you update the documentation as well please, doc/bcftools.txt?

jkbonfield commented 3 months ago

So this change would, for example, allow -W=tbi to work as a short option. It's not really the way short options are meant to be used with optional arguments however, but I feel it's less likely to be a source of confusion and questions than going with the official nomenclature of -Wtbi (which still works).

It simplifies the usage too, as we wouldn't need to split it into -W[FMT], --write-index[=FMT].

It appears the frequent strategy though, if we want both long and short opts, is normally not to make them have optional args and have two options. Eg -W, --write-index plus --index-fmt FMT.

Thoughts?

diff --git a/vcfview.c b/vcfview.c
index c53ca1d0..43845af7 100644
--- a/vcfview.c
+++ b/vcfview.c
@@ -550,7 +550,7 @@ static void usage(args_t *args)
     fprintf(stderr, "    -u/U, --uncalled/--exclude-uncalled    Select/exclude sites without a called genotype\n");
     fprintf(stderr, "    -v/V, --types/--exclude-types LIST     Select/exclude comma-separated list of variant types: snps,indels,mnps,ref,bnd,other [null]\n");
     fprintf(stderr, "    -x/X, --private/--exclude-private      Select/exclude sites where the non-reference alleles are exclusive (private) to the subset samples\n");
-    fprintf(stderr, "          --write-index[=FMT]              Automatically index the output files [off]\n");
+    fprintf(stderr, "    -W,   --write-index[=FMT]              Automatically index the output files [off]\n");
     fprintf(stderr, "\n");
     exit(1);
 }
@@ -617,11 +617,11 @@ int main_vcfview(int argc, char *argv[])
         {"phased",no_argument,NULL,'p'},
         {"exclude-phased",no_argument,NULL,'P'},
         {"no-version",no_argument,NULL,8},
-        {"write-index",optional_argument,NULL,10},
+        {"write-index",optional_argument,NULL,'W'},
         {NULL,0,NULL,0}
     };
     char *tmp;
-    while ((c = getopt_long(argc, argv, "l:t:T:r:R:o:O:s:S:Gf:knv:V:m:M:aAuUhHc:C:Ii:e:xXpPq:Q:g:",loptions,NULL)) >= 0)
+    while ((c = getopt_long(argc, argv, "l:t:T:r:R:o:O:s:S:Gf:knv:V:m:M:aAuUhHc:C:Ii:e:xXpPq:Q:g:W::",loptions,NULL)) >= 0)
     {
         char allele_type[9] = "nref";
         switch (c)
@@ -750,7 +750,7 @@ int main_vcfview(int argc, char *argv[])
                 break;
             case  9 : args->n_threads = strtol(optarg, 0, 0); break;
             case  8 : args->record_cmd_line = 0; break;
-            case 10 :
+            case 'W' :
                 if (!(args->write_index = write_index_parse(optarg)))
                     error("Unsupported index format '%s'\n", optarg);
                 break;
diff --git a/version.c b/version.c
index ea112f60..cf2c6d51 100644
--- a/version.c
+++ b/version.c
@@ -119,9 +119,9 @@ int write_index_parse(char *arg) {
     int fmt = HTS_FMT_CSI;

     if (arg) {
-        if (strcmp(arg, "csi") == 0)
+        if (strcmp(arg, "csi") == 0 || strcmp(arg, "=csi") == 0)
             fmt = HTS_FMT_CSI;
-        else if (strcmp(arg, "tbi") == 0)
+        else if (strcmp(arg, "tbi") == 0 || strcmp(arg, "=tbi") == 0)
             fmt = HTS_FMT_TBI;
         else
             return 0;
jkbonfield commented 3 months ago

Ok I added -W in there too.

I'm unsure if we want to support -W=tbi as well as -Wtbi, but personally I dislike the latter and prefer the former as it shows consistency with long options and is distinct from the mandatory option versions. Either way, it's a trivial couple lines to tweak in version.c so I'll let you decide (and a whole raft of search and replace in the docs).

Also fixed a couple other doc issues in the process.

Note I didn't change the usage statements to be more verbose. IMO it doesn't need it, and I dislike huge long lines (some of them are 160 characters, which breaks most unix conventions). I don't think you'd like me line-wrapping them either, as we've had that discussion in the past. So again, if you want that doing, I'll let you make that editorial decision at merge.

mbhall88 commented 3 months ago

It appears the frequent strategy though, if we want both long and short opts, is normally not to make them have optional args and have two options. Eg -W, --write-index plus --index-fmt FMT.

My two cents would be to prefer this approach. But I appreciate my two cents should have very low market value. Just providing a user perspective...

pd3 commented 3 months ago

It appears the frequent strategy though, if we want both long and short opts, is normally not to make them have optional args and have two options. Eg -W, --write-index plus --index-fmt FMT.

Optional arguments are unexplored territory in bcftools and I don't know how likely they are to cause problems in practice. Adding the new --index-fmt option seems like a more conservative and safer approach, but I don't insist.