samtools / htslib

C library for high-throughput sequencing data formats
Other
784 stars 447 forks source link

changes to support multiple files #1658

Closed vasudeva8 closed 9 months ago

vasudeva8 commented 11 months ago

This PR contains the changes for #1642.

Multiple files are supported and given operation is performed on all inputs, except with rebgzip. Returns failure when multiple inputs are given with -I option. Returns failure with rebgzip and index used together. -b / -s options, which requires index file, work with single input when index is explicitly providing using -I. -b / -s can work with multiple inputs if implicit indexes, i.e. \<compressed filename.gzi>, are in use (i.e. no -I option and all having their implicit index file along with it).

Corrected the error message with rebgzip invocation without -I option.

hfile.c/h, bgzf.c updated to add new methods which doesn't close the stdin/out, to support multi file processing.

jkbonfield commented 11 months ago

This feels overly complicated, but some of the changes are also down to improved error detection and identification of clashing options, such as region querying or rebgzip.

However the bulk of the functionality could be as simple as a do while loop on optind:

diff --git a/bgzip.c b/bgzip.c
index 589f79f6..fe6eacb0 100644
--- a/bgzip.c
+++ b/bgzip.c
@@ -188,6 +188,7 @@ int main(int argc, char **argv)
         fprintf(stderr, "[bgzip] Illegal region: [%ld, %ld]\n", start, end);
         return 1;
     }
+    do {
     if (compress == 1) {
         hFILE* f_src = NULL;
         char out_mode[3] = "w\0";
@@ -369,7 +370,6 @@ int main(int argc, char **argv)
             error("Input close failed\n");
         if (argc > optind && !pstdout && !keep) unlink(argv[optind]);
         free(buffer);
-        return 0;
     }
     else if ( reindex )
     {
@@ -401,7 +401,6 @@ int main(int argc, char **argv)
         }

         if ( bgzf_close(fp)<0 ) error("Close failed: Error %d\n",fp->errcode);
-        return 0;
     }
     else
     {
@@ -494,6 +493,7 @@ int main(int argc, char **argv)
 #ifdef _WIN32
         _setmode(f_dst, O_BINARY);
 #endif
+        long start_reg = start, end_reg = end;
         while (1) {
             if (end < 0) c = bgzf_read(fp, buffer, WINDOW_SIZE);
             else c = bgzf_read(fp, buffer, (end - start > WINDOW_SIZE)? WINDOW_SIZE:(end - start));
@@ -508,9 +508,12 @@ int main(int argc, char **argv)
             }
             if (end >= 0 && start >= end) break;
         }
+        start = start_reg;
+        end = end_reg;
         free(buffer);
         if (bgzf_close(fp) < 0) error("Close failed: Error %d\n",fp->errcode);
         if (argc > optind && !pstdout && !test && !keep) unlink(argv[optind]);
-        return 0;
     }
+    } while (++optind < argc);
+    return 0;
 }

Obviously that needs better handling of the rebgzip case and it's probably better to have a pair of do-while loops handling compress and decompress separately, but I'm not entirely sure the large scale changes are justified by this feature request. That said, some code restructuring isn't a bad idea given the length of main. It just makes it a little trickier to work out what's changed and whether all of the functionality has been kept.

Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename. (We can add an implicit "-" in the case of no args and stdin isn't a tty.) Opening it in main and having custom code for it feels a bit strange and very tricky to follow. What am I missing here?

Is it the ./bgzip -cd _a.gz _b.gz case where we wish to output two files to the same stream (which my naive do-while didn't handle I expect)? If so we can probably resolve this by only closing stdout hts streams at the end of the program, so we can concatenate outputs.

jmarshall commented 11 months ago

Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename.

hopen_fd_stdinout() just creates an hFILE_fd with an fd of 0 or 1. Hence in = hopen("-", "r") … hclose(in) … in2 = hopen("-", "r") would fail because fd 0 is no longer open or no longer refers to the original stdin. This could be complexified by having it use dup() instead of literally STDIN_FILENO (0) or STDOUT_FILENO (1) so hclose() wouldn't close the only reference to the original stdin/stdout stream, but really this has caused no significant problems in all the years since it was implemented so far — largely because reading two BAM files from stdin or writing two BAM files to stdout just isn't a useful operation.

So if there are useful bgzip operations involving reading multiple streams from stdin (probably not) or writing multiple streams to stdout (maybe), they are currently going to have to do something special for stdin/stdout. Even if that is only not actually calling hclose() except the last time.

The code is now more complicated than I was expecting too. Perhaps all the casting could be avoided by separating operate() into separate functions for the different modes?

vasudeva8 commented 10 months ago

Updated to avoid the complexities, uses the do..while framework and uses dup to open/close stdin/out as required. -I is supported with multiple inputs as well - as we do rebgzip!

jmarshall commented 10 months ago

Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename.

Upon further reflection, I think stdin/stdout should just be dealt with by hts_open/hts_close automatically just as any other filename, and @vasudeva8 should be able to repeatedly hts_open and hts_close stdin and stdout just as for any other filename.

PR #1665 modifies hopen_fd_stdinout() to enable that. If that approach is accepted by the maintainers, this PR can be rebased onto that once merged, and all the dup() gymnastics can be removed — which should enable some simplification of the code.

vasudeva8 commented 10 months ago

Updated and uses new shared stdin/out hfile. Uses the do/while as earlier and explicitly closes stdout at the end. Moved a few validations to ensure they are made before output file update.

jkbonfield commented 9 months ago

Also, this sounded on paper like a trivial issue to resolve, but the myriad of interaction between all the command line arguments shows it's far trickier than I expected! Good going on navigating the minefield, especially fixing the long-standing issues caused by said minefield. :)

vasudeva8 commented 9 months ago

Updated to avoid explicit index file with multiple input files, during index and reindex. These options are not considered during decompress/read operation.