samtools / htslib

C library for high-throughput sequencing data formats

Avoid really closing stdin/stdout in hclose()/hts_close()/et al #1665

Closed · jmarshall closed this 1 year ago

jmarshall commented 1 year ago

Revisiting the stdin/stdout/dup conversation from #1658:

Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename.

hopen_fd_stdinout() just creates an hFILE_fd with an fd of 0 or 1. Hence in = hopen("-", "r") … hclose(in) … in2 = hopen("-", "r") would fail because fd 0 is no longer open or no longer refers to the original stdin. […] really this has caused no significant problems in all the years since it was implemented so far — largely because reading two BAM files from stdin or writing two BAM files to stdout just isn't a useful operation.

Reconsidering this… In fact it has caused some problems over the years: htsfile.c needed some circumlocutions because it too wants to write several SAM/VCF (i.e. textual) htsFile streams to stdout, and pysam has had trouble due to stdout being closed by samtools and bcftools. Moreover, as stdin and stdout are already open and hence are not opened by hts_open/hopen, it's not really morally right for them to be closed by hts_close/hclose.

Hence this pull request, which adds a mode option letter to hdopen() to signal that the fd should not be closed by hclose(), and uses it in hopen_fd_stdinout() which underlies hopen("-", …).

This enables repeated hopen("-") / hclose() / hopen("-") where previously the underlying STDIN/OUT_FILENO would have been prematurely closed, and means that the linked bgzip.c pull request will not need to treat "-" specially.
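
For illustration, here is a minimal sketch of the reopen pattern this enables, using the public hFILE API (hopen/hread/hclose); it assumes an HTSlib containing this change and is not code from the PR itself:

#include <stdio.h>
#include <htslib/hfile.h>

int main(void) {
    char buf[4096];
    hFILE *in = hopen("-", "r");
    if (in == NULL) { perror("hopen(\"-\")"); return 1; }
    while (hread(in, buf, sizeof buf) > 0) { }   /* drain the first stream */
    if (hclose(in) != 0) perror("hclose");

    hFILE *in2 = hopen("-", "r");   /* fd 0 has not really been closed, so this now succeeds */
    if (in2 == NULL) { perror("reopening \"-\""); return 1; }
    if (hclose(in2) != 0) perror("hclose");
    return 0;
}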

This also means that stdout is never really closed and hclose()'s return value does not reflect closing the underlying fd. Hence particularly paranoid programs that have written significant data to stdout will want to close and check it themselves just before the end of main():

if (fclose(stdout) != 0 && errno != EBADF) perror("closing stdout");

(Ignoring EBADF as stdout may already have been closed and checked, e.g. if the program has been linked to an earlier HTSlib where hclose() still closes STDOUT_FILENO.)
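
Spelled out, the end-of-main check described above might look like this (a sketch, not the code added to htsfile.c):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* ... open htsFile handles on "-", write output, hts_close() them ... */

    /* Delayed write errors on stdout surface here; EBADF is tolerated in
       case the fd was already closed and checked elsewhere. */
    if (fclose(stdout) != 0 && errno != EBADF) {
        perror("closing stdout");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}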

The second commit simplifies htsfile.c's file opening and adds a final fclose(stdout) check accordingly.

jmarshall commented 1 year ago

As noted on the previous htslib PR:

In particular, there are some I/O errors that are not reported by the kernel until you call close(2). So to fully check for errors in some-htslib-using-command args > output.bam you need to check the return value of hclose() and it needs to reflect the return value of close(2) even for stdout. Otherwise you risk undetected corruption in output.bam.

Alternatively, you need to explicitly close(STDOUT_FILENO) (or an equivalent) separately from hclose(), as described in the initial comment and demonstrated for htsfile.c. See also samtools/samtools#1909 which adds such a check for samtools. If this approach is accepted, I will prepare a similar PR for bcftools.
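
As a sketch of that alternative (not the actual samtools/bcftools code), closing the descriptor itself rather than the stdio stream might look like:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Flush stdio's buffers first, then close STDOUT_FILENO and check the
   result, tolerating EBADF in case it has already been closed. */
static int check_stdout_close(void) {
    if (fflush(stdout) != 0) return -1;
    if (close(STDOUT_FILENO) < 0 && errno != EBADF) return -1;
    return 0;
}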

jmarshall commented 1 year ago

Having hclose() et al not actually close stdout, and concentrating the real close in one place at the end of samtools' and bcftools' main() functions, will make life easier for pysam too.

(However I have a plan to rewrite pysam's I/O redirection in an upcoming release so that it will be immune to whether samtools/bcftools closes these file descriptors or not anyway.)

jkbonfield commented 1 year ago

I like this as a change. It does feel more natural to me: as you state, if the file wasn't opened by htslib then maybe it's going too far for it to be doing the closing.

Also, given we're paranoid and use fdatasync / fsync already, I expect that even if people don't explicitly do a close-and-check themselves, almost all errors will have been spotted already (e.g. ENOSPC).
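
For illustration, a hedged sketch (not HTSlib's actual internals) of why a sync before close tends to surface such errors early; EINVAL is tolerated because pipes and ttys cannot be synced, which is exactly the fdatasync(1) = EINVAL seen in the strace output later in this thread:

#include <errno.h>
#include <unistd.h>

static int sync_then_close(int fd) {
    /* Ask the kernel to commit the data while we still have a descriptor
       to report errors against (e.g. ENOSPC, EIO). */
    if (fsync(fd) < 0 && errno != EINVAL && errno != EROFS)
        return -1;
    return close(fd);
}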

jmarshall commented 1 year ago

Hmmm… interesting point about fsync. POSIX is extremely vague about the circumstances in which close() might return EIO: “An I/O error occurred while reading from or writing to the file system”. MacOS's BSD-heritage man page says “A previously-uncommitted write(2) encountered an input/output error” which is slightly more helpful.

You may also be amused to hear that POSIX and Linux have opposite answers to “what happens if during dup2(), the as-if-by-close closing of the previous dest fd fails due to an I/O error?”…

jkbonfield commented 1 year ago

Thinking on things that can go wrong - devil's advocate if you wish. (I do in fact like the idea in this PR.)

  1. Potential for a broken tool changing behaviour, where we write non-HTS data to stdout after close, which was silently dropped on the floor in the past. Eg:

     while ((r = sam_read1(in, h, b)) >= 0) {
         sam_write1(out, h, b);
         n++;
     }
     hts_close(out);
     printf("Wrote %d records\n", n);

This would previously report the number of records written when given a file, or when given stdout it'd just lose the printf as stdout had already been closed. Now it wouldn't get closed and we'd have chit-chat appearing in the output file instead. This is quite similar to the occasional nohup bugs people hit due to mixing in stderr when using > file instead of -o file.

I'm not hugely concerned as I'd argue this sort of tooling is broken and also trivial to fix, but I wouldn't be surprised if this class of bug also exists in the wild too.

  2. Pipes don't get checked with fdatasync, so they may actually be a genuine source of data loss that we can no longer verify. Eg:
cat in.sam | strace -o tr ./test/test_view - | cat > /dev/null

(Note this also hints that we maybe should fix test_view too to have an explicit close.) The strace output shows no close of stdin (harmless) or stdout (more serious). It ends with:

read(0, "", 4096)                       = 0
write(1, "H7JKIB>ACCBF,AF@?BED>C6C1\tX0:i:1"..., 3046) = 3046
fdatasync(1)                            = -1 EINVAL (Invalid argument)
exit_group(0)                           = ?

So we see the issue with fdatasync not working here. In this regard, perhaps close would spot errors. Then again, it's possible they'd just get lost in exactly the same manner. We're checking our writes, but it's the internals of the kernel which matter here. So we've sent some data down a pipe or socket, and the write succeeded at that point - we put it in the "todo" buffer basically - but it is later found to fail. It's not something that is trivial to test either.

Given how we prefer -o file over > file, really pipes should be the primary source of writing to stdout.

jmarshall commented 1 year ago

  1. I also considered adding another mode letter to override it, e.g., hts_open("-", "wO") (O for owned / one-time) would make hts_close() really close stdout. Or leaving closing as the default, so that applications would have to write hts_open("-", "wS") to get the new behaviour.

    However it all seemed a bit complicated and unclean. I too would think that a program like the one you describe is broken.

  2. I always vaguely assumed the startup code would close stdin/out/err, but I see it just leaves it to exit_group to close them all. It wouldn't know what to do to report failures, so I guess there'd be no point in closing them explicitly.

    I've added an fclose to test_view.

    Re pipes, there's no delayed write to an underlying device going on, so I'd be surprised if close ever reported an error on a pipe. The failure mode would be more along the lines of writing to another process that abended prematurely — and you'd expect to see diagnostics (either from the program itself or whichever shell spawned it) about the other process's unfortunate end. (Or quite likely you'd still be writing to the pipe, so you'd get to emit a diagnostic about the EPIPE too; there is a small sketch of this after this comment.)

    In any case, by adding explicit closes at the end of main() in samtools and bcftools, things are equivalent to before this change — at least for those two programs.
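
Regarding the pipe failure mode in point 2, a small self-contained illustration (hypothetical code, not from the PR): once the reading end of a pipe has gone away, the writer sees EPIPE at write() time rather than at close(), provided SIGPIPE is ignored.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    signal(SIGPIPE, SIG_IGN);          /* get EPIPE instead of being killed */
    if (pipe(fds) < 0) return 1;
    close(fds[0]);                     /* simulate the reader abending */
    if (write(fds[1], "x", 1) < 0 && errno == EPIPE)
        fprintf(stderr, "EPIPE reported at write() time, not at close()\n");
    close(fds[1]);
    return 0;
}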

jkbonfield commented 1 year ago

We seem to be in agreement on the various points, including what is valid and what is broken code that we need not support. Thanks for clarification on the pipe case too.

Agreed also that trying to override this with extra hts_open letters is unnecessary complication. Besides, if it turns out to be something required then it can still be done later. I'll move on to reviewing the samtools PR. Thanks