samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Avoid bgzf EOF check on modern MinGW releases. #1601

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

Unfortunately this test is proving unreliable now following a change to lseek so that it no longer returns -1 on pipes.

Fixes samtools/bcftools#1901

daviesrob commented 1 year ago

I think this is treating the symptom rather than the cause. The MinGW-w64 lseek64() seems to be a simple wrapper around _lseeki64(). According to the Microsoft _lseeki64 documentation, "On devices incapable of seeking (such as terminals and printers), the return value is undefined." So it's not clear how this ever worked...

I think the full solution might be an fstat() on the file descriptor in hopen_fd() and hdopen() to check if it's a regular file. The result can be added to the bitfield in hFILE_fd. Then fd_seek() could simply check this on Windows, and refuse to seek on non-regular files. This would catch all attempts to seek, and not just the BGZF EOF check.

jkbonfield commented 1 year ago

Undefined doesn't mean it never worked. It just means they're free to change their mind on implementation. The official posix definition of lseek though is clear, and MinGW has attempted to implement POSIX semantics even when the underlying microsoft interface doesn't. Obviously it can't catch all scenarios though.

My fix was the minimal and most simple solution to get bcftools CI working once more, so I think it's still worth doing for now.

A more invasive change to htslib will inevitably mean a longer development and testing time, especially given other commitments this week. Would you consider doing both?

Edit: also see https://github.com/samtools/bcftools/issues/1901#issuecomment-1503090263 last paragraph. Firstly, this worked with gcc 11 but not with 12. Upgrading gcc also upgrades all sorts of other associated stuff, so it's unclear if it's compiler or the libc which has the change, but regardless it demonstrates my comment about about mingw normally working around windows posix breakages, which is why we use it. Your alternative is also basically what I outlined too, although I hadn't thought of using fstat.

jmarshall commented 1 year ago

My fix was the minimal and most simple solution to get bcftools CI working once more, so I think it's still worth doing for now.

It's not minimal though: having bgzf_check_EOF_common() always return 2 means that you have to hack _testbgzf.c too. (And it appears it changes functionality on Windows, suppressing the EOF check for ordinary files too.)

There's an example of the sort of _WIN32-specific code needed next door in fd_write(); wouldn't it suffice to add similar in fd_seek():

 static off_t fd_seek(hFILE *fpv, off_t offset, int whence)
 {
     hFILE_fd *fp = (hFILE_fd *) fpv;
+#ifdef _WIN32
+    if (GetFileType((HANDLE)_get_osfhandle(fp->fd)) == FILE_TYPE_PIPE) { errno = ESPIPE; return -1; }
+#endif
     return lseek(fp->fd, offset, whence);
 }

That's equivalent to Rob's suggestion, except on Windows it does the check on every seek instead of doing it just once on Windows when opening the file and recording the answer.

daviesrob commented 1 year ago

Yes, that would be a simpler fix albeit less efficient as the check really only needs to be done once. I'd prefer to look for GetFileType((HANDLE)_get_osfhandle(fp->fd)) != FILE_TYPE_DISK as it would also catch character devices, which also can't seek.

I don't think MinGW has ever claimed to implement POSIX semantics. It's lseek() is just a wrapper around the Microsoft C runtime function, so what happens depends on what that does and not on anything in MinGW.

I expect cram_check_EOF() on a pipe would be affected as well as bgzf_check_EOF_common(). Fixing hseek() would sort that one too.

jkbonfield commented 1 year ago

I meant minimal in code. Obviously it had the side effect of not checking, but that was the whole point - the check no longer works due to mingw changes.

I wasn't aware of how to do this using windows native API, but that code snippet shows the way. Thanks.

jkbonfield commented 1 year ago

I wasn't aware of how to do this using windows native API, but that code snippet shows the way. Thanks.

The irony of this statement is, those lines of code were written by myself! Haha. Apparently I once did know how to do this, or managed somewhat better on search terms in google and/or stackoverflow to set the ball rolling. :)

Anyway I've confirmed it's fixing (again) my local mingw build of bcftools (glacially), so will work on getting this PR revised.

daviesrob commented 1 year ago

I've managed to reproduce the bug on Windows server 2019 using MinGW-w64 in both MINGW64 and UCRT64 modes, and can confirm that the revised patch fixes both.

An easy way to reproduce the problem was this:

cat  test/ce#large_seq.tmp.bam | ./test/test_view.exe -p /dev/null -

test/ce#large_seq.tmp.bam is made by running make test and at 1.7M is plenty big enough to trigger the issue.

At some point we may want to look into making file access on Windows use the more native APIs. The POSIX-like ones seem to have some quirks, e.g. msvcrt _open() returns error "No such file or directory" when trying to open a named pipe. The MSYS2 wc command can do that, so it must be possible somehow...

Meanwhile I'll merge this so that bcftools' tests will become more reliable again.