madler / zlib

A massively spiffy yet delicately unobtrusive compression library.
http://zlib.net/
Other
5.55k stars 2.42k forks source link

Fix for issue of 32-bit z_off64_t with MinGW{-w64} while _LARGEFILE64_SOURCE=1 & _LFS64_LARGEFILE=1. #934

Closed dbjh closed 5 months ago

dbjh commented 6 months ago

Problem

The issue causes the 64-bit functions to effectively use a 32-bit type, with the obvious consequences. For example, gztell64() will not return a correct value for (gzipped) files larger than 4GB if the file pointer has been moved beyond that point. The problems occurs with MinGW and MinGW-w64 (GCC & Clang). The 64-bit functions are only available when the defined constants _LARGEFILE64_SOURCE and _LFS64_LARGEFILE are non-zero. In zlib.h it is stated that

provide 64-bit offset functions if _LARGEFILE64_SOURCE defined

and

in case these are set on systems without large file support, _LFS64_LARGEFILE must also be true

With MinGW-w64 the problem can be worked around by defining _FILE_OFFSET_BITS as 64 and defining {Z_}HAVE_UNISTD_H, but that does not match with those statements -- only defining _LARGEFILE64_SOURCE and _LFS64_LARGEFILE as non-zero should be enough.

Cause

In case of MinGW, z_off_t is always 32-bit and _FILE_OFFSET_BITS is ignored. The file zconf.h contains a test, defined(_WIN32) && !defined(__GNUC__) to guard a reference to __int64. Perhaps based on the assumption that GNU C compilers do not support that type, while other Windows compilers do. That assumption is at least partially incorrect. To make that piece of preprocessor logic work correctly for MinGW, z_off64_t should be defined as __int64.

Solution

As far as I know, all ports of GCC on Windows (i.e. defining _WIN32) are (part of):

Clang is not a port of GCC, but like GCC defines __GNUC__, usually. Except for Cygwin, they all support __int64 on Windows.

I realize there are more details regarding __GNUC__ on Windows. Clang may not always define it (https://blog.conan.io/2022/10/13/Different-flavors-Clang-compiler-Windows.html). If a compiler does not define __GNUC__, it either also supports __int64 or it cannot compile zlib in its current form (if Z_LARGE64=1 and/or _LARGEFILE64_SOURCE=1 and _LFS64_LARGEFILE=1). Clang as part of Cygwin does not support __int64. Clang as part of MinGW-w64 does.

So, although a positive test for compilers that support __int64 may be better, it is safer to stay as close to !defined(__GNUC__) as possible, i.e., just expand the set of compilers that will refer to __int64 with the ones mentioned above. In my pull request I propose defined(_WIN32) && !defined(__CYGWIN__). That solves the problem for MinGW and MinGW-w64 (GCC & Clang) and does not introduce a problem for Cygwin (GCC & Clang).

Note that MinGW-w64 (GCC & Clang) does not solely depend on __int64, because there z_off_t can be 64-bit. However, as mentioned above this only happens when _FILE_OFFSET_BITS=64 and {Z_}HAVE_UNISTD_H defined, which should not be a requirement (similar to how it is no requirement with Cygwin or on 64-bit Linux).

The attached code tell.zip can be used to demonstrate the problem and as verification that the problem is not present or has been solved. Example: tell file_larger_than_4gb.

Update (2024-02-27)

I looked a bit deeper into support for __int64 by GCC and made the following findings:

The above explains the behavior I described for Clang depending on the environment (MinGW-w64 versus Cygwin).

Availability of __int64 with GCC depends on inclusion of certain types of header file, which can be a problem (compilation error) if ZSOLO is defined. So, my fix required a test whether \_int64 is defined. That is why I changed: && !defined(__CYGWIN__) into: || (defined(__GNUC__) && defined(__int64)) There are of course numerous alternative ways to solve this issue, but this way may be the smallest modification. In the context of this issue zconf.h itself includes the standard C header file unistd.h, so for MinGW and MinGW-w64 availability of __int64 is guaranteed.

If windows.h is included (zconf.h includes windows.h under certain conditions) __int64 is available with Cygwin too. Note that this issue doesn't occur with Cygwin.

dbjh commented 6 months ago

For ease of testing with a file larger than 4GB I have attached a gzipped file of 5,000,000,000 bytes (4,852,396 bytes compressed). The file consists entirely of bytes with value 0, except the byte at offset 0x100000000 (4GB), in order to help detect that all data is read (instead of internal file pointers of zlib wrapping around at 4GB). The expected output of ./tell bigfile.gz with tell.c compiled with either MinGW or MinGW-w64 (GCC & Clang) and zlib 1.3.1 with my pull request applied is:

Using zlib version: 1.3.1 File position: 5000000000; sizeof (z_off64_t): 8

vszakats commented 6 months ago

defined(_WIN32) && !defined(__CYGWIN__)

Out of curiousity, which specific environment is defining both _WIN32 and __CYGWIN__ at the same time?

dbjh commented 6 months ago

defined(_WIN32) && !defined(__CYGWIN__)

Out of curiousity, which specific environment is defining both _WIN32 and __CYGWIN__ at the same time?

There are a couple of different scenarios where that is the case for Cygwin.

  1. When _WIN32 is defined somewhere before including zconf.h. This used to be the case for (32-bit) Cygwin after including certain headers from the w32api directory, for example windows.h. windows.h used to include windef.h: https://www.cygwin.com/cgit/newlib-cygwin/tree/winsup/w32api/include/windef.h?h=cygwin-1_7_17-release#n32 This is no longer the case. Nowhere in the current version of the Win32 headers is _WIN32 defined, but one of zlib's goals appears to be backwards compatibility with old systems. In later versions of the Win32 headers WIN32 is still defined in minwindef.h (included by windows.h).
  2. When the GCC option -mwin32 is specified. See https://cygwin.com/faq.html#faq.programming.preprocessor and https://gcc.gnu.org/onlinedocs/gcc/x86-Windows-Options.html. This applies to both 32-bit and 64-bit Cygwin.
  3. When the GCC option -D_WIN32 is specified. This is basically a more portable version of -mwin32. This applies to both 32-bit and 64-bit Cygwin.
vszakats commented 6 months ago

Interesting, thanks!

I understand this is to protect against a false _WIN32 signal, which seems more prone to happen under Cygwin specifically. zconf.h being public, might be affected.

dbjh commented 6 months ago

Interesting, thanks!

I understand this is to protect against a false _WIN32 signal, which seems more prone to happen under Cygwin specifically. zconf.h being public, might be affected.

You are welcome.

I am not sure what you mean by "false _WIN32 signal". Cygwin implements POSIX and as explained in the second link in my previous post, does not define _WIN32 to avoid certain problems. However, it also allows access to the Win32 API. Especially the headers in the w32api directory contain a lot of conditional code guarded by tests whether _WIN32 is or is not defined. In order to use certain Win32 functionality _WIN32 has to be defined. Originally, including windows.h was taken as an indication that the program intended to use the Win32 API and windef.h would define _WIN32 to make that possible. Apparently the Cygwin developers changed their minds later and concluded that it was better to make the programmer explicitly define it. My point being that the combination of both __CYGWIN__ and _WIN32 defined is a valid state.

My PR only concerns the issue of 64-bit variants of zlib functions effectively being 32-bit with MinGW and MinGW-w64 while not introducing new issues for other environments. It does not concern whether zlib works properly with Cygwin with _WIN32 defined or not. With Cygwin I have only tested with _WIN32 undefined and have not encountered any issues.

vszakats commented 6 months ago

By false signal I mean that _WIN32 is the signal to indicate that the C compiler is targeting native Windows. Cygwin is not native Windows. If this is defined manually or via manual options in such environment, it misrepresents the situation. But, of course this isn't so clear cut, because the Windows API is available under Cygwin and surely someone will want to use it, and in such a hybrid scenario it's up to anyone's imagination what _WIN32 should really mean (targeting native Windows? using windows.h?). As I understand, by design _WIN32 isn't supposed to be a user option or be defined manually, and not even via Windows API headers (there is WIN32 for that). I think Cygwin devs made the correct decision to fix that in the headers eventually.

Either way, since this happens in existing code that will include zlib.h and zconf.h with it, it makes perfect sense to take this combination into account here (and in public headers in general).

dbjh commented 6 months ago

Either way, since this happens in existing code that will include zlib.h and zconf.h with it, it makes perfect sense to take this combination into account here (and in public headers in general).

I agree. If there are any issues with that, they can be addressed in a new PR. This PR does not affect any such issue.

sezero commented 5 months ago

MinGW compiler doesn't define __int64, it is defined in _mingw.h (which you already stated), and zconf.h isn't directly included but via zlib.h (first include in there). So, if a source includes zlib.h first, it will never hit this in a mingw build. Or am I missing something?

vszakats commented 5 months ago

All mingw-w64 versions (1.0 and up) support __int64. Is there a cleaner way to detect mingw-w64? Without relying on _mingw.h, ideally?

I don't know about old-mingw's __int64 support, but in case it also supports this, defined(__MINGW32__) could do the trick.

sezero commented 5 months ago

All mingw-w64 versions (1.0 and up) support __int64. Is there a cleaner way to detect mingw-w64? Without relying on _mingw.h, ideally?

I don't know about old-mingw's __int64 support, but in case it also supports this, defined(__MINGW32__) could do the trick.

All mingw compilers define __MINGW32__. The only was to differentiate mingw and mingw-w64 is including _mingw.h and checking for existence of __MINGW64_VERSION_MAJOR.

However, old mingw does support __int64 and #define it in _mingw.h (mingw-w64 actually inherits that behavior from old mingw.) Both old mingw and mingw-w64 defines __int64 as long long

vszakats commented 5 months ago

Thanks @sezero! (also for your mingw-w64 builds)

Looking at the the guard: _WIN32 already covers both mingw-w64/old-mingw unconditionally. It makes me wondering which non-Windows compilers would be covered by the additional (defined(__GNUC__) && defined(__int64)).

#elif defined(_WIN32) || (defined(__GNUC__) && defined(__int64))

edit: after reading the updated commit message. For old-mingw/mingw-w64, the current patch still requires _mingw.h which isn't explicitly included and thus may or may not be present. To fix that, is there any reason why this wouldn't work?:

#elif defined(__MINGW32__)
#  define z_off64_t long long
#elif defined(_WIN32)
#  define z_off64_t __int64
#elif [...]

More clues in curl's system.h, which also handles ancient MSVC and Borland C versions that define _WIN32 but lack an __int64: https://github.com/curl/curl/blob/b6006381fb9f397b765e8fcd88b1d5764fac0612/include/curl/system.h#L364

madler commented 5 months ago

Applied. Thanks.

sezero commented 5 months ago

What is applied? Not the patch in this P/R I hope..

vszakats commented 5 months ago

Doesn't seem to be applied, but hard to say for sure without any references.

The last update to zconf.h was a different patch, for DJGPP: https://github.com/madler/zlib/commit/04134633fa6f8a9ac87ffd4b2683ca61bd41acd6

Reading existing code, mingw-w64 (default, non-Z_SOLO) builds set z_off64_t to off_t (== long long). This needs unistd.h, which is detected via CMake and should also be detected by ./configure. This seems OK for mingw-w64.

For old-mingw it depends on the off_t size there. For Z_SOLO builds it's not OK, but changing it would be an ABI break?

madler commented 5 months ago

What is applied? Not the patch in this P/R I hope..

Applied, but not pushed yet. Yes, the patch here, with that line changed to:

#elif defined(_WIN32) || (defined(__GNUC__) && defined(__int64))

What's wrong with it?

sezero commented 5 months ago

Applied, but not pushed yet. Yes, the patch here, with that line changed to:

#elif defined(_WIN32) || (defined(__GNUC__) && defined(__int64))

What's wrong with it?

As I explained above it is broken.

dbjh commented 5 months ago

Applied, but not pushed yet. Yes, the patch here, with that line changed to: #elif defined(_WIN32) || (defined(__GNUC__) && defined(__int64)) What's wrong with it?

As I explained above it is broken.

It has a flaw, but you did not explain. You actually asked whether you missed something and you did. There is a lot of context which is easy to miss. This PR addresses a specific issue and that is what it does. You forgot that zconf.h can include headers itself. In the context of this issue it includes unistd.h, which causes as I explained in excruciating detail _mingw.h to be included. If the condition defined _LARGEFILE64_SOURCE && defined _LFS64_LARGEFILE evaluates to zero it is unclear to me whether there is a hard requirement for z_off64_t to be 64-bit. At least for a long time this was not the case in the context of this issue, so the requirement cannot be that hard. The test for whether __int64 has been defined is to make sure the code compiles when Z_SOLO has been defined. However, if Z_SOLO has been defined certain features will be unavailable. For example, z_off_t will not be properly defined for all cases (_FILE_OFFSET_BITS no longer applies). As I stated above, there are numerous alternative ways to solve this issue, but this way may be the smallest modification.

Update: Z_SOLO is a pain. _WIN32 is defined for MinGW{-w64}, so if Z_SOLO is defined, the second part of the condition will not be tested and cause z_off64_t to be defined as __int64 (which is an unknown identifier in that case). The patch by @vszakats in his post should work.

Update 2: My updated patch to deal with Z_SOLO is: #elif (defined(_WIN32) && !defined(__GNUC__)) || (defined(__GNUC__) && defined(__int64))

Update 3: Since this PR has been closed, for the sake of completeness my latest commit.

vszakats commented 5 months ago

Update 2: My updated patch to deal with Z_SOLO is: #elif (defined(_WIN32) && !defined(__GNUC__)) || (defined(__GNUC__) && defined(__int64))

The question remains: why use __in64 for mingw* (and cygwin), when a native long long type is there without any of the complexity and uncertainty of __int64? (and offers the exact same type, always)

Relying on __int64 being a macro is brittle, plus it expects _mingw.h to be included indirectly via certain headers, which is brittle too.

dbjh commented 5 months ago

Update 2: My updated patch to deal with Z_SOLO is: #elif (defined(_WIN32) && !defined(__GNUC__)) || (defined(__GNUC__) && defined(__int64))

The question remains: why use __in64 for mingw* (and cygwin), when a native long long type is there without any of the complexity and uncertainty of __int64? (and offers the exact same type, always)

Relying on __int64 being a macro is brittle, plus it expects _mingw.h to be included indirectly via certain headers, which is brittle too.

The benefits of my change:

The benefits of your change:

Relying on int64 is not brittle as I have extensively explained in the description. MinGW, MinGW-w64 and Cygwin all rely on the type in their Win32 API headers. Support for int64 is also a requirement for compatibility with Visual C++, which is the de facto standard compiler on Windows. zconf.h relies on __int64 being available for a set of compilers that define _WIN32. My change does not rely on int64 being a macro -- it tests for that with `defined(int64)`. The way support for __int64 is implemented (through _mingw.h) is somewhat unfortunate, but clearly defined. The fact that _mingw.h is included is an implementation detail, so it is unsurprising that this is relatively unknown.

vszakats commented 5 months ago

__int64 being a macro is a vendor-specific implementation detail which may or may not stay. Making sense of defined(__int64) also assumes intimate knowledge of these details.

On the other hand __MINGW32__ is the standard way to guard mingw-specific code. Widely used, documented, unlikely to ever change and familiar to all readers of portable code.

Adding 2 more short lines wins in readability (and extendability) IMO.

dbjh commented 5 months ago

__int64 being a macro is a vendor-specific implementation detail which may or may not stay. Making sense of defined(__int64) also assumes intimate knowledge of these details.

On the other hand __MINGW32__ is the standard way to guard mingw-specific code. Widely used, documented, unlikely to ever change and familiar to all readers of portable code.

Adding 2 more short lines wins in readability (and extendability) IMO.

Look, I do not care much which implementation @madler will choose in the end, but at least use honest arguments. Suggesting that Microsoft will suddenly decide to remove int64 is ludicrous. Also, perhaps MinGW, MinGW-w64 and Cygwin would not even follow such a change. It is mere speculation. Regardless, int64 is used in an enormous codebase that will not go away anytime soon. I agree that defined(__int64) may not be immediately obvious, but it is also not that complicated either. The argument of defined must be a macro. BTW as opposed to what was suggested above, testing for MinGW-w64 does not require inclusion of a header (at least, not in the last several years). Both GCC and Clang define the "standard" __MINGW64__ :-)

sezero commented 5 months ago

Suggesting that Microsoft will suddenly decide to remove __int64 is ludicrous.

No one suggested that...

Both GCC and Clang define the "standard" MINGW64 :-)

Really now?? __MINGW64__ is defined by mingw only when targeting 64 bit windows.

Geez..

dbjh commented 5 months ago

Suggesting that Microsoft will suddenly decide to remove __int64 is ludicrous.

No one suggested that...

Both GCC and Clang define the "standard" MINGW64 :-)

Really now?? __MINGW64__ is defined by mingw only when targeting 64 bit windows.

Geez..

Thanks for the contructive feedback. Yes, they do, depending on the context. But to clarify it for you, the (standard) 64-bit GCC and Clang as part of MinGW-w64 define __MINGW64__ as 1. What you probably referred to is the 32-bit GCC as part of MinGW-w64. Probably, because MinGW (not mingw) is a trademark and the owner is rather protective about it. Also MinGW-w64 does not target anything, the specific toolchains do. I have seen the 32-bit GCC of MinGW-w64 being referred to as MINGW32, but there does not seem to be a consensus about it. According to Wikipedia MinGW was once called mingw32, so this causes confusion. As far as I know an important reason to use MinGW is to target Windows versions older than Windows XP, something that is no longer supported by the standard 32-bit GCC as part of MinGW-w64. Using the latest version of GCC as part of MinGW also results in incompatibilties with Windows 95 when compiling C++ code.

sezero commented 5 months ago

Geez. That 'Unsubscribe' button is my friend here.

vszakats commented 5 months ago

I never suggested anything about Microsoft (where __int64 is a native type). Speaking of mingw and its __int64 macro, which part of my argument was "dishonest"?

vszakats commented 5 months ago

@dbjh: __MINGW32__ identifies old-mingw/mingw-w64 targets for all CPUs. This has been so for the last 20 years. More details here: https://github.com/cpredef/predef/blob/master/Compilers.md#mingw-and-mingw-w64

Seconding @sezero here with the hope to have this reverted.

dbjh commented 5 months ago

@dbjh: __MINGW32__ identifies old-mingw/mingw-w64 targets for all CPUs. This has been so for the last 20 years. More details here: https://github.com/cpredef/predef/blob/master/Compilers.md#mingw-and-mingw-w64

Seconding @sezero here with the hope to have this reverted.

This discussion got off-topic. I identified a longstanding issue in a fundamental feature for which I proposed a change to a single line. I provided the rationale and all relevant background material, including the relevant history of MinGW, the predecessor of MinGW-w64. It took me several tries to also deal with situations that even the author sometimes makes mistakes with, but the final version is decent. I acknowledged that there many ways to solve the issue and I even pointed to your alternative, although it has its own cons. However, the attitude that I was greeted with leaves somewhat of a bad aftertaste, and now that attitude is leading to the point of rejecting a solution. It is unprofessional. This does not help the zlib project.