madler / zlib

A massively spiffy yet delicately unobtrusive compression library.
http://zlib.net/
Other
5.55k stars 2.42k forks source link

Unify all Windows build configuration on the same CharacterSet #888

Open arixmkii opened 8 months ago

arixmkii commented 8 months ago

Right now there are cases, when configuration sets Multibyte https://github.com/madler/zlib/blob/643e17b7498d12ab8d15565662880579692f769d/contrib/vstudio/vc17/miniunz.vcxproj#L45 but in other places it uses Unicode https://github.com/madler/zlib/blob/643e17b7498d12ab8d15565662880579692f769d/contrib/vstudio/vc17/miniunz.vcxproj#L50 and https://github.com/madler/zlib/blob/643e17b7498d12ab8d15565662880579692f769d/contrib/vstudio/vc17/zlibstat.vcxproj#L72 and in many it is left unset. I believe it would be beneficial to set all build configurations to the same default for the least surprise behavior. It seems that most common explicitly set in Multibyte, so, probably it should be explicit value everywhere.

irwir commented 7 months ago

Multibyte is obsolete, Unicode is the right way to go.

Mr-Clam commented 7 months ago

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.

Neustradamus commented 7 months ago

@madler: Can you look this issue?

irwir commented 7 months ago

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that.

This is mostly correct, yet shows zero experience with Visual Studio and Windows. Use Unicode Character Setin VS projects implies UTF-16 LE;. and the same UTF-16 LE is internally used by Windows. Use of multi-byte (DBCS) was deprecated since VS 2013.

arixmkii commented 7 months ago

According to this https://devblogs.microsoft.com/cppblog/mfc-support-for-mbcs-deprecated-in-visual-studio-2013/ it looks like they later undeprecated it, but still recommend Unicode for new code bases. Also, PowerShell team seems to be using Multibyte for their rebuild of ZLib https://github.com/PowerShell/ZLib/blob/20ce7b065269cda22beb1cd9ecc450be5ea99b8f/build/zlib.vcxproj#L56 most notably used by their OpenSSH fork. I don't think they would be using it if it would be completely deprecated.

Any downstream would be able to change the settings to match their specific needs in the rebuild. So, if it will be Unicode across the board for all projects it should be just fine.

madler commented 7 months ago

Does this choice make any difference at all in the compiled zlib code? I believe that zlib does not use any types or functions that change their nature depending on Unicode vs. MultiByte. There is only explcit use of the wide character type and associated functions for gzopen_w(), and for error reporting.

irwir commented 7 months ago

@arixmkii

it looks like they later undeprecated it, but still recommend Unicode for new code bases.

It was repeated twice that the warning would be removed. Even with continued support this is not exactly undeprecated. One incurable flaw is that two differnt Unicode strings could be converted to the same MBCS byte sequence.

irwir commented 7 months ago

Does this choice make any difference at all in the compiled zlib code? I believe that zlib does not use any types or functions that change their nature depending on Unicode vs. MultiByte. There is only explcit use of the wide character type and associated functions for gzopen_w(), and for error reporting.

Never had reasons to look at this part of the library because compression and decompression code do not depend on character width choice, and it worked well for me with Unicode. But there is difference when calling Windows API even indirectly. Explicit use (or mix) of wide and byte characters is always possible, but requires caution. For example, https://github.com/madler/zlib/blob/88ec24670e9c484219392def5f2f4b5e3f839db3/gzlib.c#L206 Copy of wchar_t string to char string usually would be the first character only, because second byte is null.

Another issue is Unix-ism 0666 https://github.com/madler/zlib/blob/88ec24670e9c484219392def5f2f4b5e3f839db3/gzlib.c#L233 By some luck this value includes desired flags, but properly it should be _S_IREAD | _S_IWRITE for read and write access.

Also, there are peculiarities when using printf with wide characters. Standard way of handling character width is using TCHAR and _T (or _TEXT) macros, and macros for calling functions, such as _topen.

In conclusion it might be said that conversion to Unicode is desirable, but cannot be done by simple substitution of parameters in VS projects.

madler commented 7 months ago

The question was "Does this choice make any difference at all in the compiled zlib code?"

To be explicit, do any of the bits in the resulting library differ if I change Unicode to MultiByte in the project file?

It's those T and t functions and types that change behavior as a result, my point being that zlib doesn't use those.

Neither strcpy() nor any printf() variant is ever called with a pointer to wide characters in zlib.

_S_IREAD | _S_IWRITE is the first 6 in the 0666.

A Unicode open function in zlib is already provided by gzopen_w().

Mr-Clam commented 7 months ago

Unicode is multi-byte. Unicode encoded as UTF-8 is variable length multi-byte at that. This is mostly correct, yet shows zero experience with Visual Studio and Windows. Use Unicode Character Setin VS projects implies UTF-16 LE;. and the same UTF-16 LE is internally used by Windows. Use of multi-byte (DBCS) was deprecated since VS 2013.

Actually, I have plenty of experience with this. I even benchmarked an application that did a lot string process about 20 years ago and found building for Unicode was slower. It seems the conversion in the all the "A" functions in the Win32 API is less expensive than doubling the size of all your strings with mostly null bytes. Of course, if zlib only uses char instead of something portable like TCHAR, the point is moot. Somebody has to convert char to wchar_t if there's a Win32 call - you either do it explicitly or it's done implicitly for you. For most users, using single byte is fine... until of course they have to use strings (e.g. filenames) that contain "foreign" characters that can't be expressed in their current active code page. Even Windows' double-byte Unicode support isn't as simple as you think: how does it handle surrogates? (That's a rhetorical question)

Looking at @madler 's comment, I'm wondering if defining Unicode is necessary at all, and also, what problem occurs if you don't define this?

irwir commented 7 months ago

It's those T and t functions and types that change behavior as a result, my point being that zlib doesn't use those.

In any case, convertion to multi-byte string was not necessary.

_S_IREAD | _S_IWRITE is the first 6 in the 0666.

This also sets Execute/search permission and other (possibly reserved for the future or used internally) bits.