msys2 / MINGW-packages

Package scripts for MinGW-w64 targets to build under MSYS2.
https://packages.msys2.org
BSD 3-Clause "New" or "Revised" License
2.31k stars 1.23k forks source link

binutils: .ident spam in executables #21

Open snaury opened 10 years ago

snaury commented 10 years ago

Since gcc version 4.8 gcc started to include .ident directives in assembly files, which binutils gas converts into strings into .rdata$zzz for COFF targets (i.e. windows). The problem however that basically every object ends up with a string like GCC: (Rev6, Built by MSYS2 project) 4.8.2, and when linking all of them are tucked at the end of the .rdata section. The problem here is with section flags. On ELF (i.e. linux) the section is added with SEC_MERGE flag, so the resulting binary only contains a handful duplications (when different objects are compiled with different compilers or their versions). However for COFF no such deduplication occurs, which can easily result in long chains of hundreds of identical strings. While I support gcc's desire to advertise itself I worry this is way too excessive.

Not sure how to fix this, it's obviously an upstream bug, and wasn't visible because there weren't many comments before. Now there are. And if gas is ever fixed, all libraries would have to be recompiled, since section flags are in those object files.

What do you think? See for example how many of those comments are in gcc.exe (which is simply a big executable with lots of linked object files).

Alexpux commented 10 years ago

Some time ago we have similar questions as I remember on forum sources.ru to our mingw-builds project. But no one solution was found. All that we can do here now is just remove configure parameter. But in this case you can't understand what build do you use.

snaury commented 10 years ago

Removing build parameter wouldn't help, it would simply change the exact string, nothing else.

Alexpux commented 10 years ago

Ok. Then maybe you can fill bug report for binutils?

katlogic commented 9 years ago

echo *.o *.a | xargs -n 1 objcopy.exe -R '.rdata$zzz' but bleh ...

vszakats commented 7 years ago

This may be resolved by passing option -fno-ident to the compiler.

These ident records not only add spam and increase binary sizes, but more importantly they make it more difficult to create reproducible binaries across platforms. The reproducibility issue could be mitigated by just leaving the ident value to the default in MSYS2 builds — this would help because in Linux builds (that I know of) and macOS (Homebrew) builds, it's also left to its default value (e.g. GCC: (GNU) 7.2.0).

cnlohr commented 6 years ago

I really appreciate the work-around, @vszakats but man that is weird that this problem exists at all.

simondeschenes commented 5 years ago

So, -fno-ident removes the .rdata$zzz section, but you need to recompile all the libraries separately in order to get it finally working.

It seems ridiculous that -fno-ident is not set as the default for compiling on Windows then. Can we do something about this ?

I made some quick use of objcopy to remove some of them from libgcc.a, but there are still many still there. (I suppose, I'd have to extract all of the .so files, objcopy them and then reassemble everything).

Well, libgcc.a went from 2.7Mb to 622Kb. So, I am pretty sure it removed way more than just those duplicates and I now have an unusable libgcc library.

Can't this get fixed by having gcc use -fno-ident by default ?

Before moving to MSYS2, I was using mingw-builds. It looks like it is no longer maintained, this is why I moved.

mingw-builds was using -fno-ident by default on every compilation : https://github.com/niXman/mingw-builds/blob/master/library/config.sh

vszakats commented 5 years ago

It was kindly implemented by niXman as a follow-up to this report: https://github.com/niXman/mingw-builds/issues/412. Same solution would work here as well I'm guessing.

oscarfv commented 5 years ago

This request would be more compelling backed by some numbers. I just tested -fno-ident with a C++ project with 26 object files and the .dll size decreased about 1000 bytes... which is 0.1% of the initial size.

In theory, projects such as LLVM or Qt, which consists of thousands of object files, would benefit more from -fno-ident, but actual numbers are better than theory.

Finally, someone (TM) could file a bug report on binutils' tracker mentioning that strip on Windows does not remove the duplicate ident strings as it does on Linux. Fixing strip is the best solution, because those strings can be very useful when tracking mismatched libraries.

simondeschenes commented 5 years ago

@oscarfv, actually, on windows, in the COFF format, instead of being in the .ident section, they are inside of the .rdata section (.rdata$zzz subsection) and strip can't actually safely strip it. There are sometimes very important data that is needed by the exe in the .rdata$zzz section which will break an executable if removed.

The problem is not binutils not stripping the section, the problem is gcc adding the .ident tag in the .rdata$zzz section for the COFF windows format (because there is no .ident section in that format).

Even if you told binutils to remove it, it would be technically impossible for it to safetly identify if the data is needed or not. .rdata is "Read-only initialized data" and is defined as needed by the executable. Only the .debug sections are safely removable in the PE COFF format.

https://docs.microsoft.com/en-us/windows/desktop/debug/pe-format

In the case of a simple hello world executable, the string is already present about 20 times in the executable. It is present more than a hundred time in libgcc.a.

Yes, this is still just a few kilobytes, but it is still a few kilobytes too many that will actually be loaded in memory with the rest of the executable.

The -fno-ident is currently the only solution, except if gcc start adding its identifier inside of the .debug section.

oscarfv commented 5 years ago

Well, my point about reporting the issue upstream stands: is there a bug report on gcc's tracker about this problem?

vszakats commented 5 years ago

I believe any data that serves no purpose is best to be eliminated from generated/shipping binaries, if there is way. It this case there is an easy way, with no known drawbacks. The fix is also fit as a temporary solution till slower moving upstream projects such as GCC decides to change the default and/or is able to replicate *nix behaviour on Windows (such that strip actually removes these strings).

As for numbers it depends on the ident text, for example this text taken from an earlier MSYS2 version GCC: (Rev1, Built by MSYS2 project) 7.2.0 takes 40 bytes plus some, which should be multiplied by the number of object files (build without -fno-ident) linked to the resulting executable. In case of libiconv2-2.dll, which is a fairly small file (1 MB), the bloat is 1.6-2K. On my MSYS2 main bin directory there are 470 files, so it adds up quickly. In the past the most common "complaint" from potential MinGW users (whom I tried to explain the merits of this great toolchain) was the larger output size compared to other C compilers (notably MSVC) on the platform, so I took every opportunity to remove unnecessary bloat. However large/small these may be.

(Since -fno-ident fully resolved the issue on Windows, I haven't consulted GCC tracker about this.)

oscarfv commented 5 years ago

Those ident strings are there for a reason: to identify the compiler that built the object file and thus avoid or detect incompatible mixtures of libraries. On MinGW(-w64) this is specially important because there are so many variants out there, with different combinations of exception model, threading model and C++ ABI (too bad that the info does not include compiler options that modify the ABI).

Personally I'm not opposed to add -fno-ident to the compiler defaults, but the 0.1% of size improvement plus removing the means for checking which compiler built a given library makes me not interested on creating a PR either, which is the best way to advance things on this type of projects.

vszakats commented 5 years ago

In my (fairly long) experience with MinGW/GCC, the ident string never contains info such as exception/threading/ABI/build-options, only the packager project in some cases (like with MSYS2) and the version of the GCC compiler. It's highly doubtful this string is actively used by any part of the toolchain by actually reading that information, extracting the version number (or other data) from the freeform text and even acting upon it. I've never seen that happen and it'd be a rather flakey solution indeed, if that's the idea behind it.

LTO builds will reject linking objects with incompatible LTO metadata versions, but that's using an independent mechanism by reading this information from the object files (which information is not included in final executables!).

Even if the toolchain would offer a feature to warn about objects linked together with incompatible compiler properties, it makes no sense to include those properties in the final executables — it's well enough to include them in the object files only.

I can imagine that a human readable GCC version text may be useful for debugging/troubleshooting purposes in some cases, but for such purpose it'd be enough to contain the (or each, if there are multiple) ident text once, as opposed to dozens or hundreds of times. None of these solutions would answer which modules were built with which ident text identifier though, so it's quite a limited input.

That is why in its current form, it's merely and ad/spam text that serves limited if any purpose, and doing so overly verbosely.

If MSYS2 maintainers are interested to see the compiler rev/version in all executables built with their build of the toolchain, it'd be more elegant to include a single version string in one of the objects that is always present in the final executables. Using -fno-ident on all but a few selected objects would also be an improvement.

But, I still fail to see a compelling reason why the inclusion of this is important on Windows, and not important at all on other OS platforms and with other popular Windows C compilers.

oscarfv commented 5 years ago

Now I see that the info about exception mechanism and threading model is missing. Well, as I said, make a PR that disables ident by default and let's see what the maintainer thinks. But, again, 0.1-0.2% space savings is not terribly compelling.

vszakats commented 5 years ago

To clarify on the 0.1-0.2% figure (which apparently came from the single example I pulled in above), this depends on how many objects formed the output executable and the total size of it, so it may be easily higher than that. To quote an actual example, BusyBox for Windows claimed to achieve 3% size reduction by enabling this option: https://github.com/rmyorston/busybox-w32/issues/51

simondeschenes commented 5 years ago

Where should I go from there in order to get this fixed ? Should I make a pull request ?

oscarfv commented 5 years ago

@simondeschenes : That's what I would do. And, as a long term solution, a bug report on GCC bugzilla suggesting to move .ident to .debug section for targets that use COFF. Or, best of all, implement the later, send it as a patch to GCC and as a PR here :-)

Rinat84 commented 5 years ago

"D:\src\gcc-5.5.0\gcc\toplev.c"

  /* Attach a special .ident directive to the end of the file to identify
     the version of GCC which compiled this code.  The format of the .ident
     string is patterned after the ones produced by native SVR4 compilers. 
  if (!flag_no_ident)
    {
      const char *pkg_version = "(GNU) ";
      char *ident_str;

      if (strcmp ("(GCC) ", pkgversion_string))
    pkg_version = pkgversion_string;

      ident_str = ACONCAT (("GCC: ", pkg_version, version_string, NULL));
      targetm.asm_out.output_ident (ident_str);
    } */

"D:\src\gcc-5.5.0\gcc\c-family\c-lex.c"

  /* if (!flag_no_ident)
    {
      Convert escapes in the string.
      cpp_string cstr = { 0, 0 };
      if (cpp_interpret_string (pfile, str, 1, &cstr, CPP_STRING))
    {
      targetm.asm_out.output_ident ((const char *) cstr.text);
      free (CONST_CAST (unsigned char *, cstr.text));
    }
    }  */
Alexpux commented 5 years ago

So if all will be happy with currently adding -fno-ident to build flags then i will do it globally in makepkg

simondeschenes commented 5 years ago

With all the overtime I am currently doing at work, I just never got time to make a pull request. Thank you @Alexpux for taking care of it.

oscarfv commented 5 years ago

What @Alexpux is proposing is not the same as some people requested here. People complain about .ident in general, not only in the binary packages distributed by MSYS2/Mingw-w64.

Adding -fno-ident to makepkg avoids having those strings in the binaries we download here, but not from the binaries we build. The real fix is to patch binutils for our target. A quick looks points to the function obj_coff_ident Completely untested patch follows:

modified   gas/config/obj-coff.c
@@ -553,10 +553,7 @@ obj_coff_ident (int ignore ATTRIBUTE_UNUSED)
     /* We could put it in .comment, but that creates an extra section
        that shouldn't be loaded into memory, which requires linker
        changes...  For now, until proven otherwise, use .rdata.  */
-    sec = subseg_new (".rdata$zzz", 0);
-    bfd_set_section_flags (stdoutput, sec,
-              ((SEC_ALLOC | SEC_LOAD | SEC_READONLY | SEC_DATA)
-               & bfd_applicable_section_flags (stdoutput)));
+    sec = fetch_coff_debug_section();
   }
 #else
   subseg_new (".comment", 0);
vszakats commented 5 years ago

MSYS2/Mingw-w64 (and related packages the form the toolchain) are a little bit "more special", because its .ident strings are inevitably leaking into any project that use this compiler. Most trivially via the startup object, but also via other objects/libs that the C compiler links by default/automatically.

You're right though that the best would be to enable the option for every MSYS2 package build. If MSYS2 has a concept of a default/customisable CFLAGS/CPPFLAGS value, it may help solving it.

simondeschenes commented 5 years ago

I also like the what @oscarfv code tries to do. It tries to put the .ident inside of a .debug section. Stripping will remove every .debug section and should also remove the .ident string. However, it keeps the .ident string as part of the debug information.

I think it is a nice idea to explore in order to get a real fix going upstream to binutils.

vszakats commented 5 years ago

.debug section would be a nice place for this information indeed. (Or, for that matter any other section that gets stripped by strip.)

Rinat84 commented 2 years ago

The patch disables "-fident", which is enabled by default, but leaves it available, along with "-Qy" fident_distable.patch

jayrm commented 1 year ago

@simondeschenes

There are sometimes very important data that is needed by the exe in the .rdata$zzz section which will break an executable if removed

Do you know of any resource that might list or document other uses of .rdata$zzz section besides .ident?

simondeschenes commented 1 year ago

@simondeschenes

There are sometimes very important data that is needed by the exe in the .rdata$zzz section which will break an executable if removed

Do you know of any resource that might list or document other uses of .rdata$zzz section besides .ident?

IIRC, I used "strip -R .rdata$zzz myobjectfile.o" and it essentially completely broke the objects completely. I then tried to just manually set the complete "rdata$zzz" sections manually to all zeroes with an hex editor as a test and it also broke the object file.

This is where I stopped when I was trying to just strip that section from the object files.

edit: I think it is important to note what happened since I last commented here. We moved to a LLVM/clang custom toolchain instead of a GCC + binutils toolchain because of the much smaller exec size even when using -fno-ident.

I was very surprised when I received a mail with your message. This issue has been open for more than 9 years, almost 10 and is still not fixed...

I guess projects who have strict executable size requirements all moved away from msys2.

jayrm commented 1 year ago

IIRC, I used "strip -R .rdata$zzz myobjectfile.o" and it essentially completely broke the objects completely. I then tried to just manually set the complete "rdata$zzz" sections manually to all zeroes with an hex editor as a test and it also broke the object file.

Thank-you for your response. We were thinking to discard the section with a custom linker script as part of our build process - which seemed to work in tests, but really don't know for certain that it is a valid workaround, so was hoping for a confirmation one way or the other.

We use other some other toolchains besides msys2 and seen a similar accumulation of .ident strings due to those toolchain's supporting libraries, plus in our own libraries not having ever thought about -f[no-]ident until recently.

We use msys2 to help support our build environment but then use a variety of other toolchains (mingw-w64 variants) depending on what's to be targeted.

paranormalized-encryptid commented 5 months ago

Any news on this front? I've just installed the latest version of MSYS2, struggled to get the MinGW64 toolchain working due to this issue, and finally managed to statically compile the three ffmpeg libraries I needed in a very lean build only to find that my <500kb DLL has accumulated a total of 17 duplicate .ident strings from them. I really want to prevent this, but it sounds like the only way to do so right now is to build the MSYS2 MinGW64 package from scratch with a patch? Is that correct, and if so will a fix be available soon? Sorry if that's a newb question, but I had to uninstall and reinstall MSYS2 four or five times to get it working at all so I’m hesitant to go through all that if an alternative exists.

simondeschenes commented 5 months ago

It is almost correct.

There is no fix coming here because it there was one, it would already have been done. I mean, adding -fno-ident to the options when building GCC isn't something that takes a decade.

The GNU Binutils project considers using -fno-ident to be the correct solution as all sections in the PE COFF format except for the rdata ones will be loaded in memory. There really is no valid space in the executable format for an ident string.

Using strip to remove the section hasn't been sucessful for me, so do 't hope it will work.

However, a custom linker script seems to work fine (-T yourscript.lds). This is what they do on some projects like: https://github.com/tomsons26/ts-patches/blob/326db731f7226d9e803feab475777c730688634e/tibsun.lds

And there are mingw-w64 toolchains compiled with -fno-ident like: https://github.com/niXman/mingw-builds-binaries/releases

paranormalized-encryptid commented 5 months ago

Thank you for responding @simondeschenes. I'm sorry I took so long to reply -- I wanted to test building MingGW with some of the patches mentioned in this issue before responding so I could report back and help others looking for a fix, but unfortunately hot weather, antivirus stupidity, multiple crashes and someone cutting a hedge trimmer cord and shorting the power right when I was in the middle of building has meant that that hasn't happened yet.

It is almost correct.

There is no fix coming here because it there was one, it would already have been done. I mean, adding -fno-ident to the options when building GCC isn't something that takes a decade.

The GNU Binutils project considers using -fno-ident to be the correct solution as all sections in the PE COFF format except for the rdata ones will be loaded in memory. There really is no valid space in the executable format for an ident string.

I'm aware, but -fno-ident does not prevent libraries compiled with MinGW from filling up with the .ident spam -- especially when they produce a lot of object files like ffmpeg -- which is still a lingering issue. It certainly didn't seem to stop it when I used one of niXman's builds, as helpful as they otherwise are.

Using strip to remove the section hasn't been sucessful for me, so do 't hope it will work.

However, a custom linker script seems to work fine (-T yourscript.lds). This is what they do on some projects like: https://github.com/tomsons26/ts-patches/blob/326db731f7226d9e803feab475777c730688634e/tibsun.lds

And there are mingw-w64 toolchains compiled with -fno-ident like: https://github.com/niXman/mingw-builds-binaries/releases

Thanks so much for this, this is exactly the kind of direction I needed. I wasn't sure if discarding the .rdata section entirely might cripple the object file, so that's very promising! If all else should fail, do the .ident strings also build up with Clang? My understanding is that Clang doesn't use GNU binutils, but I'm wondering if it suffers a different version of the problem thanks to the awkwardness of the COFF format.

simondeschenes commented 5 months ago

Clang doesn't have this problem because even if it accepts #ident, the only thing it does with the string is throw it away. It supports #ident for compatibility with GCC, but it doesn't keep it in the compiled binary.

Edit: That clang behavior has even been the source of some problems for systems that depended on the ident string being in the compiled object: https://stackoverflow.com/questions/71315990/clang-compiler-and-ident-preprocessor-directive/71316153#71316153

paranormalized-encryptid commented 5 months ago

Thank you, that's very helpful. It sounds like the string doesn't end up in the binary or the object code, which is useful to know since I need to supply both for licensing purposes. Interesting that it causes problems for some people. If nothing else, at least we can choose between compilers. It's a shame the COFF format can't handle .ident strings like how ELF can, but we can only work with what we're given, I guess.