iains / gcc-11-branch

GCC 11 for Darwin with experimental Arm64 support. Current release 11.5-darwin-r0 [July 2024]
GNU General Public License v2.0

Issue using gcc-11-branch compiling HDF5 1.10.8 #3

Closed mathomp4 closed 2 years ago

mathomp4 commented 2 years ago

Thanks to @iains and #2 being solved, I continued on and built MPI and then tried to build my set of base libraries. In doing so, my build of HDF5 1.10.8 (yes, I know, old...) failed with:

make[4]: Entering directory '/Users/mathomp4/Baselibs/ESMA-Baselibs-7.1.0/src/hdf5/hl/test'
  CCLD     test_lite
ld: warning: option -s is obsolete and being ignored
ld: in ../../hl/src/.libs/libhdf5_hl.a(H5LTparse.o), in section __TEXT,__text reloc 347: symbol index out of range for architecture arm64
collect2: error: ld returned 1 exit status
make[4]: *** [Makefile:983: test_lite] Error 1

The thing is, the Homebrew M1 GCC 11.2 from @fxcoudert I built before (the one that had issues) was happy to build this HDF5, so this must be something new. The error is a bit beyond my ken. :)

iains commented 2 years ago

there's not enough information for me to be able to debug this :)

The error message has the smell of a malformed object file: __TEXT,__text reloc 347: symbol index out of range for architecture. I would think that symbol indices are set by the assembler; we do not emit those from the compiler [of course, we do emit the symbols, and therefore the number and ordering could be relevant]. Also, I might be misinterpreting the error message ...

initial questions;

I might need the actual objects, or the .i files that were used to generate them - so isolating the command lines that gave rise to the objects complained about is worth doing (those can then be manually re-issued with -save-temps -v to obtain the intermediate outputs).

iains commented 2 years ago

I cannot cross-compile HDF5: https://portal.hdfgroup.org/pages/viewpage.action?pageId=48808266 so I am not going to be able to reproduce this locally.

mathomp4 commented 2 years ago

@iains I'll try to get to this tomorrow. For some reason the VPN on my M1 Macbook has decided not to work. Joy. Tomorrow I'll be at work so the VPN can be avoided. Ahh...fun with new chip architectures!

mathomp4 commented 2 years ago

@iains Might be next week now. My M1 MacBook is not liking something extra NASA put on it. It...doesn't want to see the internet. I am a bit baffled and I'll need to find a sysadmin to figure it out.

Fun with new chip architectures!

fxcoudert commented 2 years ago

I can only confirm that I can't reproduce that with the latest hdf5 (version 1.12.2). It builds fine and all tests pass.

iains commented 2 years ago

OK. To be honest, there's not too much rush from my side (tomorrow is 9.5 RC and then around a week until 10.4 RC), so a lot to do there ... plus last night there was a cloud-to-cloud lightning arc right over my house (a very loud bang, coincident with the flash) which took out my internet (and cost most of today getting back online) .. I guess the potentials on the overhead wires were too much for the modem :(.

good to know latest is OK at least (that might help narrow things down).

mathomp4 commented 2 years ago

I can only confirm that I can't reproduce that with the latest hdf5 (version 1.12.2). It builds fine and all tests pass.

@fxcoudert Interesting! If I can somehow get internet enough to pull that tag/version of HDF5, I can test that out. I have been looking for a good excuse to move our code to HDF5 1.12, and this might be it. (We tend to be conservative with some libraries since operational code has a good reason to be conservative!)

mathomp4 commented 2 years ago

Well, I was able to get enough Internet to pull HDF5 1.12.2 and I still get:

Making install in test
make[4]: Entering directory '/Users/mathomp4/Baselibs/ESMA-Baselibs-7.1.0/src/hdf5/hl/test'
  CCLD     test_lite
ld: warning: option -s is obsolete and being ignored
ld: in ../../hl/src/.libs/libhdf5_hl.a(H5LTparse.o), in section __TEXT,__text reloc 347: symbol index out of range for architecture arm64
collect2: error: ld returned 1 exit status
make[4]: *** [Makefile:988: test_lite] Error 1
make[4]: Leaving directory '/Users/mathomp4/Baselibs/ESMA-Baselibs-7.1.0/src/hdf5/hl/test'

I get this error with both a serial (CC=gcc) and parallel (CC=mpicc) build of HDF5.

Per a request of @iains, here is the verbose make output for the serial case:

make[2]: Entering directory '/Users/mathomp4/Baselibs/ESMA-Baselibs-7.1.0/src/hdf5/hl/test'
/bin/sh ../../libtool  --tag=CC   --mode=link /Users/mathomp4/installed/Core/gcc-gfortran/11.3.0/bin/gcc -std=c99  -Wall -Wcast-qual -Wconversion -Wextra -Wfloat-equal -Wformat=2 -Winit-self -Winvalid-pch -Wmissing-include-dirs -Wshadow -Wundef -Wwrite-strings -pedantic -Wno-c++-compat -Wlarger-than=2560 -Wlogical-op -Wframe-larger-than=16384 -Wpacked-bitfield-compat -Wsync-nand -Wstrict-overflow=5 -Wno-unsuffixed-float-constants -Wdouble-promotion -Wtrampolines -Wstack-usage=8192 -Wmaybe-uninitialized -Wdate-time -Warray-bounds=2 -Wc99-c11-compat -Wduplicated-cond -Whsa -Wnormalized -Wnull-dereference -Wunused-const-variable -Walloca -Walloc-zero -Wduplicated-branches -Wformat-overflow=2 -Wformat-truncation=1 -Wrestrict -Wattribute-alias -Wcast-align=strict -Wshift-overflow=2 -Wattribute-alias=2 -Wmissing-profile -Wc11-c2x-compat -fstdarg-opt -fdiagnostics-urls=never -fno-diagnostics-color -s  -Wbad-function-cast -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpacked -Wpointer-sign -Wpointer-to-int-cast -Wint-to-pointer-cast -Wredundant-decls -Wstrict-prototypes -Wswitch -Wunused-function -Wunused-variable -Wunused-parameter -Wcast-align -Wunused-but-set-variable -Wformat -Wincompatible-pointer-types -Wint-conversion -Wshadow -Wcast-function-type -Wmaybe-uninitialized -Wno-aggregate-return -Wno-inline -Wno-missing-format-attribute -Wno-missing-noreturn -Wno-overlength-strings -Wno-jump-misses-init -Wno-suggest-attribute=const -Wno-suggest-attribute=noreturn -Wno-suggest-attribute=pure -Wno-suggest-attribute=format -Wno-suggest-attribute=cold -Wno-suggest-attribute=malloc -O3  -L/Users/mathomp4/installed/MPI/gcc-gfortran-11.3.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib -L/Users/mathomp4/installed/MPI/gcc-gfortran-11.3.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib  -lm -o test_lite test_lite.o ../../hl/src/libhdf5_hl.la ../../test/libh5test.la ../../src/libhdf5.la -lsz -lz -ldl -lm
libtool: link: /Users/mathomp4/installed/Core/gcc-gfortran/11.3.0/bin/gcc -std=c99 -Wall -Wcast-qual -Wconversion -Wextra -Wfloat-equal -Wformat=2 -Winit-self -Winvalid-pch -Wmissing-include-dirs -Wshadow -Wundef -Wwrite-strings -pedantic -Wno-c++-compat -Wlarger-than=2560 -Wlogical-op -Wframe-larger-than=16384 -Wpacked-bitfield-compat -Wsync-nand -Wstrict-overflow=5 -Wno-unsuffixed-float-constants -Wdouble-promotion -Wtrampolines -Wstack-usage=8192 -Wmaybe-uninitialized -Wdate-time -Warray-bounds=2 -Wc99-c11-compat -Wduplicated-cond -Whsa -Wnormalized -Wnull-dereference -Wunused-const-variable -Walloca -Walloc-zero -Wduplicated-branches -Wformat-overflow=2 -Wformat-truncation=1 -Wrestrict -Wattribute-alias -Wcast-align=strict -Wshift-overflow=2 -Wattribute-alias=2 -Wmissing-profile -Wc11-c2x-compat -fstdarg-opt -fdiagnostics-urls=never -fno-diagnostics-color -s -Wbad-function-cast -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpacked -Wpointer-sign -Wpointer-to-int-cast -Wint-to-pointer-cast -Wredundant-decls -Wstrict-prototypes -Wswitch -Wunused-function -Wunused-variable -Wunused-parameter -Wcast-align -Wunused-but-set-variable -Wformat -Wincompatible-pointer-types -Wint-conversion -Wshadow -Wcast-function-type -Wmaybe-uninitialized -Wno-aggregate-return -Wno-inline -Wno-missing-format-attribute -Wno-missing-noreturn -Wno-overlength-strings -Wno-jump-misses-init -Wno-suggest-attribute=const -Wno-suggest-attribute=noreturn -Wno-suggest-attribute=pure -Wno-suggest-attribute=format -Wno-suggest-attribute=cold -Wno-suggest-attribute=malloc -O3 -o test_lite test_lite.o  -L/Users/mathomp4/installed/MPI/gcc-gfortran-11.3.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib ../../hl/src/.libs/libhdf5_hl.a /Users/mathomp4/Baselibs/ESMA-Baselibs-7.1.0/src/hdf5/src/.libs/libhdf5.a ../../test/.libs/libh5test.a ../../src/.libs/libhdf5.a 
/Users/mathomp4/installed/MPI/gcc-gfortran-11.3.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib/libsz.a -lz -ldl -lm
ld: warning: option -s is obsolete and being ignored
ld: in ../../hl/src/.libs/libhdf5_hl.a(H5LTparse.o), in section __TEXT,__text reloc 347: symbol index out of range for architecture arm64
collect2: error: ld returned 1 exit status

I can try to do more if you can tell me what you'd like.

I wonder if perhaps @fxcoudert and myself built GCC differently? I built using:

❯ clang --version
Apple clang version 13.1.6 (clang-1316.0.21.2.3)
Target: arm64-apple-darwin21.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

and my configure string was:

../gcc-11-branch-gcc-11.3-darwin-r0/configure \
   --prefix=$HOME/installed/Core/gcc-gfortran/11.3.0 \
   --enable-languages=c,c++,fortran \
   --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk |& tee configure.log

But then this is the exact same configure string and clang version I built my failed 11.2 version with...and that built HDF5 just fine! Hmm...

iains commented 2 years ago

Per a request of @iains, here is the verbose make output for the serial case:

<snip>
ld: warning: option -s is obsolete and being ignored
ld: in ../../hl/src/.libs/libhdf5_hl.a(H5LTparse.o), in section __TEXT,__text reloc 347: symbol index out of range for architecture arm64
collect2: error: ld returned 1 exit status

I can try to do more if you can tell me what you'd like.

It is complaining about H5LTparse.o so I'd like to see the compile line that builds that object. Then, taking that compile line, append -save-temps -v and post the .i file, please?
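A minimal sketch of that request, using a stand-in source file rather than the real H5LTparse.c compile line. The file name demo.c, the scratch directory and the use of ${CC:-cc} are assumptions for illustration only. Appending -save-temps keeps the preprocessed .i and the assembly .s next to the object, and -v makes the driver print the sub-commands it runs:

```shell
# Work in a scratch directory so the intermediate files are easy to find.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the real source; in practice, reuse the exact compile line
# that make printed for H5LTparse.c and only append the two extra options.
printf 'int demo_value(void) { return 42; }\n' > demo.c

compiler=${CC:-cc}
if command -v "$compiler" >/dev/null 2>&1; then
  # -save-temps keeps demo.i (preprocessed) and demo.s (assembly);
  # -v logs the compiler and assembler invocations to stderr.
  "$compiler" -O3 -c demo.c -save-temps -v 2> demo.v.log
  ls demo.i demo.s demo.o
else
  echo "no C compiler found; nothing to demonstrate"
fi
```

The .i file is the one to attach, since it can be recompiled elsewhere without the original headers and include paths.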

mathomp4 commented 2 years ago

@iains Okay. I think I have it. I'm attaching H5LTparse.i as well as the make log that made it. (Well, I'm attaching H5LTparse.i.txt so that GitHub is happy with it...)

H5LTparse.i.txt make_h5ltparse.log

mathomp4 commented 2 years ago

Oh, and here is the make log but with V=1 make_h5ltparse_V1.log

iains commented 2 years ago

thanks, I wonder if the difference (in experience) is to do with which compiler is used to build 'C' files. @fxcoudert in your build is this file built with GCC or clang?

I have not looked at the .i yet .. it might have to wait until we get 9.5RC out of the way ;)

fxcoudert commented 2 years ago

Homebrew uses clang as C compiler. I'm sorry but I'm also out of time to run more tests, I am battling against a bug in GCC 11 & 12 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105664) and several bugs in Julia build system 😢

iains commented 2 years ago

@iains Okay. I think I have it. I'm attaching H5LTparse.i as well as the make log that made it. (Well, I'm attaching H5LTparse.i.txt so that GitHub is happy with it...)

H5LTparse.i.txt make_h5ltparse.log

Thanks.

I do see one difference here, in that clang defaults to '-fcommon' where GCC now defaults to -fno-common; this means that clang will indirect accesses to some variables via the GOT and GCC will not.

(the transition to no-common as the default has resulted in an amount of fallout - but it is the right technical solution).
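To see the -fcommon / -fno-common difference on a small case, the sketch below compiles one uninitialised global both ways and asks nm how the symbol is classified. The file and variable names are made up for illustration, and ${CC:-cc} stands in for whichever compiler is being compared. With -fcommon the variable is emitted as a common symbol (type C in nm output); with -fno-common it becomes an ordinary definition in a data or bss section.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tentative definition: an uninitialised global with no 'extern'.
printf 'int demo_counter;\n' > common_demo.c

compiler=${CC:-cc}

symbol_type() {
  # Print the nm type letter for demo_counter in the given object.
  # The name is matched as a suffix because Darwin prefixes it with '_'.
  nm "$1" | awk '$NF ~ /demo_counter$/ { print $(NF-1) }'
}

if command -v "$compiler" >/dev/null 2>&1 && command -v nm >/dev/null 2>&1; then
  "$compiler" -fcommon    -c common_demo.c -o with_common.o
  "$compiler" -fno-common -c common_demo.c -o without_common.o
  common_type=$(symbol_type with_common.o)
  no_common_type=$(symbol_type without_common.o)
  echo "-fcommon:    $common_type"      # a common symbol is shown as C
  echo "-fno-common: $no_common_type"   # a defined data/bss symbol instead
fi
```

Running the same check on an object from the real build is one way to confirm which default a given compiler applied.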

iains commented 2 years ago

Other things that might 'work' could be to reduce the optimisation to -O2 (less loop unrolling) and/or try -Os (smaller object). These are band-aid suggestions (trying to figure out if there's actually a real code-gen problem or just a difference in default behaviour).

mathomp4 commented 2 years ago

@iains Okay. I think I have it. I'm attaching H5LTparse.i as well as the make log that made it. (Well, I'm attaching H5LTparse.i.txt so that GitHub is happy with it...) H5LTparse.i.txt make_h5ltparse.log

Thanks.

I do see one difference here, in that clang defaults to '-fcommon' where GCC now defaults to -fno-common; this means that clang will indirect accesses to some variables via the GOT and GCC will not.

  • If the code is part of a shared library, then clang should be using "-fno-common" for the compiles (common is not normally allowed in shared libs).
  • I suppose that if the code is part of an exe - you could try adding '-fcommon' to the GCC command ...

(the transition to no-common as the default has resulted in an amount of fallout - but it is the right technical solution).

@iains I'm not sure how @fxcoudert built HDF5, but at the moment I'm building as a static library (mainly because we have always built as static and...well...don't mess with working builds). My guess is @fxcoudert is probably building as a shared library because, well, that's how most people would nowadays.

When I'm back at work, I'll try building HDF5 as a shared library on the M1 and see if it's happier (although, again, HDF5-as-static did work with the bad 11.2 so... 🤷🏼 )

iains commented 2 years ago

-O3 -o test_lite test_lite.o -L/..../lib ../../hl/src/.libs/libhdf5_hl.a

So the construction of the static library is not failing, but linking against that library in what is presumably an internal test (for the package) .. is.

I wonder if we can find out which symbol is failing; see if we get extra info if the link line is re-issued with -Wl,-debug -Wl,-why_load

iains commented 2 years ago

as for why 11.2 worked - the code generated for aarch64 will surely have changed between 11.2 and 11.3 ... if it has grown perhaps something has now become out of range.

If you still have the 11.2 install around, and you could generate the parser .i (and the .s since I will not be able to replicate your setup exactly) .. I could perhaps try to identify what changed (and if that is reasonable - e.g. as the result of a fixed bug).

iains commented 2 years ago

at the moment, what is frustrating is that we have an error reported - but the error is not specific enough to be able to decide where the problem might lie.

iains commented 2 years ago

BTW, on the assumption that @fxcoudert built the C files in this library with clang (which defaults to -fcommon) you could also add that option to your build (as noted above) - that could well have a significant effect - since common symbols are indirected through the GOT which will alter the layout quite a bit.

(we should still find out the actual problem and then decide how to fix it - it could of course be a bug in the compiler)

iains commented 2 years ago

two small updates:

1/ I stripped out all the Wxxx to see what was actually left (and therefore what I'd need to have to replicate this)

... = Users/mathomp4/installed
libtool: link: /.../Core/gcc-gfortran/11.3.0/bin/gcc -std=c99 -s -O3 -o test_lite 
test_lite.o  
-L/.../MPI/gcc-gfortran-11.3.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib 
../../hl/src/.libs/libhdf5_hl.a 
/Users/mathomp4/Baselibs/ESMA-Baselibs-7.1.0/src/hdf5/src/.libs/libhdf5.a
../../test/.libs/libh5test.a
../../src/.libs/libhdf5.a 
/.../MPI/gcc-gfortran-11.3.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib/libsz.a -lz -ldl -lm

2/ Output of otool -rV | less -N says that _H5LTyychar is the symbol with the problem - and that IS one that would be referenced via the GOT with a -fcommon build.

    347 00000210 False long   True   PAGOF12 False     _H5LTyychar
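Both updates above can be sketched in plain shell. The helper names strip_w_flags and reloc_symbol are invented for this example, and the sample inputs are shortened copies of lines from this thread. The column layout assumed for the otool -rV | less -N output is simply the one shown above: line number first, symbol name last.

```shell
# 1/ Drop the -W... warning options from a link line, but keep -Wl,...
#    options, because those are passed through to the linker.
strip_w_flags() {
  tr -s ' ' '\n' |
    awk '/^-W/ && !/^-Wl,/ { next }
         NF { printf "%s%s", sep, $0; sep = " " }
         END { print "" }'
}

# 2/ From numbered relocation output, print the symbol (last column) of
#    the row whose first column equals the requested number.
reloc_symbol() {
  awk -v want="$1" '$1 == want { print $NF }'
}

sample_link='gcc -std=c99 -Wall -Wlogical-op -Wlarger-than=2560 -s -O3 -Wl,-why_load -o test_lite test_lite.o'
stripped=$(printf '%s\n' "$sample_link" | strip_w_flags)
echo "$stripped"

sample_reloc='    347 00000210 False long   True   PAGOF12 False     _H5LTyychar'
symbol=$(printf '%s\n' "$sample_reloc" | reloc_symbol 347)
echo "$symbol"
```

Note that -Wlogical-op and -Wlarger-than also begin with -Wl, which is why the filter tests for the comma rather than just the letter.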

fxcoudert commented 2 years ago

We have a report of this same bug at Homebrew: https://github.com/orgs/Homebrew/discussions/3296 I've asked the OP there to provide some more information, hopefully we're going to narrow it down.

iains commented 2 years ago

it would be valuable to be able to compare the .s between the working 11.2 and the not working 11.3 - as noted above it is possible that the size of the object has grown for innocent reasons. Of course, we need to figure out what to do about it either way.

fxcoudert commented 2 years ago

@iains a minimal reproducer was posted at https://github.com/orgs/Homebrew/discussions/3296#discussioncomment-2800314 which is Fortran-only (no clang, no C source). I can confirm it reproduces on my machine as well.

iains commented 2 years ago

unfortunately, that means I cannot compare clang's output with GCC's :( do you have a HB 11.2 build somewhere that works with this - so that at least I could compare the asm?

iains commented 2 years ago

however.. the problem has been spotted - we are trying to apply negative offsets to relocations (which should work, but doesn't because of an assembler/linker bug). Will sort out a patch for this ASAP.

mathomp4 commented 2 years ago

however.. the problem has been spotted - we are trying to apply negative offsets to relocations (which should work, but doesn't because of an assembler/linker bug). Will sort out a patch for this ASAP.

Huzzah!

Also, this update is why I am not a compiler developer. I can just about grok the assembly out of godbolt for my Fortran code, but this? Yeah, you all are magicians. 😄

iains commented 2 years ago

would you be willing to test the candidate patches on your wider code-base? (it's a long story, but as things stand I can only test on a cross-compiler, since I do not currently have access to an M1 box).

mathomp4 commented 2 years ago

would you be willing to test the candidate patches on your wider code-base? (it's a long story, but as things stand I can only test on a cross-compiler, since I do not currently have access to an M1 box).

@iains Sure. I can give it a go.

FYI, my usual process is:

  1. Build/Install compiler (I tend to not do a make check with GCC because GCC's check is...comprehensive, let's say, but I probably should try it on M1!)
  2. Build Open MPI
  3. Make sure "Hello world" works
  4. Build Baselibs
  5. Build GEOSgcm
  6. Run GEOSgcm

So, 1-3 have been working, it's step 4 at the moment where I hit the HDF5 thing.

iains commented 2 years ago

OK so, to test the patch you would need to amend the sources for the compiler (apply the patches) and then start from 1 again - if that's still OK then I'll sort out the patches and put them here (with some instructions).

mathomp4 commented 2 years ago

OK so, to test the patch you would need to amend the sources for the compiler (apply the patches) and then start from 1 again - if that's still OK then I'll sort out the patches and put them here (with some instructions).

I can do that (with instructions) :)

iains commented 2 years ago

OK. So the simplest thing is to clone the following branch(es); they contain patches that fix a couple of issues that have arisen since the initial post.

there's one for GCC 11.3r2 (preview of an update to release 2)

https://github.com/iains/gcc-11-branch/tree/gcc-11-3-darwin-pre-r2

there's one for GCC 12.1r1 (preview of an update to release 1)

https://github.com/iains/gcc-12-branch/tree/gcc-12-1-darwin-pre-r1

So there's no patching to do - just checkout and build the branches - if all is successful, then the updated releases will be identical modulo SHA1 hash values.
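A hedged sketch of "checkout and build" for one of these preview branches. The directory names gcc-src and gcc-build are arbitrary, the helper name gcc_build_commands is invented, and the configure options are copied from the configure string posted earlier in this thread. It also assumes GCC's usual build prerequisites are already in place, as they were for the earlier build. The function only prints the commands so they can be reviewed first; pipe its output to sh to actually run them.

```shell
# Print the commands for a fresh out-of-tree build of one branch.
#   $1 = repository URL, $2 = branch name, $3 = install prefix
gcc_build_commands() {
  repo=$1
  branch=$2
  prefix=$3
  cat <<EOF
git clone --depth 1 --branch $branch $repo gcc-src
mkdir gcc-build
cd gcc-build
../gcc-src/configure --prefix=$prefix --enable-languages=c,c++,fortran --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk
make
make install
EOF
}

# Review the steps for the 11.3 preview branch.
gcc_build_commands \
  https://github.com/iains/gcc-11-branch.git \
  gcc-11-3-darwin-pre-r2 \
  "$HOME/installed/Core/gcc-gfortran/11.3.0"
```

For the 12.1 preview, the same function would be called with the gcc-12-branch repository and the gcc-12-1-darwin-pre-r1 branch, plus a separate prefix so the two installs do not overwrite each other.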

===== @kencu @fxcoudert - the branches also include a couple of fixes for rpath-related things. I've added a configuration option, --with-darwin-add-path=, that allows you to specify a path that the compiler will automatically add to the embedded run paths.

( let's not hijack this thread for any further discussion of this tho ).

edit: no idea why that wants to be super bold and large, my MD-fu is clearly lacking... second edit .. ah, it's the following construct....

=====

mathomp4 commented 2 years ago

@iains I've grabbed the 11.3 branch now and am building. If all goes well up my stack, I'll try to do the same with the 12.1 branch.

mathomp4 commented 2 years ago

Brief update: I've been able to build HDF5! I'm going through the rest of my stack now. But looks promising! 👍🏼

mathomp4 commented 2 years ago

Further update: All of my Baselibs built (so netCDF-C, netCDF-Fortran, ESMF...) and so did GEOS, my climate model.

And even better, GEOS ran. For 6 hours with debug options at stupid low resolution with a lot of stuff turned off because "small laptop", but it worked. I'm doing more tests now with 11.3 to make sure our usual pattern of work will all be okay, but I think you nailed it.

Once I'm done with 11.3 testing, I'll grab your 12.1 branch and do the same with it.

But I think you might have got it, @iains !

PS: Should I be using --with-darwin-add-path= in my builds? Or is this something a bit fancy that I don't need to worry about?

iains commented 2 years ago

Further update: All of my Baselibs built (so netCDF-C, netCDF-Fortran, ESMF...) and so did GEOS, my climate model.

And even better, GEOS ran. For 6 hours with debug options at stupid low resolution with a lot of stuff turned off because "small laptop", but it worked. I'm doing more tests now with 11.3 to make sure our usual pattern of work will all be okay, but I think you nailed it.

excellent!

Once I'm done with 11.3 testing, I'll grab your 12.1 branch and do the same with it.

thanks a lot for the testing!

PS: Should I be using --with-darwin-add-path= in my builds? Or is this something a bit fancy that I don't need to worry about?

That's an option really intended for distributions (e.g. macports, home-brew) - so not needed for stand-alone builds unless you're doing something extra fancy. (with very few exceptions) My usual rule-of-thumb about configure options, is "don't use them unless you know why you want to and what for" ...

mathomp4 commented 2 years ago

That's an option really intended for distributions (e.g. macports, home-brew) - so not needed for stand-alone builds unless you're doing something extra fancy. (with very few exceptions) My usual rule-of-thumb about configure options, is "don't use them unless you know why you want to and what for" ...

Yeah, that's my theory as well. But with macOS sometimes you need a fancy option...and then I go steal it from @fxcoudert's brew recipe 😄

iains commented 2 years ago

fixed with Release 2 of GCC 11.3