owlbarn / owl

Owl - OCaml Scientific Computing @ https://ocaml.xyz
MIT License
1.2k stars 120 forks

`libgfortran.so` not linked and causing compilation error #635

Open jcreinhold opened 1 year ago

jcreinhold commented 1 year ago

Thank you very much for the great work with this package.

I'm trying to build this package on a CentOS 7 machine with gcc v11. I have an OpenBLAS installation compiled from source, with a corresponding shared library whose directory is passed in OWL_CFLAGS. I have changed those flags and provide them below (with some of the directories replaced with ... for space).

However, I'm receiving the following error:

#=== ERROR while compiling owl.1.1 ============================================#
# context              2.1.3 | linux/x86_64 | ocaml-option-flambda.1 ocaml-option-fp.1 ocaml-option-nnp.1 ocaml-variants.4.14.0+options | file:///.../opam-repository#25bba47d
# path                 .../.opam-switch/build/owl.1.1
# command              .../bin/dune build -p owl -j 79
# exit-code            1
# env-file             .../log/owl-16623-12e153.env
# output-file          .../log/owl-16623-12e153.out
### output ###
# [...]
# -> stdout:
# -> stderr:
#  | .../bin/ld: warning: libgfortran.so.5, needed by .../libopenblas.so, not found (try using -rpath or -rpath-link)
#  | .../libopenblas.so: undefined reference to `_gfortran_etime@GFORTRAN_8'
#  | .../libopenblas.so: undefined reference to `_gfortran_concat_string@GFORTRAN_8'
#  | collect2: error: ld returned 1 exit status
# Fatal error: exception Failure("Unable to link against openblas.")
# Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
# Called from Dune__exe__Configure.(fun) in file "src/owl/config/configure.ml", line 223, characters 8-51
# Called from Configurator__V1.main in file "otherlibs/configurator/src/v1.ml", line 734, characters 4-7
# Re-raised at Configurator__V1.main in file "otherlibs/configurator/src/v1.ml", line 742, characters 11-42
# Called from Dune__exe__Configure in file "src/owl/config/configure.ml", line 186, characters 2-1023

Here are the OWL_CFLAGS:

OWL_CFLAGS='-g -O3 -Ofast -mfpmath=sse -funroll-loops -ffast-math -DSFMT_MEXP=19937 -msse2 -fno-strict-aliasing -Wno-tautological-constant-out-of-range-compare -pipe -Wl,-z,relro -gdwarf-4 -gstrict-dwarf -march=core-avx2 -mtune=skylake -nostdinc -fdebug-prefix-map=.../supercaml -isystem/.../include -fPIC -L/.../OpenBLAS/0.3.20/.../lib -Wl,-rpath=/.../OpenBLAS/0.3.20/.../lib -L/h.../libgcc/11.x/.../lib -Wl,-rpath=/.../libgcc/11.x/.../lib -L/.../glibc/2.34/.../lib -Wl,-rpath-link,/.../glibc/2.34/.../lib -Wl,--hash-style,gnu -Wl,--dynamic-linker=/usr/local/.../lib/ld-linux-x86-64.so.2'

FWIW, I know that the directory containing libgfortran.so.5 is passed via the -L and -Wl,-rpath= flags that mention libgcc in my OWL_CFLAGS. I believe I'm using the same gcc and libgfortran that were used to build OpenBLAS.

Admittedly, I'm not an expert at properly linking libraries with gcc; however, it seems that this might be remedied by passing -lgfortran as a library, the same way -lopenblas is passed to gcc.

If you happen to see an obvious problem in this setup, I'd really appreciate a pointer. Otherwise, perhaps it'd be worth considering another environment variable that adds libraries to the gcc call the same way -lopenblas is added. (I tried adding -lgfortran as the final flag in OWL_CFLAGS, but it didn't solve the error; however, I believe that might just be because of its position in the gcc call.)

I'd try to change this myself and see if it worked, but the build system I'm working on is strange and I can't easily do so. I also think there's a reasonable chance that I'm doing something obviously wrong with the above setup.

FYI, we also have a static library for OpenBLAS, and I've tried linking to that instead of the shared library (using -Bstatic on the relevant -Wl flag in OWL_CFLAGS). However, that resulted in conf-openblas (whose opam file I modified to use $OWL_CFLAGS instead of $CFLAGS in the build section) failing with different errors: conf-openblas would still try to use the shared library object, but the failure would be related to another library.

jcreinhold commented 1 year ago

I suspect this issue is related to the position of the flags in the call to gcc, the lack of -lgfortran as a lib, and/or the inability to use -l:libopenblas.a instead of -lopenblas. Perhaps we could add an OWL_LDLIBS environment variable which would default to -lopenblas, or perhaps -lm -lopenblas?
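To make the proposal concrete, here is a rough shell sketch of the ordering such a variable could give. Note the assumptions: `OWL_LDLIBS` does not exist in Owl today, and `build_link_cmd` is a made-up helper that only prints a command line; the real change would have to live in Owl's OCaml configure script. The point is that libraries go after the source file, which static archives require and which can also matter for shared libraries when the linker runs with --as-needed.

```shell
# Hypothetical helper (not part of Owl): print a link command in which
# the libraries come last, after the flags and the source file.
build_link_cmd() {
  src=$1                              # C source to compile
  out=$2                              # output binary name
  cflags=${OWL_CFLAGS:-}              # existing variable, may be empty
  ldlibs=${OWL_LDLIBS:--lopenblas}    # proposed variable, defaults to -lopenblas
  echo "${CC:-gcc} $cflags $src -o $out $ldlibs"
}

# Example: ask for the Fortran runtime after OpenBLAS.
OWL_CFLAGS='-O2'
OWL_LDLIBS='-lopenblas -lgfortran'
build_link_cmd test.c test
```

With these settings the printed command ends in `-lopenblas -lgfortran`, so the Fortran runtime follows the library that needs it.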

I can fork this repo and try it out, unless you see something obviously wrong with the previous implementation.

jcreinhold commented 1 year ago

For completeness, here is my PKG_CONFIG_PATH. Note that OpenBLAS is present.

PKG_CONFIG_PATH=/.../pcre/8.43/.../lib/pkgconfig:/.../libev/.../lib/pkgconfig:/.../OpenBLAS/0.3.20/.../lib/pkgconfig:/.../gmp/6.1.2/.../lib/pkgconfig:/.../libjpeg/.../lib/pkgconfig:/.../libpng/1.6.37/.../lib/pkgconfig:/.../re2/20190601/.../lib/pkgconfig:/.../sqlite/3.36/.../lib/pkgconfig:/.../zlib/1.2.8/.../lib/pkgconfig:/.../readline/8.0/.../lib/pkgconfig:/.../ncurses/6.1/.../lib/pkgconfig

If I run pkg-config in the build environment, I get the following.

+ pkg-config --cflags openblas
-I/.../OpenBLAS/0.3.20/.../include  
+ pkg-config --libs openblas
-L/.../OpenBLAS/0.3.20/.../lib -lopenblas 

So I believe OpenBLAS and pkg-config are interacting appropriately.
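One further check that may be useful here: `pkg-config --libs --static` also prints the `Libs.private` entries of a `.pc` file, which is where a dependency such as `-lgfortran` would show up if the OpenBLAS build recorded it. The sketch below is only a diagnostic; `has_lib_flag` is a hypothetical helper, and whether this particular OpenBLAS install lists gfortran there is an assumption, not something shown above.

```shell
# Hypothetical helper: succeed if a given -l flag appears as a whole word
# in a string of linker flags.
has_lib_flag() {
  # $1 = flag to look for (e.g. -lgfortran), $2 = flags string
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Inspect the static link line, but only if pkg-config can find openblas.
if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists openblas; then
  static_libs=$(pkg-config --libs --static openblas)
  if has_lib_flag -lgfortran "$static_libs"; then
    echo "openblas.pc lists -lgfortran as a private dependency"
  else
    echo "openblas.pc does not mention -lgfortran"
  fi
fi
```

If `-lgfortran` only appears in the `--static` output, the plain `--libs` line used for a shared link will not pull it in, which would be consistent with the linker having to locate libgfortran.so.5 on its own.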

jcreinhold commented 1 year ago

FWIW, I removed OpenBLAS from PKG_CONFIG_PATH to see if it was interfering with the other OWL_CFLAGS; it failed with the same error.

jcreinhold commented 1 year ago

It's probably worthwhile to add an OWL_LDFLAGS too, so we don't have to jam everything into something named LDLIBS. Perhaps just have OWL_ versions of CPP_FLAGS, LDFLAGS, and LDLIBS, each of which would be placed in the right location.

jcreinhold commented 1 year ago

Actually, the error might just be coming from this test, where OWL_CFLAGS are ignored. Perhaps the test should be moved after the cflags are set? Presumably we want to run the test with the same parameters that will be used for the build.

jcreinhold commented 1 year ago

I'm thinking this error might be coming up because the build script has a set -e in it. I believe C.c_test should just report whether the compilation succeeded or failed, not fail itself (right?). Perhaps the set -e is killing the process.
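For reference, a quick sketch of how `set -e` treats failures, independent of Owl: a failing command used as an `if` condition does not abort the script, while a bare failing command does. The helper name `run_with_errexit` is purely illustrative.

```shell
# Run a snippet in a child shell with `set -e` and print its exit status.
run_with_errexit() {
  sh -c "set -e; $1" >/dev/null 2>&1 && echo 0 || echo $?
}

# Failure inside an `if` condition: tolerated, the script reaches `exit 0`.
guarded=$(run_with_errexit 'if false; then :; fi; exit 0')

# Bare failing command: the shell exits right away with that status.
bare=$(run_with_errexit 'false; exit 0')

echo "guarded=$guarded bare=$bare"
```

So a check that merely returns a boolean would survive `set -e` as long as the caller tests its result, whereas an uncaught failure ends the build.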

jcreinhold commented 1 year ago

Disregard that; it's failing from this test which kills the build.

jcreinhold commented 1 year ago

This is solved with a patch akin to #636