Closed mkoeppe closed 3 years ago
I'm looking at the Sage build on my Windows machine, and it seems I haven't updated it in some time (I think not since 9.3b2). I already have r-3.6.3 built successfully. But since I last updated it looks like openblas has been updated, so right now openblas is recompiling, then we'll see about r...
As of now, this issue is causing the CI tests of Cygwin to fail so we cannot even see other issues that affect Cygwin.
So let's please get this in to the next beta --- it is trivial to revert in a follow-up ticket if you succeed to fix the issue.
Let me at least see if I can reproduce it locally first... I wonder if there's some way we could make some packages optional on a platform-specific basis (maybe when running bootstrap?)
Replying to @embray:
I wonder if there's some way we could make some packages optional on a platform-specific basis
Yes, with #30383 (needs review!), which adds infrastructure for disabling standard packages.
Replying to @embray:
Let me at least see if I can reproduce it locally first... I wonder if there's some way we could make some packages optional on a platform-specific basis (maybe when running bootstrap?)
Aaaaarghhh...
This would be a documentation nightmare (explaining why a ticket is "optionally optional" is awkward at the very minimum...). It would also "froze" the idea of Windows 10 being a "second-class citizen" as far as Sage is concerned.
I'm starting to consider promoting a WSL2 port as the preffered windows platform,and preparing a deprecation of the Cygwin port, which remains necessary as long as Windows versions <10 remain relevant only as rthe "sanest" way out of this predicament ("sane" being an hyperbole, of course...).
But this also should be discussed on sage-devel
. Hence this post...
online statistics says that among Windows installations, version 10 is approaching 80%. High time to think about WSL2 as a primary platform.
Moreover, the 2nd biggest, qua # of installs, is Windows version 7, on 16%, which is past its EOL, so it's either on very old machines which won't really be capable of running resouce-hungry Sage well, anyway, or on machines having some old, unupgradeable, soft installed. Windows 8 (EOLed since 2016) is at 1% and 8.1, still supported, on 3.7%. https://gs.statcounter.com/windows-version-market-share/desktop/worldwide/
I think we should really put more effort in WSL2, and much less effort in Cygwin.
WSL2 is an installation nightmare for typical users. But this is off-topic.
I'm not even so sure the problem here has to do much with Cygwin. I haven't been able to reproduce it on my local Cygwin install.
Is this build also installing the system R? I have reported problems before with installing the system R resulting in problems linking to the right BLAS.
WSL will just give you a new and different set of problems.
Looking again at the build log where R failed, it looks like it is installing the system openblas and the system R. I wonder if that could somehow be part of the problem.
Have you tried modifying the build script to not install them and see if it works? I know we also want to try getting it working with the system packages, but since they don't work yet anyways maybe we can forego their installation for now.
In the meantime I'll try a fresh cygwin install and see if I can reproduce the environment being used in the CI.
Replying to @embray:
In the meantime I'll try a fresh cygwin install and see if I can reproduce the environment being used in the CI.
Easy to do by using tox -e local-cygwin-choco-standard
in a minimal install of Cygwin + tox
Let's please not continue discussing WSL etc. here on this technical ticket. I have opened #31485 (Meta-ticket: Sage on WSL (Windows Subsystem for Linux)) for it. It would be helpful if someone could organize the discussion and the existing tickets in this direction there.
Replying to @mkoeppe:
Replying to @embray:
In the meantime I'll try a fresh cygwin install and see if I can reproduce the environment being used in the CI.
Easy to do by using
tox -e local-cygwin-choco-standard
in a minimal install of Cygwin + tox
Cool, thanks for the pointer. I don't have chocolatey installed, but I did the equivalent using apt-cyg and the (very nice) sage-get-system-packages
command.
Of course if there's something very specific to the way chocolatey is installing the packages then that's a problem and I'll have to dig deeper. Seems unlikely?
Fresh Cygwin install. I installed all the :standard:
packages recommended by sage-get-system-packages
.
Ran ./bootstrap
then ./configure
with no additional arguments. Then ran make r
.
First it builds and installs OpenBLAS. Several hours later it builds and installs R with no problem.
I am downloading the build artifacts from the GitHub build to see if there are any more detailed logs about the R build there...
Here's one thing that stands out already. On GitHub:
checking for dgemm_ in -lopenblas ... yes
checking whether double complex BLAS can be used... no
On my local build it has:
checking for dgemm_ in -lopenblas ... yes
checking whether double complex BLAS can be used... yes
checking whether the BLAS is complete... yes
The GitHub build says no support for "double complex BLAS" and does not even mention the third check "BLAS is complete"
So the end result is on GitHub R is not linking with openblas:
External libraries: readline, curl
Additional capabilities: PNG, JPEG, TIFF, NLS, ICU
Options enabled: shared R library, shared BLAS, R profiling
Capabilities skipped: cairo
Options not enabled: memory profiling
Whereas on my local build it has:
External libraries: readline, BLAS(OpenBLAS), LAPACK(generic), curl
Additional capabilities: PNG, JPEG, TIFF, NLS, ICU
Options enabled: shared R library, R profiling
Capabilities skipped: cairo
Options not enabled: shared BLAS, memory profiling
And the build error is occurring somewhere related to building R's baseline BLAS which AFAIK I've never tested on Cygwin since it always successfully links with OpenBLAS.
Now the question is just why is the double complex check failing?
Replying to @embray:
Of course if there's something very specific to the way chocolatey is installing the packages then that's a problem and I'll have to dig deeper. Seems unlikely?
Unlikely. Chocolatey really just calls cygwin's setup_....exe
many times.
The openblas logs on GitHub do show successful build an installation, though the make check
fails with a bunch of things like:
gfortran -O2 -Wall -frecursive -fno-optimize-sibling-calls -m64 -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mfma -march=skylake-avx512 -fno-asynchronous-unwind-tables -mavx2 -L/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib -Wl,-rpath,/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib -o sblat1 sblat1.o ../cygopenblas_skylakexp-r0.3.13.a -lgfortran -lgfortran -L/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib/../lib -L/usr/lib/gcc/x86_64-pc-cygwin/10 -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/lib/../lib -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/lib -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../.. -lcygwin
/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/bin/ld: sblat1.o:sblat1.f:(.text+0x67d): undefined reference to `srot_'
sblat1.o:sblat1.f:(.text+0x67d): relocation truncated to fit: R_X86_64_PC32 against undefined symbol `srot_'
/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/bin/ld: sblat1.o:sblat1.f:(.text+0xbae): undefined reference to `srot_'
sblat1.o:sblat1.f:(.text+0xbae): relocation truncated to fit: R_X86_64_PC32 against undefined symbol `srot_'
/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/bin/ld: sblat1.o:sblat1.f:(.text+0xe99): undefined reference to `sdsdot_'
sblat1.o:sblat1.f:(.text+0xe99): relocation truncated to fit: R_X86_64_PC32 against undefined symbol `sdsdot_'
/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/bin/ld: sblat1.o:sblat1.f:(.text+0xfde): undefined reference to `sdot_'
...
But this looks more like a problem in the make check. Why is it trying to link to some ../cygopenblas_skylakexp-r0.3.13.a
when the build produced cygopenblas_haswellp-r0.3.13.a
not skylake?
Probably not related though...
It looks like it produced cygopenblas.dll
just fine:
gcc -O2 -g -march=native -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DNO_AVX512 -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=2 -DMAX_PARALLEL_NUMBER=1 -DUSE_TLS -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.13\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mfma -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I. -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -DNO_AVX512 -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=2 -DMAX_PARALLEL_NUMBER=1 -DUSE_TLS -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.13\" -msse3 -mssse3 -msse4.1 -mavx -mavx2 -mfma -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I.. -L/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib -Wl,-rpath,/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib cygopenblas.def dllinit.o \
-shared -o ../cygopenblas.dll -Wl,--out-implib,../libopenblas.dll.a \
-Wl,--whole-archive ../cygopenblas_haswellp-r0.3.13.a -Wl,--no-whole-archive -L/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib/../lib -L/usr/lib/gcc/x86_64-pc-cygwin/10 -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/lib/../lib -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../lib -L/lib/../lib -L/usr/lib/../lib -L/opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/lib -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../../../x86_64-pc-cygwin/lib -L/usr/lib/gcc/x86_64-pc-cygwin/10/../../.. -lgfortran -lquadmath -lm -lcygwin -lgfortran -lgfortran
Here is the relevant part in the R configure that checks for double complex support in the BLAS lib: https://github.com/wch/r-source/blob/78f3b069d6269ddde40d76515a1eef67df674435/m4/R.m4#L2642
There is a note in there:
dnl Now check if zdotu works (fails on AMD64 with the wrong compiler;
dnl also fails on macOS with Accelerate/vecLib and gfortran;
dnl but in that case we have a work-around using USE_VECLIB_G95FIX)
I don't know what they mean by that. What is "the wrong compiler" in this case?
Perhaps we could just toss in a patch to disable this check and see what happens?
I meant to add, I downloaded the build artifacts from GH and confirmed that the cygopenblas.dll
is in the correct place and also exports the relevant symbols.
Also in the previous check it did successfully link against -lopenblas
and find the dgemm_
symbol, so it's probably not a linking issue, but some problem with compiling and running the test for double complex support which is apparently known to fail sometimes?
Replying to @embray:
But this looks more like a problem in the make check. Why is it trying to link to some
../cygopenblas_skylakexp-r0.3.13.a
when the build producedcygopenblas_haswellp-r0.3.13.a
not skylake?Probably not related though...
Looks like we forgot to update openblas/spkg-check.in
-- it duplicates an old copy of settings made in openblas/spkg-install.in
!!
This should really go through a common script
I tried manually building the conftest.f
from R's configure check and got this warning:
$ gfortran -c conftest.f
conftest.f:15:72:
15 | 10 ztemp = ztemp + zx(i)*zx(i)
| 1
Warning: Fortran 2018 deleted feature: DO termination statement which is not END DO or CONTINUE with label 10 at (1)
And in fact this was fixed upstream just a few months ago: https://github.com/wch/r-source/commit/b2e1bf7ee4a42288b56d77080ad942395b2fe4b0#diff-519bfce994ea1571b861581c2c200f396eb96ad94ba9bc74f20103dc1bda38dc
I applied the upstream fix and the warning went away. Maybe somewhere -Werror
is getting set in FFLAGS
.
I figured out that it's possible to download the failed R build artifacts from GH (very nice, thank you). Unfortunately the config.log
does not say anything very helpful here:
configure:38593: checking whether double complex BLAS can be used
conftestf.f:15:72:
15 | 10 ztemp = ztemp + zx(i)*zx(i)
| 1
Warning: Fortran 2018 deleted feature: DO termination statement which is not END DO or CONTINUE with label 10 at (1)
configure:38674: result: no
Something the way the test is written it's not producing useful output. It looks like the fortran compilation succeeds, with the warning, but only as a warning. Probably the C compilation (or linking?) fails. But it's not producing any output to the log. I still can't reproduce a failure in this locally.
Replying to @mkoeppe:
Replying to @embray:
But this looks more like a problem in the make check. Why is it trying to link to some
../cygopenblas_skylakexp-r0.3.13.a
when the build producedcygopenblas_haswellp-r0.3.13.a
not skylake?Probably not related though...
Looks like we forgot to update
openblas/spkg-check.in
-- it duplicates an old copy of settings made inopenblas/spkg-install.in
!! This should really go through a common script
I rebuilt openblas locally with SAGE_CHECK=yes
and the checks passed.
Replying to @embray:
configure:38593: checking whether double complex BLAS can be used conftestf.f:15:72: 15 | 10 ztemp = ztemp + zx(i)*zx(i) | 1 Warning: Fortran 2018 deleted feature: DO termination statement which is not END DO or CONTINUE with label 10 at (1) configure:38674: result: no
Something the way the test is written it's not producing useful output. It looks like the fortran compilation succeeds, with the warning, but only as a warning. Probably the C compilation (or linking?) fails. But it's not producing any output to the log. I
Probably should run this with CONFIG_SHELL="bash -x"
Looking at the test code, I think it's likely the run test in https://github.com/wch/r-source/blob/78f3b069d6269ddde40d76515a1eef67df674435/m4/R.m4#L2704 that's failing.
Replying to @mkoeppe:
Replying to @embray:
But this looks more like a problem in the make check. Why is it trying to link to some
../cygopenblas_skylakexp-r0.3.13.a
when the build producedcygopenblas_haswellp-r0.3.13.a
not skylake?Probably not related though...
Looks like we forgot to update
openblas/spkg-check.in
-- it duplicates an old copy of settings made inopenblas/spkg-install.in
!! This should really go through a common script
I'll do this in #31490
Replying to @mkoeppe:
Looking at the test code, I think it's likely the run test in https://github.com/wch/r-source/blob/78f3b069d6269ddde40d76515a1eef67df674435/m4/R.m4#L2704 that's failing.
Nice call! After trying to manually compile it with what I think are all the right flags I get:
$ ./conftest.exe
Illegal instruction (core dumped)
Replying to @embray:
Replying to @mkoeppe:
Looking at the test code, I think it's likely the run test in https://github.com/wch/r-source/blob/78f3b069d6269ddde40d76515a1eef67df674435/m4/R.m4#L2704 that's failing.
Nice call! After trying to manually compile it with what I think are all the right flags I get:
$ ./conftest.exe Illegal instruction (core dumped)
This was linking against the OpenBLAS that was included in the build artifacts from GitHub. If I link against my locally built OpenBLAS it works.
So the problem is again OpenBLAS, and apparently the fact that the OpenBLAS from the stage-i-a build artifacts is not portable, and possibly running on different CPUs between jobs?
That's quite possible... Would be great if all of our problems could be fixed by making OpenBLAS properly portable!
Looking at the openblas build logs, I see it is not building with DYNAMIC_ARCH=1
. Does this mean the GH builds are not using SAGE_FAT_BINARY
? Perhaps that would fix all this.
They do use SAGE_FAT_BINARY - see top of ci-cygwin-standard.yml
Replying to @mkoeppe:
They do use SAGE_FAT_BINARY - see top of
ci-cygwin-standard.yml
I see that now.
But looking again at the openblas build log from GH one of the first lines output, from the echo
in spkg-install.in
, is:
Building OpenBLAS: make
This should say:
Building OpenBLAS: make DYNAMIC_ARCH=1
So somewhere along the way the setting is being lost...
Can you check what its value in build/bin/sage-build-env-config
is?
Replying to @embray:
But looking again at the openblas build log [...] This should say:
Building OpenBLAS: make DYNAMIC_ARCH=1
So somewhere along the way the setting is being lost...
You are right, thanks for catching this. I may have a fix for this in https://github.com/mkoeppe/sage/runs/2098475692?check_suite_focus=true
A fix is now in #29537, running in https://github.com/mkoeppe/sage/actions/runs/647547421
Dependencies: #29537, #30383
I've reduced it from "blocker" as I am hopeful that #29537 will fix it. And instead of making the packages "optional", I now propose to keep them as "standard" but disablable.
Branch pushed to git repo; I updated commit sha1. New commits:
49c10c1 | Revert "build/pkgs/{r,rpy2}: Downgrade to optional" |
Branch pushed to git repo; I updated commit sha1. Last 10 new commits:
5141bc3 | build/pkgs/widgetsnbextension: Patch out dependency on notebook (backport from widgetsnbextension-4) |
fee8379 | Merge #31278 |
82d8030 | m4/sage_spkg_enable.m4: Cosmetic change to help strings |
3e92998 | configure --disable-notebook: Fix up what packages are disabled |
74a83fc | build/pkgs/matplotlib/dependencies: Undo removal of tornado |
93d31b9 | Remove use of unicode superscripts in configure --help |
18fbb85 | m4/sage_spkg_collect.m4: Fix description of SAGE_OPTIONAL_CLEANED_PACKAGES |
c3e4093 | Rename SAGE_OPTIONAL_CLEANED_PACKAGES to SAGE_OPTIONAL_UNINSTALLED_PACKAGES |
a67300b | Merge #30383 |
f0a5fb5 | configure.ac: Add option --disable-r |
Description changed:
---
+++
@@ -1,11 +1,14 @@
-https://github.com/mkoeppe/sage/runs/1915093157
+Prompted by a build failure of `r` on `cygwin-standard` in https://github.com/mkoeppe/sage/runs/1915093157
[r-3.6.3] blas.o: in function dsymm_': [r-3.6.3] /opt/sage-1e693c9ae51a53dad0fe099efc41439dc898f02f/var/tmp/sage/build/r-3.6.3/src/src/extra/blas/blas.f:2986:(.text+0x4352): relocation truncated to fit: R_X86_64_PC32 against undefined symbol
xerbla_'
+we make it possible to disable R using a configure options so that one can get a working Sage installation in this way.
-Discussions:
+(The original build failure has hopefully resolved been resolved by #29537.)
+
+Previous tickets and discussions:
- #29486
- #29441
- #29170
There seems to be a simpler bug here that I don't think is addressed yet by any of your fixes. I didn't know there was now an --enable-fat-binary
flag to configure. I'm glad there is, and it makes sense. But perhaps because of that the SAGE_FAT_BINARY
environment variable is being ignored. I added set -x
to the splg-install for openblas and can see that its value is getting unset, which is why the CI configuration wasn't picking it up either despite having it in env
. I think, if set, SAGE_FAT_BINARY
should still override whatever was passed to configure.
Replying to @embray:
There seems to be a simpler bug here that I don't think is addressed yet by any of your fixes.
The simple bug was -- as you correctly suspected -- in my CI script. The testing is done through tox, which does not pass through environment variables (except for those explicitly configured to be passed through).
(see commit 1092d4b in #29537)
Replying to @embray:
I didn't know there was now an
--enable-fat-binary
flag to configure. I'm glad there is, and it makes sense. But perhaps because of that theSAGE_FAT_BINARY
environment variable is being ignored.
This was done in #30375. Both SAGE_FAT_BINARY
and the option --enable-fat-binary
are respected at configure
time and are made persistent for the build through build/bin/sage-build-env-config
. Mixing in the environment variable set at make
time is not done; I don't think this is a problem.
Branch pushed to git repo; I updated commit sha1. New commits:
4916415 | Merge tag '9.3.beta9' into t/30383/new_package_type__optional_enabled_by_default |
f04c134 | Merge branch 't/30383/new_package_type__optional_enabled_by_default' into t/31409/cygwin_standard__r_build_fails_____downgrade_r__rpy2_to_optional |
Replying to @mkoeppe:
Replying to @embray:
I didn't know there was now an
--enable-fat-binary
flag to configure. I'm glad there is, and it makes sense. But perhaps because of that theSAGE_FAT_BINARY
environment variable is being ignored.This was done in #30375. Both
SAGE_FAT_BINARY
and the option--enable-fat-binary
are respected atconfigure
time and are made persistent for the build throughbuild/bin/sage-build-env-config
. Mixing in the environment variable set atmake
time is not done; I don't think this is a problem.
Still, it used to be possible to set it when running make
as well. I guess it's not so important, but it feels like a regression, and was confusing to me.
Prompted by a build failure of
r
oncygwin-standard
in https://github.com/mkoeppe/sage/runs/1915093157we make it possible to disable R using a configure options so that one can get a working Sage installation in this way.
(The original build failure has hopefully resolved been resolved by #29537.)
Previous tickets and discussions:
29486
29441
29170
Depends on #29537 Depends on #30383
CC: @embray @dimpase @orlitzky @jhpalmieri @kiwifb @novoselt @EmmanuelCharpentier @videlec @kliem
Component: porting: Cygwin
Author: Matthias Koeppe
Branch/Commit:
f04c134
Reviewer: François Bissey, John Palmieri
Issue created by migration from https://trac.sagemath.org/ticket/31409