iains / gcc-darwin-arm64

GCC master branch for Darwin with experimental support for Arm64. Currently GCC-15.0.0 [September 2024]
GNU General Public License v2.0

Maybe issue with MATMUL and -fexternal-blas using Accelerate framework #110

Open franke-biosaxs opened 1 year ago

franke-biosaxs commented 1 year ago

I am on Ventura 13.4 using M1, with FX's 12.2 gfortran binaries installed. Largely things seem to work fine, except for one issue that is driving me mental: non-trivial uses of MATMUL with -fexternal-blas -framework Accelerate tend to end up in:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=257, address=0x3)
    frame #0: 0x0000000000000003
error: memory read failed for 0x0
Target 0: (bunch) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=257, address=0x3)
  * frame #0: 0x0000000000000003
    frame #1: 0x00000001579a53b8 libgfortran.5.dylib`_gfortran_matmul_r8 + 9736
[...]

Using the libgfortran variants of MATMUL works fine (albeit presumably slower). So do the MATMUL calls in question on Intel Macs, and on Linux with BLAS from Intel MKL. That said, the problem does not appear for all MATMUL calls. Anecdotally, it shows up when one argument is an approximation of the identity matrix (think rotations with near-or-exactly zero Euler angles). Needless to say, every trivial test program I tried to create to isolate the issue works just fine. I have observed, though, that print *, A (A being the calculated rotation matrix) just before C=MATMUL(A,B) seems to prevent the crash.

I would keep looking for issues on my side, but recently I found the Ventura 13.3 release notes:

The BLAS and LAPACK libraries under the Accelerate framework are now inline with reference version 3.9.1. These new interfaces provide additional functionality and a new ILP64 interface. To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers. For ILP64 interfaces, also define ACCELERATE_LAPACK_ILP64. For Swift projects, specify ACCELERATE_NEW_LAPACK=1 and ACCELERATE_LAPACK_ILP64=1 as preprocessor macros in Xcode build settings. (105572917)

Hence I wonder if the Accelerate libraries may have changed in significant enough ways that the gfortran binaries are incompatible somehow (gfortran November 2022, Ventura 13.3 from March 2023)?

I would appreciate any comments or insights on what might be going on. Thank you!

P.S. If there is anything a volunteer fluent in Fortran, but with no experience in ARM, can do to help the project get included into mainline gcc, let me know.

iains commented 1 year ago

apropos the bug:

It would be very helpful if possible to persevere with getting a reproducer - I will take a look at the _gfortran_matmul_r8 code and see if there's anything in the "known issues" that could explain it.

In the absence of a repeatable behaviour, one might look for other explanations - e.g. some uninitialised or random effect that happens to manifest on arm64 + accelerate .. but not elsewhere.

Unless your gfortran has been rebuilt since the new inlining came into force, I'd not expect that to make any difference ... but IDK when this change occurred.

apropos upstreaming: it's down to me finding time ... or a client that wants to pay for it.

franke-biosaxs commented 1 year ago

Hi Ian. Thanks for your quick reply. I'm aware that as-is, this is barely a useful report. I was hoping for an "ah, yes, seen that, it is about XYZ" kind of solution. As I've got a free afternoon, I will try to cut down one of the non-trivial cases. The only "good" thing here is that if it happens, it happens every time. Nothing random about that part. Will come back in a while.

franke-biosaxs commented 1 year ago

Got it. It has to be two compilation units, I couldn't make it happen in one. And secondly, the option -Og is required. Files here.

The test case does not crash if either (1) -fexternal-blas is omitted or (2) -Og is omitted from FCFLAGS.

% bash -x build.sh
+ rm -f a.o b.o testcase
+ FCFLAGS='-fbacktrace -fcheck=all -g -Og -fexternal-blas'
+ gfortran -fbacktrace -fcheck=all -g -Og -fexternal-blas -c b.f90 -o b.o
+ gfortran -fbacktrace -fcheck=all -g -Og -fexternal-blas -c a.for -o a.o
+ gfortran a.o b.o -o testcase -framework Accelerate
+ ./testcase

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x104d3a677
#1  0x104d39653
#2  0x19bc22a23
#3  0x104dbd3b7
build.sh: line 10: 29699 Segmentation fault: 11  ./testcase
% lldb testcase 
(lldb) target create "testcase"
Current executable set to '/Users/franke/git/atsas-testsuite-branch/build/atsas/bunch/testcase' (arm64).
(lldb) run
Process 29703 launched: '/Users/franke/git/atsas-testsuite-branch/build/atsas/bunch/testcase' (arm64)
Process 29703 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000000000008
error: memory read failed for 0x0
Target 0: (testcase) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000000000008
    frame #1: 0x00000001003c13b8 libgfortran.5.dylib`_gfortran_matmul_r8 + 9736
    frame #2: 0x0000000100003b78 testcase`MAIN__ at a.for:13:72
    frame #3: 0x0000000100003d88 testcase`main at a.for:2:9
    frame #4: 0x000000019b89bf28 dyld`start + 2236

iains commented 1 year ago

@fxcoudert - have you seen any other reports and/or do you have any comments on this?

fxcoudert commented 1 year ago

The backtrace is as follows:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x0000000000000008
    frame #1: 0x00000001003c6bf4 libgfortran.5.dylib`_gfortran_matmul_r8 + 6564
    frame #2: 0x0000000100003b78 testcase`MAIN__ at a.for:13:72
    frame #3: 0x0000000100003d88 testcase`main at a.for:2:9
    frame #4: 0x0000000191667f28 dyld`start + 2236

I can make a further reduced test case:

$ cat b.f90 
program testcase
  implicit none
  real :: rotmat(3,3), xyz(3, 3)
  xyz = 0
  rotmat = 0
  print *, matmul(rotmat, xyz)
end program
$ gfortran -fexternal-blas -Og b.f90 -framework Accelerate && ./a.out
zsh: segmentation fault  ./a.out

and there the backtrace is essentially the same (now in _gfortran_matmul_r4, since the arrays are default real):

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x16fdff0a0)
  * frame #0: 0x000000016fdff0a0
    frame #1: 0x000000010029c4c8 libgfortran.5.dylib`_gfortran_matmul_r4 + 6936
    frame #2: 0x0000000100003e6c a.out`MAIN__ at b.f90:6:30
    frame #3: 0x0000000100003ef4 a.out`main at b.f90:7:11
    frame #4: 0x0000000191667f28 dyld`start + 2236

We must be doing something wrong/weird with Accelerate, but I don't know what. I cannot reproduce when linking against openblas, though…

iains commented 1 year ago

Have any of the interfaces changed? ... or is this a case where our D.3 matters (although, I think it used to work; Dominique gave me a benchmark code early in the development which contrasted external / internal perf.).

edit: it might also help to enumerate the optimisations in effect with "Og" and see if we can pin down which is causing the issue.

fxcoudert commented 1 year ago

BLAS prototypes pass everything as pointer, so I don't think D.3 matters:

typedef void (*blas_call)(const char *, const char *, const int *, const int *,
                          const int *, const GFC_REAL_4 *, const GFC_REAL_4 *,
                          const int *, const GFC_REAL_4 *, const int *,
                          const GFC_REAL_4 *, GFC_REAL_4 *, const int *,
                          int, int);

Wait. Those two int at the end are weird. I mean, they are unused arguments (they are placeholders for the length of the Fortran strings passed as char *, whose length is known to be 1), but… Apple's prototype doesn't have those. It has:

int sgemm_(char *transa, char *transb, int *m, int *n, int *k,
           float *alpha, float *a, int *lda, float *b, int *ldb,
           float *beta, float *c__, int *ldc)

I've seen that before, and on most arches, this does not actually create any trouble. Could it be that on the aarch64-darwin ABI it creates an issue?

Edit: nope, I tried changing that, and it does not make the bug go away.

fxcoudert commented 1 year ago

Probably the next step is to make the Fortran code directly call sgemm from Accelerate, and see if that fails. If so, report it to Apple, because then it's clearly an Accelerate bug.
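
A minimal sketch of what such a direct call looks like, written in C to match the prototypes quoted above rather than in Fortran. Everything named here is an assumption for illustration: ref_sgemm is a hypothetical stand-in with the same 13-argument shape as the quoted sgemm_ prototype, and it only covers the untransposed case. zero_case mirrors the reduced test case (3x3 all-zero rotmat and xyz). To exercise Accelerate itself, one would presumably swap the stand-in for an extern declaration of sgemm_ and link with -framework Accelerate; that variant has not been tried here.

```c
/* Hypothetical stand-in with the same argument shape as the sgemm_
   prototype quoted above. Only the untransposed ('N','N') case is
   handled: C = alpha*A*B + beta*C on column-major arrays. */
int ref_sgemm(const char *transa, const char *transb,
              const int *m, const int *n, const int *k,
              const float *alpha, const float *a, const int *lda,
              const float *b, const int *ldb,
              const float *beta, float *c, const int *ldc)
{
    if (*transa != 'N' || *transb != 'N')
        return -1;                       /* not covered by this sketch */

    for (int j = 0; j < *n; j++) {
        for (int i = 0; i < *m; i++) {
            float acc = 0.0f;
            for (int p = 0; p < *k; p++)
                acc += a[i + p * *lda] * b[p + j * *ldb];

            float *cij = &c[i + j * *ldc];
            /* As in BLAS, beta == 0 means C is not read on input. */
            *cij = *alpha * acc + (*beta == 0.0f ? 0.0f : *beta * *cij);
        }
    }
    return 0;
}

/* Mirrors the reduced test case: rotmat and xyz are 3x3 and all zero.
   Writes the product to out (9 floats, column-major) and returns 0
   on success. */
int zero_case(float out[9])
{
    const int n = 3;
    const float alpha = 1.0f, beta = 0.0f;
    float rotmat[9] = {0}, xyz[9] = {0};

    /* Every argument is passed by address, as in the Fortran interface. */
    return ref_sgemm("N", "N", &n, &n, &n, &alpha, rotmat, &n,
                     xyz, &n, &beta, out, &n);
}
```

Calling zero_case from a small main and printing out gives a baseline to compare against: if the same call shape crashes once the stand-in is replaced by the framework's routine, the problem is on the Accelerate side rather than in libgfortran's matmul wrapper.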

iains commented 1 year ago

OK - but there is some dependency on the optimisation, correct? If we lost (or changed) the prototype somehow then that could cause issues, I suppose - since un-named parameters are passed differently from named ones.