ralna / spral

Sparse Parallel Robust Algorithms Library
https://ralna.github.io/spral/

SSMFE C interface segfaults on Windows #156

Closed jfowkes closed 10 months ago

jfowkes commented 11 months ago

Moving over to meson has enabled us to test on Windows and this has exposed a segfault in the SSMFE C interface:

test:         ssmfet_c
start time:   14:05:58
duration:     0.05s
result:       (exit status 3221225725 or signal 3221225597 SIGinvalid)

Not entirely sure what these strangely large exit statuses mean. @amontoison?

amontoison commented 11 months ago

@jfowkes I tried to find something with the highest warning level in #159 but found nothing :( Maybe you could try https://github.com/ralna/GALAHAD/pull/108?

jfowkes commented 11 months ago

@amontoison good shout, will see if I can run some sanitisers...

jfowkes commented 11 months ago

@amontoison I'm getting:

FAILED: libspral.dll 
"gfortran" @libspral.dll.rsp
c:/programdata/chocolatey/lib/mingw/tools/install/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: 
cannot find -lasan: No such file or directory

c:/programdata/chocolatey/lib/mingw/tools/install/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: 
cannot find -lubsan: No such file or directory

I guess the sanitizers don't work on Windows?

amontoison commented 10 months ago

I checked online and the sanitizers do not work with GCC on Windows: https://stackoverflow.com/questions/55018627/cannot-find-lasan-using-address-sanitizer-in-mingw-in-windows-mingw Maybe we should split ssmfet_c into smaller unit tests to isolate the issue?

jfowkes commented 10 months ago

Good plan! I will have a go next week at splitting up the ssmfet_c tests to try and isolate the issue.

jfowkes commented 10 months ago

@amontoison here is the C main function for the SSMFE test:

int main(void) {

  int errors = 0;
  int err;

  fprintf(stdout, "testing ssmfe_core...\n");
  err = test_core();
  errors += err;
  fprintf(stdout, "%d errors\n", err);

  fprintf(stdout, "testing ssmfe_expert...\n");
  err = test_expert();
  errors += err;
  fprintf(stdout, "%d errors\n", err);

  fprintf(stdout, "testing ssmfe...\n");
  err = test_ssmfe();
  errors += err;
  fprintf(stdout, "%d errors\n", err);

  fprintf(stdout, "=============================\n");
  fprintf(stdout, "Total number of errors = %d\n", errors);

  return errors;
}

Why are we not seeing the first print line (testing ssmfe_core...) being printed in the logs on Windows? Is this because the test errors out before even getting to this line?

amontoison commented 10 months ago

Can you comment out the first test, test_core()? I suspect that test_core() fails and the value err is never set by this function.

jfowkes commented 10 months ago

Indeed, that appears to be the case: I've just flushed the print statements in main and I get:

----------------------------------- stdout -----------------------------------
testing ssmfe_core...

before it crashes. I will add some more flushes to test_core to try to isolate the issue.

jfowkes commented 10 months ago

@amontoison okay I have tracked this issue down to the following VLA allocation in the test_core_z double complex test routine:

double complex X[n][n];        /* eigenvectors storage */

where n=400 so this tries to allocate a 400x400 double complex VLA. So it looks like we're getting a stack overflow, is the Windows stack just really tiny or something??

EDIT: according to my calculations the size of X is only 2.56 MB!

amontoison commented 10 months ago

I looked online a little and it seems VLAs might not be supported by default without the preprocessor macro __STDC_VLA__.

https://groups.google.com/g/comp.std.c/c/AoB6LFHcd88

jfowkes commented 10 months ago

I think it's actually the other way around: https://stackoverflow.com/questions/66246821/what-is-the-motivation-behind-stdc-negative-definitions-for-example-stdc-no-v

jfowkes commented 10 months ago

So what you're saying is that on Windows MinGW defines __STDC_NO_VLA__? I find that hard to believe...

amontoison commented 10 months ago

MSVC does not support VLAs, so that could explain why gcc on Windows defines it.

amontoison commented 10 months ago

Is it not possible to remove VLAs? https://en.m.wikipedia.org/wiki/Variable-length_array

jfowkes commented 10 months ago

I don't think VLAs are the issue, since the test_core_d double test routine has:

double X[n][n];                /* eigenvectors storage */

and this passes the test on Windows (see the log in #162)!!