ralna / GALAHAD

A library of modern Fortran modules for nonlinear optimization
https://www.galahad.rl.ac.uk

Try to debug DGO issue with Int64 #292

Closed amontoison closed 4 months ago

amontoison commented 4 months ago
 C sparse matrix indexing

 tests options for all-in-one storage format

 DGO solver, problem:  (n = 3)

At line 416 of file ../src/hash/hash.F90
Fortran runtime error: Unit number in I/O statement too large
amontoison commented 4 months ago

@nimgould I was able to connect to the virtual machine and reproduce the error with dgo.

You just need to uncomment the following lines here; after that, you can click on the yellow build of macos-13/gcc-v11/Int64. After the compilation, you will see that an ssh command to connect to the CI machine is printed every 5 seconds.

nimgould commented 4 months ago

OK, I am on ... but now what. How do I test an individual package? Am I supposed to use a meson command? I don't know what it/they are ... I'll need to keep editing files, recompiling them and then run the dgo test. Sorry, I need help to proceed. Oh, and I see that the shell has no emacs, so I'll be pretty helpless

nimgould commented 4 months ago

Sorry, just read the README, now I see how to do this. Still no usable editor, though. And the issue on macOS is to do with ssids, not dgo. Indeed, none of the failures are now for dgo, sheesh, this action system is so maddening!

I've now re-commented the ssh workflow out.

nimgould commented 4 months ago

I am trying to see what is going wrong with nvfortran. I tried this locally:

CC=nvc CXX=nvc++ FC=nvfortran meson setup builddir/pc64.lnx.nvf_64 -Dc_std=none -Dcpp_std=none -Dgalahad_int64=true
meson compile -C builddir/pc64.lnx.nvf_64

... which is ok until

[519/1348] Compiling Fortran object li...on-generated_single_cutest_dummy.f90.o
FAILED: libgalahad_single_64.so.p/meson-generated_single_cutest_dummy.f90.o libgalahad_single_64.so.p/galahad_cutest_single_64.mod
nvfortran -Ilibgalahad_single_64.so.p -I. -I../.. -Iinclude -I../../include -I../../src/dum/include -I../../src/metis/include -Isrc/ampl -I../../src/ampl -I/usr/lib/x86_64-linux-gnu/openmpi/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -I/usr/lib/x86_64-linux-gnu/openmpi/lib -O3 -mp -fPIC -DSINGLE -DDUMMY_MKL_PARDISO -DDUMMY_PARDISO -DDUMMY_PASTIXF -DDUMMY_SPMF -DDUMMY_WSMP -DDUMMY_HSL -DINTEGER_64 -module libgalahad_single_64.so.p -o libgalahad_single_64.so.p/meson-generated_single_cutest_dummy.f90.o -c libgalahad_single_64.so.p/single_cutest_dummy.f90
NVFORTRAN-S-0034-Syntax error at or near / (libgalahad_single_64.so.p/single_cutest_dummy.f90: 644)
NVFORTRAN-S-0034-Syntax error at or near / (libgalahad_single_64.so.p/single_cutest_dummy.f90: 646)
....etc

On examining the generated libgalahad_single_64.so.p/single_cutest_dummy.f90 file, I see on line 644 and onwards that it has inserted the cutest_routines.h header file verbatim, i.e., / \file cutest_routines.h /

/*

/*

Poor old Fortran can make no sense of this, and it doesn't happen with other compilers (they leave the cpp header files alone).

Any ideas?

nimgould commented 3 months ago

I commented out the NVIDIA tests as the compiler clearly has issues and isn't ready for proper deployment; it was unable to resolve generic interfaces in many places (and all the other compilers had no issues).

amontoison commented 3 months ago

> Sorry, just read the README, now I see how to do this. Still no usable editor, though. And the issue on macOS is to do with ssids, not dgo. Indeed, none of the failures are now for dgo, sheesh, this action system is so maddening!
>
> I've now re-commented the ssh workflow out.

I think the best solution is to add several print statements to isolate the issue. But it can wait until next week... :)

amontoison commented 3 months ago

@nimgould I wonder whether the issue is simply that the WRITE statement in Fortran is platform-dependent. I suspect that the channel (unit number) can be a 4-byte or 8-byte integer only on Linux, while other platforms require a 4-byte integer. That could explain why we have an error at line 416 of hash.F90 (control%out is an 8-byte integer).

nimgould commented 3 months ago

That is possible, I suppose, but then why doesn't the compiler object that the variable is the wrong type for the write statement? Moreover, this would be true for all write statements (in both HSL and GALAHAD), and we don't see warnings from any other runs. I will output the variables before the write to check.
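The kind of check described above ("output the variables before the write") could look roughly like the following on the C side. This is only a sketch: the helper names `unit_is_plausible` and `report_unit` are hypothetical, and it assumes that the runtime rejects unit numbers outside the non-negative 32-bit range, which is consistent with the "Unit number in I/O statement too large" message but is not confirmed by anything in this thread.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of GALAHAD): returns 1 if a 64-bit unit
 * number is non-negative and fits in a 32-bit integer, 0 otherwise.
 * Treating negative values as suspicious is an assumption of this sketch. */
int unit_is_plausible(int64_t unit)
{
    return unit >= 0 && unit <= INT32_MAX;
}

/* Hypothetical helper: prints the value to stderr before it is used, so
 * that an uninitialized or corrupted unit number is visible in the test
 * log. Returns the result of the plausibility check. */
int report_unit(const char *label, int64_t unit)
{
    int ok = unit_is_plausible(unit);
    fprintf(stderr, "%s = %lld (%s)\n", label, (long long)unit,
            ok ? "plausible" : "suspicious");
    return ok;
}
```

Calling `report_unit("control.out", control.out)` just before handing the control structure to the Fortran side would show straight away whether the value is a sensible unit such as 6 or a random 64-bit pattern.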

nimgould commented 3 months ago

Ah ha, bug splatted. It was simply that in the C interface, I had commented out the copy of the hash control components from C to Fortran, so they took random values!

Of the two remaining failures, both are timeouts. The Windows one looks like it needs a bit more time; not sure about the Mac one, though. I cannot reproduce it here, as the same Mac test seems to work.

nimgould commented 3 months ago

OK, doubling the timeout cured the Windows issue. Unfortunately, now one of the Ubuntu Intel ones is failing (odd that it didn't before, and all that has changed is the timeout!) when testing the Julia interface. I can see why that might be, and can put in a precaution. The other timeout failure, on the Mac, produces no output from the test (for sbls), so I can't say what is happening.

nimgould commented 3 months ago

"Precaution" works, but now another timeout for the Windows 64bit. Will this cycle of inconsistent runtimes ever cease ... I'll double the timeout and try again ...

nimgould commented 3 months ago

I give up ... the more I increase the timeout period, the more runs time out

nimgould commented 3 months ago

Is there something wrong with these Windows virtual machines? Timeout for nlst_single after 120 seconds, while for the Mac and Ubuntu the run is 0.4 seconds

nimgould commented 3 months ago

And now, not changing a thing, the times dropped to 1 second, and the tests passed. So, only the Mac issue to sort out.

amontoison commented 3 months ago

> I give up ... the more I increase the timeout period, the more runs time out

If we have a timeout, it means that we have an infinite recursion during the test. Are some tests with random values?

nimgould commented 3 months ago

No, this is all deterministic. Times vary considerably during both compilation and runs.

nimgould commented 3 months ago

Sometimes it times out, other times it doesn't, with run times differing by a factor of 10.