gnudatalanguage / gdl

GDL - GNU Data Language
GNU General Public License v2.0

Test failures on Arch Linux #1907

Open jkohnert opened 2 weeks ago

jkohnert commented 2 weeks ago

Hi,

I'm currently trying to update the package for Arch Linux (I'm the maintainer in the AUR).

However, there are failing tests:

159/214 Test #159: test_postscript.pro ................***Failed    1.24 sec
[kohni-mobil:31526] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.kohni-mobil.1000/jf.0/2996699136/shared_mem_cuda_pool.kohni-mobil could be created.
[kohni-mobil:31526] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728 
% Compiled module: TEST_POSTSCRIPT.
% Compiled module: GDL_IDL_FL.
% TEST_POSTSCRIPT_BASIC: writing file : GDL_ps_basic_1.ps

*** PLPLOT WARNING ***
Unable to open cmap0 file cmap0_default.pal

*** PLPLOT WARNING ***
Unable to open cmap0 file cmap0_default.pal

*** PLPLOT WARNING ***
Unable to open cmap1 .pal file cmap1_default.pal

*** PLPLOT WARNING ***
Unable to open cmap0 file cmap0_default.pal

*** PLPLOT ERROR, IMMEDIATE EXIT ***
Unable to either (1) open/find or (2) allocate memory for the font file
Program aborted

All failing tests mention this error. As far as I could find, it could be related to OpenMPI, but I'm not yet sure.

I checked: there is enough space in /tmp (16 GiB), so a lack of space does not seem to be the culprit. However, Arch currently has OpenMPI 5.0.5.

The failing tests are:

Total Test time (real) = 313.53 sec

The following tests did not run:
         21 - test_bug_3055720.pro (Skipped)
         22 - test_bug_3057511.pro (Skipped)
         23 - test_bug_3057520.pro (Skipped)
         24 - test_bug_3061072.pro (Skipped)
         30 - test_bug_3100945.pro (Skipped)
        145 - test_netcdf.pro (Skipped)

The following tests FAILED:
          2 - test_all_basic_functions.pro (Failed)
         15 - test_bug_2610174.pro (Failed)
         31 - test_bug_3104214.pro (Failed)
         48 - test_bug_3394430.pro (Failed)
         54 - test_bug_3595172.pro (Failed)
         74 - test_clip.pro (Failed)
        110 - test_grib.pro (Failed)
        112 - test_hdf5.pro (Failed)
        157 - test_pmulti.pro (Failed)
        159 - test_postscript.pro (Failed)
Errors while running CTest

Does anyone have an idea how to track down the problem?

Best, Jan

jkohnert commented 2 weeks ago

Here is a full build log of my pipeline. The build container is based on the current Arch Docker image, with some additions for building AUR packages as well.

GillesDuvert commented 1 week ago

@jkohnert Interesting, this shows a difficulty using shmap when in Kubernetes (file /tmp/ompi.runner-9kk4tsne-project-16-concurrent-0-51i5q2wa.1000/jf.0/1038286848/shared_mem_cuda_pool.runner-9kk4tsne-project-16-concurrent-0-51i5q2wa not found). On top of that, some .pal files are not found either. Maybe a maximum filename length problem?

GillesDuvert commented 1 week ago

'Normally' the dependency on the plplot library was removed in gdl 1.1, and gdl rebuilds without it on the GitHub CI environment, AFAIK. But it looks like you need its (.pal) files. I mean, gdl comes with its own .pal files, so 'normally' again it should find them. At least there are no more references to 'plplot' in any of our build and install files. This is very unsatisfactory, but you may try to install plplot and retry. This will not solve the strange 'shmap' problem, though.

jkohnert commented 1 week ago

Hey @GillesDuvert, since I get the same error on my local machine (even with plplot still available, because version 1.0.6 is installed), this seems to be a bit more complicated, at least to me.

I'm not really sure on how to debug it, though. I'll try running the built application in a debugger calling one of the failing .pro-files as soon as I have some spare time available.

Best, Jan

jkohnert commented 5 days ago

Short update: I just built without OpenMPI locally, but the tests still fail. The warning regarding shmem disappears (as expected), but the *.pal files are still not found.

Debugging reveals that the files are searched for in "DATA_DIR", which is set to "/usr/share/gnudatalanguage" (due to the install prefix given). The other options in plLibOpenPdfstrm() fail. Since I'm testing an uninstalled version, the open call on the file obviously fails and I get the error above. So I'd probably have to set an environment variable telling the local plplot version where to look for those files. I'll do some more analysis.
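The lookup described above can be sketched roughly as follows. This is only a simplified illustration, not plplot's actual code: the real plLibOpenPdfstrm() tries more locations than this, and the names open_in_dir, open_data_file_from, open_data_file and COMPILED_DATA_DIR are hypothetical. Only the PLPLOT_LIB variable and the /usr/share/gnudatalanguage path come from this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption: stands in for the compiled-in DATA_DIR mentioned above. */
#define COMPILED_DATA_DIR "/usr/share/gnudatalanguage"
#define PATH_BUF_LEN 4096

/* Try to open dir/name for reading; NULL if dir is unset or the open fails. */
static FILE *open_in_dir(const char *dir, const char *name)
{
    char path[PATH_BUF_LEN];
    int n;

    if (dir == NULL || dir[0] == '\0')
        return NULL;
    n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return NULL; /* joined path would not fit in the buffer */
    return fopen(path, "rb");
}

/* Hypothetical lookup: the override directory wins, then the fallback. */
FILE *open_data_file_from(const char *override_dir, const char *fallback_dir,
                          const char *name)
{
    FILE *f;

    assert(name != NULL);
    f = open_in_dir(override_dir, name);
    if (f == NULL)
        f = open_in_dir(fallback_dir, name);
    return f;
}

/* Convenience wrapper: override from $PLPLOT_LIB, fallback compiled in. */
FILE *open_data_file(const char *name)
{
    return open_data_file_from(getenv("PLPLOT_LIB"), COMPILED_DATA_DIR, name);
}
```

With a lookup of this shape, a call such as open_data_file("cmap0_default.pal") on an uninstalled build returns NULL unless PLPLOT_LIB points at a directory that actually contains the file, which would match the failure seen in the test log.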

Thanks again for the input. :)

jkohnert commented 5 days ago

Next update: Running PLPLOT_LIB=/home/jankoh/projects/gdl/src/plplot/data/ CTEST_OUTPUT_ON_FAILURE=1 make test works just fine (given that the source on my local machine is in /home/jankoh/projects/gdl/). I can live with that for the moment and make the Arch build work, so that I can issue the update.

Anyway, we should probably make this work without needing such an ugly hack. I'll try to make a PR.

Having said that, there's plInBuildTree() in plplot's code, probably meant to make things work when running inside the build tree. However, the lines

char currdir[PLPLOT_MAX_PATH], *pcurrdir = currdir;
char builddir[PLPLOT_MAX_PATH], *pbuilddir = builddir;

in there really look suspicious to me. If I understand them correctly, they define the char arrays currdir and builddir and, additionally, the pointers pcurrdir and pbuilddir, which point at those arrays while their contents are still uninitialized. They could be rewritten to make this a bit clearer:

char currdir[PLPLOT_MAX_PATH];
char *pcurrdir = currdir;

This doesn't segfault, since the array is declared first (and doesn't need extra allocation). But until something writes to the array, we cannot say what it contains, since that depends on what was in that part of memory beforehand. It's just random garbage. I currently highly doubt this code has ever worked at all. But since my C is a bit rusty, I might be wrong on that.
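To make the point about the combined declaration concrete, here is a minimal sketch in standard C. The function demo_combined_declaration and the constant MAX_PATH_LEN are hypothetical stand-ins (the latter for PLPLOT_MAX_PATH, whose real value is not shown in this thread). The pointer itself is well defined and refers to the first element of the array. Only the array contents are indeterminate until something fills them. In the real function that would presumably be a call that stores the current directory, but that is an assumption that has not been checked here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: stand-in for PLPLOT_MAX_PATH; the real value may differ. */
#define MAX_PATH_LEN 1024

/* Fills a local array, then reads it back through the pointer declared
 * in the same statement. Returns the length of the stored string. */
size_t demo_combined_declaration(const char *src)
{
    /* One declaration, two objects: an array, and a pointer that is
     * initialised to the address of the array's first element. */
    char currdir[MAX_PATH_LEN], *pcurrdir = currdir;

    /* The pointer value is defined even though the array is not filled. */
    assert(pcurrdir == &currdir[0]);

    /* The array contents are indeterminate here, so write before reading.
     * snprintf truncates if needed and always NUL-terminates. */
    snprintf(currdir, sizeof currdir, "%s", src);

    /* Reading through the pointer sees what was written to the array. */
    return strlen(pcurrdir);
}
```

So the two-line form and the one-line form declare the same objects. Whether the surrounding code is correct depends on whether the array is written before it is first read.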

Best, Jan

GillesDuvert commented 3 days ago

Dear Jan, I'm sorry I cannot be of more help right now, but of course I will come back to you soon. The .pal files are indeed to be found in "DATA_DIR", which is apparently set to "/usr/share/gnudatalanguage" for you. If no full installation has been performed, this directory may not contain the .pal files. If that is not the cause, there may be a (subtle?) cmake problem (we are by far not cmake experts, as a quick glance at our cmake files will immediately show!). The same would apply to other needed files, too (in particular, the .pro files in the "/usr/share/gnudatalanguage/lib" directory must be kept up to date).