mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions
GNU General Public License v2.0
13 stars 12 forks source link

Some tests are failing on aarch64 and s390x architecture on Fedora #3

Closed mattiaverga closed 1 year ago

mattiaverga commented 2 years ago

Several test_nufft_1d are failing on i686 architecture, for example:

>       dirty2 = ducc0.nufft.nu2u(points=ms, coord=uvw, forward=forward, epsilon=epsilon, nthreads=nthreads, out=dirty2).astype("c16")
E       RuntimeError: 
E       ./src/ducc0/nufft/nufft.h: 586 (ducc0::detail_nufft::Params1d<Tcalc, Tacc, Tms, Timg, Tcoord>::Params1d(const ducc0::detail_mav::cmav<Tcoord, 2>&, const ducc0::detail_mav::cmav<std::complex<Tms>, 1>&, ducc0::detail_mav::vmav<std::complex<Tms>, 1>&, const ducc0::detail_mav::cmav<std::complex<Timg>, 1>&, ducc0::detail_mav::vmav<std::complex<Timg>, 1>&, double, bool, size_t, size_t, double, double) [with Tcalc = double; Tacc = double; Tms = double; Timg = double; Tcoord = double; size_t = unsigned int]):
E       
E       Assertion failure
E       nu too large

See https://kojipkgs.fedoraproject.org//work/tasks/669/90470669/build.log for the full build logs.

I wonder if I can just disable tests on i686 or stop providing ducc0 on Fedora i686? Is ducc0 reliable on that architecture?

mreineck commented 2 years ago

Thank you for reporting this! I think I understand the problem and should have a fix soon.

Unfortunately testing on 32bit is a bit of a hassle for me, so I don't do this very often...

mreineck commented 2 years ago

In fact I think this is already fixed in my current development version; I'll double-check immediately!

QuLogic commented 2 years ago

I wonder if I can just disable tests on i686 or stop providing ducc0 on Fedora i686? Is ducc0 reliable on that architecture?

You can do this without any extra bureaucracy on F37 if you have no dependents: https://fedoraproject.org/wiki/Changes/EncourageI686LeafRemoval

mattiaverga commented 2 years ago

With 0.25.0 tests are passing on i686. However, as QuLogic pointed out, there's small/no interest in continue building i686 packages on Fedora, so I'm going to disable it. Also, I'll probably going to make ducc0 available on EPEL9, since specfile from Fedora should be compatible.

mattiaverga commented 2 years ago

Argh, I've spoken too early: now test_nufft_2d fail on aarch64 and s390x. https://koji.fedoraproject.org/koji/taskinfo?taskID=90752838

mreineck commented 2 years ago

Fascinating ... I don't have access to these platforms, but maybe I can identify the problem from the error messages!

QuLogic commented 1 year ago

@mattiaverga the builds get garbage collected, so you really should post the logs here as now they are gone.

PS, you should open a bug when using ExcludeArch.

mattiaverga commented 1 year ago

Version 0.27 still shows failing tests, the logs are available in a COPR build. I've filed ExcludeArch bug on RH Bugzilla.

In the past, I remember the was the availability of some testing instance internal of the Fedora infrastructure to debug these kind of problems for developers that have no access to those architectures to test their code, @QuLogic do you know anything about it?

QuLogic commented 1 year ago

The test machines are listed here. You should be able to run a mock shell there and do some investigative tests with it.

mreineck commented 1 year ago

From my point of view it's fine to exclude the failing archs for now. I most likely won't have enough spare time to analyze what's going on ... also I'd need to join the Fedor project first to get access to the test machines.

Still, the logs helped me a lot to narrow down the cause of the problem, and I will have a very close look at the code parts in question when I have more time. Hopefully I'll spot and fix the nonportable code at some point.

mreineck commented 1 year ago

I have recently seen (and fixed) problems on i686 platforms which looked very similar to those reported here. With any luck, the next release (0.29) should pass Fedora testing on all platforms. This will probably be released in a few weeks.

mattiaverga commented 1 year ago

No luck: I tried a build from latest snapshot and the tests are still failing for aarch64 and s390x. Tests successful: ppc64le x86_64

Test failures: aarch64 s390x

Both ppc64le and s390x shows some -Wstringop-overflow= warnings during build, but ppc64le test works, so I don't think it is related. In fact, both x86_64 and aarch64 don't show those warnings, but the tests results are different.

mreineck commented 1 year ago

That's a pity ... I'll see if I can track down the stringop warnings, even if they are unrelated. Concerning the failing platforms: do they by any chance have sizeof(int)==8? Perhaps I'm implicitly assuming sizeof(int)==4 somewhere in my bit manipulations; that would of course be great to find and fix!

mreineck commented 1 year ago

The stringop warnings are indeed unrelated; I have seen similar warnings before and managed to silence them, but they are harmless, as far as I can see.

The failures are most likely caused by some type-punning in the 2D NUFFTs that's perhaps too radical. I think I found a way to avoid this, and the extra memcpy should not be too expnsive.

@mattiaverga , could you please try the current HEAD of the ducc0 branch?

mattiaverga commented 1 year ago

@mreineck the tests run on a build of the current HEAD are successful: https://copr.fedorainfracloud.org/coprs/mattia/Astronomy/build/5504194/

I've also managed to "package" a small script I've found online to just output sizeof(int) at build time, I'm not sure it is right:

#include <stdio.h>

int main()
{
    printf("Size of (int) = %lu"
        " bytes\n",
        sizeof(int));

    printf("Size of (int*) = %lu"
        " bytes\n",
        sizeof(int*));

    return 0;
}

all architectures output:

Size of (int) = 4 bytes
Size of (int*) = 8 bytes
mreineck commented 1 year ago

Thanks a lot for testing! This is good news, even though the new implementation is a bit slower than the one before (on the platforms where it worked ;-) I'll stick to the new version then, and see if I can tweak it some more in the future!

mreineck commented 1 year ago

Closing for now. Please reopen whenever new problems turn up!

mreineck commented 1 year ago

Version 0.29 is out. Let's hope the build goes smoothly!

mattiaverga commented 1 year ago

Thanks, yes the build and the tests completed fine on all architectures and I additionally released ducc0 in EPEL9 repositories.