lesgourg / class_public

Public repository of the Cosmic Linear Anisotropy Solving System (master for the most recent version of the standard code; GW_CLASS to include Cosmic Gravitational Wave Background anisotropies; classnet branch for acceleration with neural networks; ExoCLASS branch for exotic energy injection; class_matter branch for FFTlog)

Class unstable wrt Omega_k computations? #505

Closed borisbolliet closed 1 year ago

borisbolliet commented 1 year ago

Hi,

I noticed a strange behaviour with the Python wrapper of CLASS v3.2.0 (the calculation via the C code / ini file seems OK).

When I set Omega_k to a non-zero value, I often get NaNs: sometimes they appear, sometimes not, for the exact same computation. Am I misinitializing something?

Here is an example of the strange behaviour:

[Screenshot of the notebook output, 2022-12-14]

I attach the notebook code below.

Note that this problem seems to be present in CLASS v2.9.4 too. I have checked that it is not specific to a particular installation of CLASS, as I find the same bug on two different platforms.

Boris

import numpy as np
import classy
from classy import Class

print(classy.__version__)

for i in range(10):
    cosmo = Class()
    cosmo.set({
        'output': 'tCl pCl lCl mPk',
        'lensing': 'yes',
        'l_max_scalars': 1000,
        'ln10^{10}A_s': 3.2086614173228347,
        'n_s': 1.1370078740157479,
        'H0': 80.63346456692913,
        'omega_b': 0.024196141732283464,
        'omega_cdm': 0.11212598425196851,
        'Omega_k': -0.2858267716535433,
        'tau_reio': 0.09559055118110235,
    })
    cosmo.compute()
    cls = cosmo.lensed_cl(lmax=1000)
    if np.isnan(np.sum(cls['tt'])):
        print('nan at i=', i)
schoeneberg commented 1 year ago

Thank you @borisbolliet for reporting this bug. This is indeed a bit hard to debug! The problem is somewhere in the computation of PhiL in tools/hyperspherical.c, which is not written correctly in this case. I managed to reproduce the issue with the pure C code as well, so it is not a wrapper issue. Beyond that, I will try to see where exactly the issue arises, but it's definitely not simple to debug. Here's the valgrind stack trace for anyone else interested:

==491435== Conditional jump or move depends on uninitialised value(s)
==491435==    at 0x4A4EB54: __printf_fp_l (printf_fp.c:723)
==491435==    by 0x4A684B9: __vfprintf_internal (vfprintf-internal.c:1687)
==491435==    by 0x4B1F122: __fprintf_chk (fprintf_chk.c:33)
==491435==    by 0x1EFF15: fprintf (stdio2.h:100)
==491435==    by 0x1EFF15: output_one_line_of_cl (output.c:1690)
==491435==    by 0x1F06A6: output_cl (output.c:538)
==491435==    by 0x1F3378: output_init (output.c:140)
==491435==    by 0x10AB29: main (class.c:72)
==491435==  Uninitialised value was created by a heap allocation
==491435==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==491435==    by 0x12C9F3: hyperspherical_HIS_create._omp_fn.0 (hyperspherical.c:142)
==491435==    by 0x49C18E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==491435==    by 0x131F44: hyperspherical_HIS_create (hyperspherical.c:137)
==491435==    by 0x1C0298: transfer_update_HIS (transfer.c:4418)
==491435==    by 0x1C0875: transfer_init._omp_fn.0 (transfer.c:356)
==491435==    by 0x49C18E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==491435==    by 0x1C32EF: transfer_init (transfer.c:316)
==491435==    by 0x10AA8C: main (class.c:52)

The referenced heap allocation is this line: class_alloc_parallel(PhiL,(lmax+2)*sizeof(double)*_HYPER_CHUNK_,error_message);

Since the corresponding free is already on line 231, this limits the bug to PhiL not being correctly determined in the forwards_recurrence_chunk or backwards_recurrence methods, with the incorrect values then being written to either pHIS->phi or pHIS->dphi. Getting valgrind to zero in more closely on the exact line is a bit hard; I will try a bit more with gdb.
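
For reference, the class of error valgrind is reporting boils down to the pattern in the minimal standalone sketch below. This is not CLASS code: the names and sizes are made up, with CHUNK standing in for _HYPER_CHUNK_, and it only illustrates the "heap buffer allocated but not fully written, never-written element used later" pattern that the trace above points at.

#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4   /* stands in for _HYPER_CHUNK_ */

int main(void) {
  int lmax = 10;
  /* allocation with the same shape as in hyperspherical.c: (lmax+2)*CHUNK doubles */
  double *PhiL = malloc((lmax + 2) * sizeof(double) * CHUNK);

  /* only a single element of the buffer is ever written ... */
  PhiL[lmax + 1] = 0.;

  /* ... but a different, never-written element is read later */
  double value = PhiL[(lmax + 1) * CHUNK + 1];

  /* printing and branching on it is what memcheck flags as
     "Conditional jump or move depends on uninitialised value(s)";
     with --track-origins=yes it points back at the malloc above */
  printf("value = %e\n", value);
  if (value != value)
    printf("value is NaN\n");

  free(PhiL);
  return 0;
}

Compiled with -g and run under valgrind --track-origins=yes, this should produce the same pair of messages as in the trace above, with the printed output depending on whatever happens to sit in the uninitialised memory.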

schoeneberg commented 1 year ago

Dear @borisbolliet, yes, it seems like this is a bug that @ThomasTram (as the author of this module) should also be made aware of. The problem is that when sgnK=1 and lmax = beta-1, the code only sets PhiL[lmax+1] = 0. (line 150), but PhiL[(lmax+1)*current_chunk+index_x] also needs to be zero for line 227, where the bug arises (and this is currently not done). For now, a simple fix is to insert the following right before line 150:

for(j=0;j<(lmax+2)*_HYPER_CHUNK_;++j){PhiL[j]=0.0;}

I will also implement it, so the next CLASS version should hopefully no longer have this problem. I checked both that valgrind no longer complains and that your Python test script now runs through without problems.

ThomasTram commented 1 year ago

@borisbolliet Thanks for finding this! That was a really nasty bug: it was intermittent because it only happened when a certain location in memory happened to be NaN. @schoeneberg found the right location: the problem occurs when you hit the "ceiling" in closed models, where we have to reduce lmax by 1, which means that the derivative calculation could be wrong. However, it is correct if we set the value to zero, hence the PhiL[lmax+1] = 0. statement.

However, as @schoeneberg also found, many locations in that array have to be set to 0, since the chunk size can vary. Thus, the solution is indeed to zero the whole array in this situation, just as @schoeneberg suggested.
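
To make the indexing explicit, here is a tiny standalone sketch (not CLASS code, and assuming only the layout implied by the indices quoted above, i.e. multipole l at chunk point index_x stored at l*current_chunk + index_x): it prints the single index written by the old statement next to the indices actually read later, and the two only coincide for current_chunk == 1, index_x == 0.

#include <stdio.h>

int main(void) {
  int lmax = 10;
  /* the old code writes only PhiL[lmax+1], while the derivative step
     reads PhiL[(lmax+1)*current_chunk + index_x] for the whole chunk */
  for (int current_chunk = 1; current_chunk <= 4; current_chunk++)
    for (int index_x = 0; index_x < current_chunk; index_x++)
      printf("current_chunk=%d index_x=%d : written index %2d, read index %2d\n",
             current_chunk, index_x,
             lmax + 1, (lmax + 1) * current_chunk + index_x);
  return 0;
}

Since the chunk size is not fixed, zeroing the whole (lmax+2)*_HYPER_CHUNK_ buffer, as in the one-line fix above, covers every case.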

borisbolliet commented 1 year ago

Thank you so much for addressing this! I guess I can just implement that then; closing the issue for now :)