parameter set causing "Segmentation fault" in python wrapper

lukashergt commented 3 years ago

The following parameter set causes a Segmentation fault in the 'background' module (both python wrapper and command line call to CLASS):

Example python script to reproduce the error:

from classy import Class

paramdict = {'A_s': 1.9277426921445484e-09, 
             'n_s': 0.9451570080189712, 
             '100*theta_s': 1.045577375117725, 
             'omega_b': 0.017612175306787197, 
             'omega_cdm': 0.30057249803499375, 
             'm_ncdm': 0.06, 
             'tau_reio': 0.3055598544108618, 
             'N_ncdm': 1, 
             'N_ur': 2.0328, 
             'output': 'tCl'}

cosmo = Class()
cosmo.set(paramdict)
cosmo.compute(['background'])

I've chased it down to line 1355 in tools/evolver_ndf15.c:

  1339    else{
  1340      /*Normal case:*/
  1341      for(j=1;j<=neq;j++){
  1342        Fdiff_new = 0.0;
  1343        Fdiff_absrm = 0.0;
  1344        for(i=1;i<=neq;i++){
  1345          Fdiff_absrm = MAX(fabs(Fdiff_new),Fdiff_absrm);
  1346          Fdiff_new = nj_ws->ydel_Fdel[i][j] - fval[i];
  1347          dFdy[i][j] = Fdiff_new/nj_ws->del[j];
  1348          /*Find row maximums:*/
  1349          if(fabs(Fdiff_new)>=Fdiff_absrm){
  1350            /* Found new max location in column */
  1351            nj_ws->Rowmax[j] = i;
  1352            nj_ws->Difmax[j] = fabs(Fdiff_new);
  1353          }
  1354        }
  1355        nj_ws->absFdelRm[j] = fabs(nj_ws->ydel_Fdel[nj_ws->Rowmax[j]][j]);
  1356      }
  1357    }

nj_ws->Rowmax[j] returns some random number, so I guess j overshoots the Rowmax array somehow. I'm a bit lost with all these indices. Can anyone else reproduce this? Anybody an idea what might be going wrong here?

Versions:

Python 3.8.2
classy v3.0.1

pstoecker commented 3 years ago

Hi @lukashergt ,

this is interesting. I have seen a similar issue in the context of energy injection with class v3.0.0 and I know exactly what goes wrong here. For your parameter combination, some, if not all, entries of the DE system that will be evolved by ndf15 contain a NaN such that the calculation that determines the value of nj_ws->Rowmax[j] fails and it leaves nj_ws->Rowmax[j] uninitialised such that you go out of bounds of the nj_ws->ydel_Fdel array and you access random data.

This can be fixed by explicitly checking the derivatives and throwing an error if they are ill-defined.

Best, Patrick

fruzsinaagocs commented 3 years ago

Hi @pstoecker and @lukashergt,

I tried reproducing this issue to see if my fix over at https://github.com/GambitBSM/gambit/issues/278 worked here. I couldn't reproduce the issue with either of these:

OS: Ubuntu 20.04 (personal machine), CentOS 8 (the Gadi cluster)
compiler: gcc 8.3.1, 9.3.0
Python: 3.7.4 (but also tried command line CLASS, without the Python wrapper)
classy 3.0.1

In a discussion with Lukas, we pinned down the following:

The issue was coming from the initial run of the 'shooting' method in CLASS, when it tries to determine H_0 from 100*theta_s.
During shooting, some background evolutions with ndf15 are ran. For these parameters, we get a shrinking H, and it trickles down from some enormous value all the way to nearly 0.
On Lukas' system, this initial background evolution fails to stop in time and produces NaNs for all background variables. On my system(s), it doesn't, although I'm doubtful the H I'm getting is correct.
For Lukas, the shooting produces a shooting failed error message, but the error is delayed until later:

/ If shooting fails, postpone error to background module to play nice with MontePython. /
The shooting failed error message is only visible with my fix implemented, otherwise we get a segmentation fault.
We tried various compiler flag options, most notably we switched off optimization flags, but Lukas still gets the error.

I'm going to look into

why the background evolution during shooting fails to terminate properly for him (although this is hard, as I can't reproduce the issue),
why the initial H CLASS starts shooting from is unnecessarily large.

But this discussion makes me more certain that the patch will not rule out valid points in parameter space - we definitely want to avoid that happening.

pstoecker commented 3 years ago

OK, I see.

I think there are two issues to fix here. The first thing is that ndf15 has to be fixed such that it throws an error whenever the derivatives are ill-defined. This should be independent of the question whether the parameter is actually fine and just due to some weird (system-dependent) computational issue the problem arises. Therefore it would be good if we could try your fix from GambitBSM/gambit#278 here

The other thing is the particular issue that is happening here. Also on my system (ubuntu 18.04, gcc 7.5) this particular parameter point does not crash. I haven't checked the results though but I guess that I will see a similar doubtful value of H. Given your explanation, I agree that this needs some investigation.

fruzsinaagocs commented 3 years ago

Ah, I should've been more clear - we tried the fix from https://github.com/GambitBSM/gambit/issues/278 and it successfully avoids the seg fault, so CLASS fails in a more elegant way and doesn't disrupt the MCMC chains.

lukashergt commented 3 years ago

Hi @fruzsinaagocs and @pstoecker, thanks for both your inputs that was very helpful. The ndf15-fix definitely proved useful in some cases.

I found another issue, though, that relates to this particular case and to curvature runs in general. The following lines in source/background.c have been commented out:

  /** - control that cosmological parameter values make sense, otherwise inform user */

  /* H0 in Mpc^{-1} */
  /* Many users asked for this test to be supressed. It is commented out. */
  /*class_test((pba->H0 < _H0_SMALL_)||(pba->H0 > _H0_BIG_),
             pba->error_message,
             "H0=%g out of bounds (%g<H0<%g) \n",pba->H0,_H0_SMALL_,_H0_BIG_);*/

This test is important to filter out cases such as collapsing universes. I don't know why "many users asked for this test to be suppressed", but there are definitely situations where these are relevant. Maybe an approach similar to CAMB would be useful, where you are allowed to set a viable H0 range, maybe with the additional option of leaving it unspecified?

lesgourg / class_public

parameter set causing "Segmentation fault" in python wrapper #419