screen NaN inputs?

jonathanstrong commented 5 years ago

Summary:

The error message produced when the input data contain NaN is confusing. Perhaps optimizing and sampling should screen the items in the data dictionary for NaN.

Description:

Brand new to pystan, spent a few minutes struggling with this error, which occurred because I was accidentally passing input data with NaN values in it:

RuntimeError                              Traceback (most recent call last)
<ipython-input-40-e1350085e5e3> in <module>
      5 
      6 fit = (model.sampling(data=dict(y=scaler.transform(df[['y']].values)[:,0], X=noise(scaler.transform(df[['y']].values)), 
----> 7                                      N=len(df), K=1), iter=10, verbose=True)) #, init=lambda : {'beta': np.random.random(1) * 5, 'alpha': 1.0, 'sigma': 1.0}))

~/src/envs/gnn/lib/python3.7/site-packages/pystan/model.py in sampling(self, data, pars, chains, iter, warmup, thin, seed, init, sample_file, diagnostic_file, verbose, algorithm, control, n_jobs, **kwargs)
    776         call_sampler_args = izip(itertools.repeat(data), args_list, itertools.repeat(pars))
    777         call_sampler_star = self.module._call_sampler_star
--> 778         ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs)
    779         samples = [smpl for _, smpl in ret_and_samples]
    780 

~/src/envs/gnn/lib/python3.7/site-packages/pystan/model.py in _map_parallel(function, args, n_jobs)
     83         try:
     84             pool = multiprocessing.Pool(processes=n_jobs)
---> 85             map_result = pool.map(function, args)
     86         finally:
     87             pool.close()

/usr/lib/python3.7/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    269 
    270     def starmap(self, func, iterable, chunksize=None):

/usr/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

RuntimeError: Initialization failed.

Since I was working in a Jupyter notebook, the C++ stderr output was not visible. Eventually I noticed it included a message about NaN:

Rejecting initial value:         
  Error evaluating the log probability at the initial value.                                                                          
Exception: normal_lpdf: Random variable[11690] is nan, but must not be nan!  (in 'unknown file name' at line 23)

Reproducible Steps:

The model I was running is the linear regression example from the Stan manual, nothing exotic. I simply passed data with NaN values in the data dictionary to optimizing and sampling.

Current Output:

Error messages quoted above.

Expected Output:

I would have appreciated a clearer error message. It might be worth considering screening inputs before passing them to C++.

PyStan Version:

2.19.0.0

Python Version:

3.7.4

Operating System:

ubuntu 18.04

ahartikainen commented 5 years ago

~Did you have a nan in your data?~

Yeah, I think we could add nan checks.

bob-carpenter commented 5 years ago

There's nothing intrinsically wrong with NaN in inputs.

The error is arising when the normal log probability density function is being evaluated. The normal distribution throws an exception when it gets out-of-domain arguments, which is where the error message is coming from. The complication is that it doesn't know where in the Stan program the error arose---the error comes from the C++ code in the normal distribution. It would be much clearer if we were able to say found y to be NaN in y ~ normal(...) rather than just reporting about the "variate" (the variate is the outcome y in normal_lpdf(y | mu, sigma)). At least it's better than it used to be in that it's telling you which entry in the vector is the problem.
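To illustrate why the message can only name the argument and an index, here is a hypothetical pure-Python analogue of a domain-checked normal log density. The function body, the 0-based index, and the exception type are assumptions made for illustration; this is not Stan's C++ implementation.

```python
import math


def normal_lpdf(y, mu, sigma):
    """Sum of normal log densities over y, with a domain check on each entry.

    Hypothetical Python analogue for illustration; not Stan's implementation.
    """
    total = 0.0
    for i, y_i in enumerate(y):
        # The function sees only its own arguments, so the best it can do is
        # name the argument's role and the (0-based) position of the bad entry.
        # It has no idea which statement in the calling program invoked it.
        if math.isnan(y_i):
            raise ValueError(
                f"normal_lpdf: Random variable[{i}] is nan, but must not be nan!"
            )
        z = (y_i - mu) / sigma
        total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return total
```

Calling `normal_lpdf([1.0, float("nan")], 0.0, 1.0)` raises an error that points at entry 1 of the variate but says nothing about which variable in the caller's program supplied it.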

It probably wouldn't be terrible to forbid NaN inputs, but that's not how CmdStan is going to operate, so it would be a PyStan-specific decision and may cause models that work in CmdStan, RStan, etc., to fail in PyStan.

jstrong-tios commented 5 years ago

I put it as a question because I wasn't sure if there were cases where NaN inputs to Stan are used (intentionally). One option would be for PyStan to warn whenever there are NaN values present. Another would be to catch any exceptions from the C++ code and then check for NaN and include NaN-related warnings in the exception output. It seems unlikely that improving the C++ error message is the easiest route.
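A minimal sketch of both options, written as user-side helpers rather than as PyStan code. The names `find_nan_entries`, `warn_on_nan`, and `call_with_nan_hint` are hypothetical; nothing here is part of PyStan's API. It uses only the standard library and relies on the fact that NaN is the only value not equal to itself. For large NumPy arrays, `np.isnan(arr).any()` would be a faster check.

```python
import warnings
from numbers import Real


def find_nan_entries(data):
    """Return the names of data-dict entries that contain at least one NaN."""

    def has_nan(value):
        if isinstance(value, Real):
            return bool(value != value)  # True only for NaN
        if isinstance(value, (str, bytes)):
            return False
        try:
            return any(has_nan(v) for v in value)
        except TypeError:  # not iterable, so nothing to inspect
            return False

    return [name for name, value in data.items() if has_nan(value)]


def warn_on_nan(data):
    """Option 1: warn up front if any data entry contains NaN."""
    bad = find_nan_entries(data)
    if bad:
        warnings.warn(f"data entries contain NaN: {', '.join(bad)}")
    return bad


def call_with_nan_hint(fit_fn, data, **kwargs):
    """Option 2: run fit_fn; on RuntimeError, append a hint about NaN inputs."""
    try:
        return fit_fn(data=data, **kwargs)
    except RuntimeError as err:
        bad = find_nan_entries(data)
        if bad:
            raise RuntimeError(
                f"{err} (note: data entries contain NaN: {', '.join(bad)})"
            ) from err
        raise
```

With a compiled model in hand, the second helper could be used as `call_with_nan_hint(model.sampling, data, iter=10)`, so that "Initialization failed." arrives together with the names of the suspicious entries.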

bob-carpenter commented 5 years ago

Thanks---those both sound like promising approaches.

The input warning would be more robust, because NaN errors can pop up other than from NaN input (e.g., 0 / 0 or inf - inf).
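A small standard-library illustration of that point: the inputs below contain no NaN, so an input screen would pass, yet ordinary arithmetic on them still produces NaN. The helper name `nan_free` is made up for this example.

```python
import math


def nan_free(values):
    """True if no entry of values is NaN (the kind of screen discussed here)."""
    return not any(math.isnan(v) for v in values)


inf = float("inf")
x = [0.0, inf, inf]

# The inputs pass a NaN screen: infinities and zeros are not NaN.
print("inputs NaN-free:", nan_free(x))

# Under IEEE 754 arithmetic these operations are undefined and yield NaN.
results = {
    "inf - inf": x[1] - x[2],
    "0 * inf": x[0] * x[1],
}
for expr, value in results.items():
    print(expr, "is NaN:", math.isnan(value))

# Note: Python raises ZeroDivisionError for 0.0 / 0.0, whereas IEEE 754
# floating-point division as typically used by C++ code returns NaN.
```

So a NaN message from the sampler does not by itself prove that the data dictionary contained NaN.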

You're right that doing this from the C++ side would be nearly impossible---we'd need to thread calling information through to every function.

stan-dev / pystan2

screen NaN inputs? #669