I have failure data that follows a Tweedie distribution. I wanted to model it with a lognormal distribution in ngboost, since the number of zeros is low. However, as currently implemented I get the following error:
File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/ngboost.py", line 276, in fit
self.fit_init_params_to_marginal(Y)
File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/ngboost.py", line 121, in fit_init_params_to_marginal
self.init_params = self.Manifold.fit(
File "/Users/rhys/opt/miniconda3/envs/py310/lib/python3.10/site-packages/ngboost/distns/lognormal.py", line 124, in fit
m, s = sp.stats.norm.fit(np.log(Y))
File "/Users/rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 63, in wrapper
return fun(self, *args, **kwds)
File "/Users/rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 364, in fit
raise RuntimeError("The data contains non-finite values.")
RuntimeError: The data contains non-finite values.
If I add a small positive amount of noise to the Y labels then the model trains. This is probably the right fix in my case, but I wanted to flag the underlying issue explicitly.
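A minimal sketch of that workaround (the helper name and epsilon value are illustrative, not part of ngboost's API): replace non-positive labels with a tiny positive constant so `np.log(Y)` stays finite before fitting.

```python
import numpy as np

def jitter_zeros(y, eps=1e-6):
    """Replace zero (or negative) labels with a small positive epsilon
    so that log-transforming the labels produces only finite values."""
    y = np.asarray(y, dtype=float).copy()
    y[y <= 0] = eps
    return y

y = np.array([0.0, 1.3, 2.7, 0.0, 5.1])
y_safe = jitter_zeros(y)
# All log-transformed labels are now finite, so norm.fit will not raise.
assert np.all(np.isfinite(np.log(y_safe)))
```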
>>> a = np.exp(np.random.randn(1000000))
>>> scipy.stats.lognorm.fit(a, floc=0, method="MM")
(1.0019818648205723, 0, 0.9985804299185244)
>>> scipy.stats.norm.fit(np.log(a))
(0.0007051204420198203, 0.9990557672515611)
>>> a[0]=0
>>> scipy.stats.lognorm.fit(a, floc=0, method="MM")
(1.0019818648205723, 0, 0.9985805077931086)
>>> scipy.stats.norm.fit(np.log(a))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/chemix-rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 63, in wrapper
return fun(self, *args, **kwds)
File "/Users/chemix-rhys/.local/lib/python3.10/site-packages/scipy/stats/_continuous_distns.py", line 364, in fit
raise RuntimeError("The data contains non-finite values.")
RuntimeError: The data contains non-finite values.
FYI: norm.fit returns (mu, sigma), while lognorm.fit returns (s, loc, scale), where s = sigma and scale = exp(mu). The MM fit has to be used rather than the MLE fit, because MLE also imposes an a > 0 requirement on the data. Since there appears to be a speed penalty attached to the MM fit, perhaps it could be used only when Y contains a 0?
(base) ➜ ~ python -mtimeit -s'import numpy as np; import scipy.stats; np.seed=0; a=np.exp(np.random.randn(10000))' 'scipy.stats.norm.fit(np.log(a))'
5000 loops, best of 5: 89.7 usec per loop
(base) ➜ ~ python -mtimeit -s'import numpy as np; import scipy.stats; np.seed=0; a=np.exp(np.random.randn(10000))' 'scipy.stats.lognorm.fit(a, floc=0, method="MM")'
50 loops, best of 5: 5.14 msec per loop
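The conditional fallback suggested above could be sketched like this (a hypothetical helper, not the actual ngboost implementation): take the fast norm.fit path when all labels are positive, and fall back to the slower zero-tolerant lognorm MM fit otherwise, recovering mu via log(scale).

```python
import numpy as np
import scipy.stats

def fit_lognormal_params(Y):
    """Fit lognormal parameters (mu, sigma) of Y.

    Fast path: MLE via norm.fit on log(Y) when all labels are positive.
    Fallback: method-of-moments lognorm fit with floc=0, which tolerates
    zeros in Y but is roughly an order of magnitude slower.
    """
    Y = np.asarray(Y, dtype=float)
    if np.all(Y > 0):
        m, s = scipy.stats.norm.fit(np.log(Y))
    else:
        s, _, scale = scipy.stats.lognorm.fit(Y, floc=0, method="MM")
        m = np.log(scale)  # lognorm's scale = exp(mu)
    return m, s
```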