Open ev-br opened 9 years ago
This would be wonderful. I, at least, found this to be the single biggest impediment to using scipy.stats
-- trying to relate the 'standard forms' one expects to the way they are actually written.
Conversion code would also be great -- but I suspect that means somebody would just have to go in and add it by hand, since doing so in any automated way seems difficult. But I would be happy to volunteer some effort along those lines.
I don't think there are so many distributions that have a non-standard parameterization. One problem is also that some distributions have several "standard" parameterizations (at least in some other statistics packages).
The point of the re-parameterization functions would be to get away from the strict loc-scale generic code, so it would have to be distribution specific, and will have to be added on a case by case basis.
One more problem is if the reduced number of parameter version doesn't map nicely to the current structure, for example distribution that have a common 2-, 3- and/or 4 parameter version.
stats.lognorm(stats.lognorm._transform_in(*my_params)).pdf(x)
stats.lognorm._transform_out(*stats.lognorm.fit(x))
stats.lognorm(stats.lognorm._transform_in(*my_params, kind='parameterization_3_of_xxx')).pdf(x)
I guess that depends where your threshold for 'non-standard' is. I will grant there are not that many where the scale/loc parameterization is so different that the reader is left scratching their head (and I would add exponnorm
to that list, assuming #4533 gets accepted). But there are a lot of lesser differences that can still throw people.
For example, most people probably expect the parameter of the expon
distribution to be lambda
, rather than scale = 1 / lambda
. And it would be nice -- if perhaps not quite as critical -- to support that as well.
My 2p: looking at Josef's examples I'm sure I personally would find it easier to type up the three-line conversion function myself rather than trying to look up the exact definition of these helpers, esp with keyword parameters. Easier to work out two short formulas than the same formulas and obscure helpers with obscure options which need to be checked anyway.
For me the most helpful would really be a line in the docstring, saying literally (taking exponnorm as a recent example):
""" Note: To get the parametrization in terms of $\lambda$ and $\sigma$, as used in [wikipedia link] use
$$ loc = \mu + \lambda * s*2 scale = s \xi = \sigma \ \lambda $$ """ (OK, not a line; four of those)
As far as scale
for expon
and similar are concerned, I think it would be best to somehow more prominently feature what these loc and scale actually are. An attempt is made in gh-4499 (now merged and the result is even available in the dev version docs online thanks to Pauli). Does it make it clearer?
One other thing I guess would be good to have is to somehow link the stats tutorial from the reference docs. I definitely remember things got much clearer when somebody pointed me to the tutorial.
you need the conversion in both direction How do I get (mu, sigma, lambda) from the return of exponnorm.fit ? Do I have to write unit tests each time I use this to make sure I didn't make a typo?
What's the variance of exp(y) with y ~ N(mu, sigma2)?
lognorm.stats(???)
I looked at this several times for various questions, and I still don't remember, and have to read the docstring and try out a few examples each time.
FWIW, I wouldn't be negative if someone wants to add these sorts of method. As long as these methods are documented, which covers my base :-)
the newest installment: http://stackoverflow.com/questions/28700694/log-normal-random-variables-with-scipy especially last revision http://stackoverflow.com/posts/28701211/revisions "now that I know how it behaves, I realize that be behavior in principle is documented."
Maybe all formulas in the distribution docstrings need to be written in terms of $y$, and a definition $y = \frac{x - \mathrm{loc}}{\mathrm{scale}}$ is added beside the main formula?
gh-4559
The example starts to approach a "too long to read" state though.
I think adding the definition of the loc,scale transformation next to the pdf formula in all docstrings is probably the way that removes all ambiguity, and makes it easier to see what the mapping to "standard" forms is.
If we are making a list, Wikipedia gives 3 parameterizations for the gamma distribution, and the PR https://github.com/scipy/scipy/pull/4605 seemed to have a request for a parameterization for the rice distribution that would be somewhat analogous to the last of the three gamma distribution parameterizations.
I've edited rice into the list.
adding link here https://github.com/phobson/paramnormal
paramnormal
is extremely cool! I wonder why it's not on pypi and in the topical sodtware list. Also mentions in the stats tutorial / individual docstrings of distributions which are available in paramnormal could be nice to add.
@phobson
I don't remember why I found it, or whether Paul had mentioned it somewhere.
@ev-br it's not on PyPI b/c I haven't had anyone really review it for correctness and I'm lazy. Glad you think it might be useful though.
@ev-br @josef-pkt That said, if you think it should be on pypi, create an issue for that on the repo and I'll upload it.
I agree with ev-br about the problem. However, I think we need more than a documentation change. I think that the distribution should be using the most standard parametrizations. The user should not be forced to write every distribution as a localtion and scale family distribution, especially not for distributions for which that would be awkward (at best).
See #6038 for my proposal to make location, scale and truncation adaptors. I propose writing the distributions in the usual way and then using adaptors when an alternate form is desired.
Regarding the lognormal, I asked this on a numpy issue, but would there be an interest to have a multivariate lognormal that is parameterized by the mean and covariance matrix of the lognormal rather than the underlying normal ala https://gist.github.com/dmadeka/5ee0c54ba5184235830594d769d9520b
scipy.stats defines univariate distributions in a standard form, with location and scale parameters. For some distributions, these do not agree with well with "common" parametrizations, which leads to confusion on various levels.
It might be good to at least identify the most problematic cases and add some text to their docstrings. Possibly there might be a way of adding some actual conversion code(?)
(cross-ref: https://github.com/scipy/scipy/pull/4533#issuecomment-75260691)
Likely suspects: