scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License
12.91k stars 5.14k forks source link

alternative parametrizations of univariate distributions #4538

Open ev-br opened 9 years ago

ev-br commented 9 years ago

scipy.stats defines univariate distributions in a standard form, with location and scale parameters. For some distributions, these do not agree with well with "common" parametrizations, which leads to confusion on various levels.

It might be good to at least identify the most problematic cases and add some text to their docstrings. Possibly there might be a way of adding some actual conversion code(?)

(cross-ref: https://github.com/scipy/scipy/pull/4533#issuecomment-75260691)

Likely suspects:

aconley commented 9 years ago

This would be wonderful. I, at least, found this to be the single biggest impediment to using scipy.stats -- trying to relate the 'standard forms' one expects to the way they are actually written.

Conversion code would also be great -- but I suspect that means somebody would just have to go in and add it by hand, since doing so in any automated way seems difficult. But I would be happy to volunteer some effort along those lines.

josef-pkt commented 9 years ago

I don't think there are so many distributions that have a non-standard parameterization. One problem is also that some distributions have several "standard" parameterizations (at least in some other statistics packages).

The point of the re-parameterization functions would be to get away from the strict loc-scale generic code, so it would have to be distribution specific, and will have to be added on a case by case basis.

One more problem is if the reduced number of parameter version doesn't map nicely to the current structure, for example distribution that have a common 2-, 3- and/or 4 parameter version.

stats.lognorm(stats.lognorm._transform_in(*my_params)).pdf(x) stats.lognorm._transform_out(*stats.lognorm.fit(x))

stats.lognorm(stats.lognorm._transform_in(*my_params, kind='parameterization_3_of_xxx')).pdf(x)

aconley commented 9 years ago

I guess that depends where your threshold for 'non-standard' is. I will grant there are not that many where the scale/loc parameterization is so different that the reader is left scratching their head (and I would add exponnorm to that list, assuming #4533 gets accepted). But there are a lot of lesser differences that can still throw people.

For example, most people probably expect the parameter of the expon distribution to be lambda, rather than scale = 1 / lambda. And it would be nice -- if perhaps not quite as critical -- to support that as well.

ev-br commented 9 years ago

My 2p: looking at Josef's examples I'm sure I personally would find it easier to type up the three-line conversion function myself rather than trying to look up the exact definition of these helpers, esp with keyword parameters. Easier to work out two short formulas than the same formulas and obscure helpers with obscure options which need to be checked anyway.

For me the most helpful would really be a line in the docstring, saying literally (taking exponnorm as a recent example):

""" Note: To get the parametrization in terms of $\lambda$ and $\sigma$, as used in [wikipedia link] use

$$ loc = \mu + \lambda * s*2 scale = s \xi = \sigma \ \lambda $$ """ (OK, not a line; four of those)

As far as scale for expon and similar are concerned, I think it would be best to somehow more prominently feature what these loc and scale actually are. An attempt is made in gh-4499 (now merged and the result is even available in the dev version docs online thanks to Pauli). Does it make it clearer? One other thing I guess would be good to have is to somehow link the stats tutorial from the reference docs. I definitely remember things got much clearer when somebody pointed me to the tutorial.

josef-pkt commented 9 years ago

you need the conversion in both direction How do I get (mu, sigma, lambda) from the return of exponnorm.fit ? Do I have to write unit tests each time I use this to make sure I didn't make a typo?

What's the variance of exp(y) with y ~ N(mu, sigma2)? lognorm.stats(???) I looked at this several times for various questions, and I still don't remember, and have to read the docstring and try out a few examples each time.

ev-br commented 9 years ago

FWIW, I wouldn't be negative if someone wants to add these sorts of method. As long as these methods are documented, which covers my base :-)

josef-pkt commented 9 years ago

the newest installment: http://stackoverflow.com/questions/28700694/log-normal-random-variables-with-scipy especially last revision http://stackoverflow.com/posts/28701211/revisions "now that I know how it behaves, I realize that be behavior in principle is documented."

pv commented 9 years ago

Maybe all formulas in the distribution docstrings need to be written in terms of $y$, and a definition $y = \frac{x - \mathrm{loc}}{\mathrm{scale}}$ is added beside the main formula?

ev-br commented 9 years ago

gh-4559

The example starts to approach a "too long to read" state though.

pv commented 9 years ago

I think adding the definition of the loc,scale transformation next to the pdf formula in all docstrings is probably the way that removes all ambiguity, and makes it easier to see what the mapping to "standard" forms is.

argriffing commented 9 years ago

If we are making a list, Wikipedia gives 3 parameterizations for the gamma distribution, and the PR https://github.com/scipy/scipy/pull/4605 seemed to have a request for a parameterization for the rice distribution that would be somewhat analogous to the last of the three gamma distribution parameterizations.

ev-br commented 9 years ago

I've edited rice into the list.

josef-pkt commented 8 years ago

adding link here https://github.com/phobson/paramnormal

ev-br commented 8 years ago

paramnormal is extremely cool! I wonder why it's not on pypi and in the topical sodtware list. Also mentions in the stats tutorial / individual docstrings of distributions which are available in paramnormal could be nice to add.

josef-pkt commented 8 years ago

@phobson

I don't remember why I found it, or whether Paul had mentioned it somewhere.

phobson commented 8 years ago

@ev-br it's not on PyPI b/c I haven't had anyone really review it for correctness and I'm lazy. Glad you think it might be useful though.

phobson commented 8 years ago

@ev-br @josef-pkt That said, if you think it should be on pypi, create an issue for that on the repo and I'll upload it.

NeilGirdhar commented 8 years ago

I agree with ev-br about the problem. However, I think we need more than a documentation change. I think that the distribution should be using the most standard parametrizations. The user should not be forced to write every distribution as a localtion and scale family distribution, especially not for distributions for which that would be awkward (at best).

See #6038 for my proposal to make location, scale and truncation adaptors. I propose writing the distributions in the usual way and then using adaptors when an alternate form is desired.

dmadeka commented 7 years ago

Regarding the lognormal, I asked this on a numpy issue, but would there be an interest to have a multivariate lognormal that is parameterized by the mean and covariance matrix of the lognormal rather than the underlying normal ala https://gist.github.com/dmadeka/5ee0c54ba5184235830594d769d9520b