sktime / skpro

A unified framework for tabular probabilistic regression, time-to-event prediction, and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
247 stars 45 forks source link

[ENH] roadmap of probability distributions to implement #22

Open fkiraly opened 1 year ago

fkiraly commented 1 year ago

It would be great to have a basic set of probability distributions implemented.

Umbrella issue for implementing sktime probability distributions.

Recipe: use the extension_templates/distribution.py extension template. Examples:

High priority:

mid priority:

low priority:

lower priority:

list of many more (lowest priority) https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions - can be interfaced via _ScipyDist adapter easily! https://en.wikipedia.org/wiki/File:ProbOnto2.5.jpg

Mirrors https://github.com/sktime/sktime/issues/4518 (for high and mid priority)

Contributions can be made to either repository, and should be copied over to the other once approved/merged, until the modules are merged into one.

bhavikar04 commented 8 months ago

Hi, I'm interested in taking this up. Would you say priority of the distributions aligns with the difficulty in implementation? I'd like to do either multivariate normal or uniform continuous.

fkiraly commented 8 months ago

Hmmm, I'd say it is currently actually the opposite way. That is, the remaining low priority ones are easier to get started with, than the remaining high priority ones - simply since the easy higher priority ones are already done.

So, uniform continuous then? Parameterized by lower and upper.

I don't have a reference for energy and squared norm integrals, but these should not be too difficult to obtain. Let me know if you need input there, we can always start with the more common methods.

an20805 commented 8 months ago

Hey @fkiraly, I have implemented uniform continuous distribution in my local branch. How do I proceed further? I would also love to implement other distributions.

fkiraly commented 7 months ago

@an20805, nice! Let's not duplicate then, @bhavikar04 - how about beta?

The next step would be making a pull request to this repository, and a review cycle, then merge.

fkiraly commented 7 months ago

Re energy, for $X, Y\sim Unif(a, b)$, I get:

$\mathbb{E}[|X - y|] = |y - \frac{b+a}{2} |$ if $y$ lies outside $[a, b]$, and $\mathbb{E}[|X - y|] = \frac{(b-y)^2}{2(b-a)}+ \frac{(a-y)^2}{2(b-a)}$ if inside,

and

$\mathbb{E}[|X - Y|] = \frac{1}{3} (b-a)$ - double checking appreciated.

bhavikar04 commented 7 months ago

In that case I'll take up log normal distribution then.

fkiraly commented 7 months ago

Pinging @Alex-JG3 and @ivarzap who most recently implemented distributions, in case you have any general starter advice.

bhavikar04 commented 7 months ago

Hey,

So I'm a little unsure on what the energy will be for the log normal distribution and can't find much online, is there any literature you can point me to?

fkiraly commented 7 months ago

@bhavikar04, Appendix A.2 of "evaluating forecasts with scoringRules" has a few explicit formulae for the energy, including the log-normal distribution. The expression is hard to track in implementation, so I would advise comparing against the Monte-Carlo default if you implement it.

I would also suggest you try it on paper, there's a good chance of errors in rare calculations like these. Further, Wolfram Alpha might also help. Whereas, ChatGPT and the like typically produce plausible garbage.

bhavikar04 commented 7 months ago

Hey thank you so much, I'll try to chalk out a suitable implementation soon. ChatGPT was humble enough to admit it doesn't know enough ;)

fkiraly commented 7 months ago

Yes, I admit I also tried as computing integrals can get tiring: https://xkcd.com/2117/ Wolfram is not bad, it makes sense to double check though. As said, there is a default Monte Carlo implementation, so if you set the number of samples high, the matrices should be similar.

sukjingitsit commented 7 months ago

I would like to work on implementing the chi-square distribution. To confirm, we have to follow the template of Laplace and Normal, where we implement the mean, var, pdf, cdf, logpdf and ppf alongside the energy, right? To characterise chi-square, I assume, as standard practice, we will use the degrees of freedom, right?

sukjingitsit commented 7 months ago

The current implementation of ppf wraps a scipy function directly due to lack of a closed mathematical form. Similarly, while cross-energy can be mathematically derived, self-energy is difficult to solve (nor could I find literature on it) in a closed form, the best options for that is integration or sampling. Thus, energy hasn't been implemented yet

fkiraly commented 7 months ago

To confirm, we have to follow the template of Laplace and Normal, where we implement the mean, var, pdf, cdf, logpdf and ppf alongside the energy, right?

Yes, that's what I would recommend.

fkiraly commented 7 months ago

Similarly, while cross-energy can be mathematically derived, self-energy is difficult to solve (nor could I find literature on it) in a closed form, the best options for that is integration or sampling. Thus, energy hasn't been implemented yet

Hm, interesting. You should put your findings in the docstring of the function, you can write TeX with the :math: directive in rst. We could return the exact formula for the cross-term, and an approximation for the self-term.

In this case, a more efficient sampling strategy would be available, as one can sample the constant in the cross-term from chi-squared, to obtain a Monte Carlo estimate for the self-term.

What kind of integral do you get if you first integrate out one variable? That's the cross-term, which you seem to imply has a closed form.

fkiraly commented 7 months ago

@an20805, would you be so kind to open a PR with your partial work on uniform distribution, even if not finished? Would be a shame if it got lost.

fkiraly commented 7 months ago

Also, in general, perhaps it helps - there is a relation between energy formulae and truncated distributions.

Let $d$ be a distribution supported on the reals, and denote by $\mbox{trunc}(d, a, b)$ the distribution supported on the interval $[a, b]$, which is $d$ conditioned on the value being in the interval $[a, b]$.

In general, we have $d.\mbox{energy}(c) = \left(\mbox{trunc}(d, c, \infty).\mbox{mean} - c\right)\cdot (1- d.\mbox{cdf}(c)) - \left(c - \mbox{trunc}(d, -\infty, c).\mbox{mean}\right) \cdot d.\mbox{cdf}(c)$.

So, for instance, if you know cdf of $d$, and the means of truncated versions of $d$, you get the cross-term in the energy already.

For the self-term, you could use $d.\mbox{energy}() = \mathbb{E}\left[ d.\mbox{energy}(X)\right]$, where $X$ is distributed according to $d$. This may give an easier to tackle integral when using the truncated $d$.

FYI @sukjingitsit, have you checked literature on means or moments of truncated chi-squared distributions? As said, no need to, if it looks like a too deep rabbit hole.

sukjingitsit commented 7 months ago

I will look over this once for the self-case

fkiraly commented 7 months ago

To characterise chi-square, I assume, as standard practice, we will use the degrees of freedom, right?

Just saw this, overlooked this at first. Yes, that's how I would parameterize it.

an20805 commented 7 months ago

@an20805, would you be so kind to open a PR with your partial work on uniform distribution, even if not finished? Would be a shame if it got lost.

Hello @fkiraly, I am sorry for delaying the PR this much. I had an accident and wasn't able to work for more than a week. I am all good now and got back yesterday. I have opened a PR https://github.com/sktime/skpro/pull/223. I am having some issues, would like to get some help.

fkiraly commented 7 months ago

sorry to hear, @an20805. Good to hear you are back! I'll reply on #223.

malikrafsan commented 7 months ago

Hi @fkiraly đŸ‘‹ I want to start contributing here, I noticed that some of the unchecked tasks are already being assigned/reviewed. Do you have any recommendations on what distributions should I try to implement? I would be more than happy to contribute to this project, Thank you!

fkiraly commented 7 months ago

@malikrafsan, welcome!

The ones which don't have anyone talking about are still available. The ProbOnto link has more.

We could also look into integer valued, e.g., Binomial, Poisson - these are common for GLM.

malikrafsan commented 7 months ago

Hi @fkiraly I am very interested in implementing Logistic/Weibull Distribution. However, I failed to find the formula for the energy of those two distributions. Can you help me with this? Or is it better if I raise PRs first? Thank you!

fkiraly commented 7 months ago

However, I failed to find the formula for the energy of those two distributions. Can you help me with this? Or is it better if I raise PRs first?

There is an approximative (Monte Carlo) default if this is not easy to obtain - in any case it can be done later (or not at all).

Thanks for contributing!

malikrafsan commented 7 months ago

Ahh, I see, thank you so much for your guidance @fkiraly ! Then my PRs are ready to be reviewed. I would very much love to hear your feedback. However, if you have any reference on energy formula of those two distribution, then I would still very like to implement it, thank you so much!

fkiraly commented 7 months ago

if you have any reference on energy formula of those two distributiom

have you checked in the paper above? If not there, one would have to derive it.

malikrafsan commented 7 months ago

Do you mean this paper?

Appendix A.2 of "evaluating forecasts with scoringRules" has a few explicit formulae for the energy, including the log-normal distribution.

Yes, I have checked the paper but I cannot find the formula. I can only find CRPS and CDF formulas. Does CRPS mean the energy? If so, I think I misunderstood your previous statements

fkiraly commented 7 months ago

Yes, CRPS is closely releated, it is the cross-term minus half the self-term (compare definitions).

The unfortunate bit about the paper is that it only gives CRPS, but not the self-term or cross-term in isolation. However, it should not be too hard to back these out, using that shifting the distribution location by a constant leaves the self-term unchanged, but not the cross-term.

fkiraly commented 7 months ago

More precisely, a useful formula to use is

$$\lim_{y \rightarrow \infty} \mbox{CRPS}(y) - y = -\mathbb{E}[X] - \frac{1}{2}\mathbb{E}|X - X'|,$$

i.e., you can obtain the cross-term via taking a limit, if you know the expressions for CRPS and the expectation already.

(the equation follows from observing that the absolute value in $\mathbb{E}\left|y - X \right|$ disappears in the limit)

sukjingitsit commented 4 months ago

I would like to work on the Pareto distribution if possible

fkiraly commented 4 months ago

all yours, @sukjingitsit!