Open fkiraly opened 1 year ago
Hi, I'm interested in taking this up. Would you say priority of the distributions aligns with the difficulty in implementation? I'd like to do either multivariate normal or uniform continuous.
Hmmm, I'd say it is currently actually the opposite way. That is, the remaining low priority ones are easier to get started with, than the remaining high priority ones - simply since the easy higher priority ones are already done.
So, uniform continuous then? Parameterized by lower and upper.
I don't have a reference for energy and squared norm integrals, but these should not be too difficult to obtain. Let me know if you need input there, we can always start with the more common methods.
Hey @fkiraly, I have implemented uniform continuous distribution in my local branch. How do I proceed further? I would also love to implement other distributions.
@an20805, nice! Let's not duplicate then, @bhavikar04 - how about beta?
The next step would be making a pull request to this repository, and a review cycle, then merge.
Re energy, for $X, Y\sim Unif(a, b)$, I get:
$\mathbb{E}[|X - y|] = |y - \frac{b+a}{2} |$ if $y$ lies outside $[a, b]$, and $\mathbb{E}[|X - y|] = \frac{(b-y)^2}{2(b-a)}+ \frac{(a-y)^2}{2(b-a)}$ if inside,
and
$\mathbb{E}[|X - Y|] = \frac{1}{3} (b-a)$ - double checking appreciated.
In that case I'll take up log normal distribution then.
Pinging @Alex-JG3 and @ivarzap who most recently implemented distributions, in case you have any general starter advice.
Hey,
So I'm a little unsure on what the energy will be for the log normal distribution and can't find much online, is there any literature you can point me to?
@bhavikar04, Appendix A.2 of "evaluating forecasts with scoringRules
" has a few explicit formulae for the energy, including the log-normal distribution. The expression is hard to track in implementation, so I would advise comparing against the Monte-Carlo default if you implement it.
I would also suggest you try it on paper, there's a good chance of errors in rare calculations like these. Further, Wolfram Alpha might also help. Whereas, ChatGPT and the like typically produce plausible garbage.
Hey thank you so much, I'll try to chalk out a suitable implementation soon. ChatGPT was humble enough to admit it doesn't know enough ;)
Yes, I admit I also tried as computing integrals can get tiring: https://xkcd.com/2117/ Wolfram is not bad, it makes sense to double check though. As said, there is a default Monte Carlo implementation, so if you set the number of samples high, the matrices should be similar.
I would like to work on implementing the chi-square distribution. To confirm, we have to follow the template of Laplace and Normal, where we implement the mean, var, pdf, cdf, logpdf and ppf alongside the energy, right? To characterise chi-square, I assume, as standard practice, we will use the degrees of freedom, right?
The current implementation of ppf wraps a scipy function directly due to lack of a closed mathematical form. Similarly, while cross-energy can be mathematically derived, self-energy is difficult to solve (nor could I find literature on it) in a closed form, the best options for that is integration or sampling. Thus, energy hasn't been implemented yet
To confirm, we have to follow the template of Laplace and Normal, where we implement the mean, var, pdf, cdf, logpdf and ppf alongside the energy, right?
Yes, that's what I would recommend.
Similarly, while cross-energy can be mathematically derived, self-energy is difficult to solve (nor could I find literature on it) in a closed form, the best options for that is integration or sampling. Thus, energy hasn't been implemented yet
Hm, interesting. You should put your findings in the docstring of the function, you can write TeX with the :math:
directive in rst
.
We could return the exact formula for the cross-term, and an approximation for the self-term.
In this case, a more efficient sampling strategy would be available, as one can sample the constant in the cross-term from chi-squared, to obtain a Monte Carlo estimate for the self-term.
What kind of integral do you get if you first integrate out one variable? That's the cross-term, which you seem to imply has a closed form.
@an20805, would you be so kind to open a PR with your partial work on uniform distribution, even if not finished? Would be a shame if it got lost.
Also, in general, perhaps it helps - there is a relation between energy formulae and truncated distributions.
Let $d$ be a distribution supported on the reals, and denote by $\mbox{trunc}(d, a, b)$ the distribution supported on the interval $[a, b]$, which is $d$ conditioned on the value being in the interval $[a, b]$.
In general, we have $d.\mbox{energy}(c) = \left(\mbox{trunc}(d, c, \infty).\mbox{mean} - c\right)\cdot (1- d.\mbox{cdf}(c)) - \left(c - \mbox{trunc}(d, -\infty, c).\mbox{mean}\right) \cdot d.\mbox{cdf}(c)$.
So, for instance, if you know cdf of $d$, and the means of truncated versions of $d$, you get the cross-term in the energy already.
For the self-term, you could use $d.\mbox{energy}() = \mathbb{E}\left[ d.\mbox{energy}(X)\right]$, where $X$ is distributed according to $d$. This may give an easier to tackle integral when using the truncated $d$.
FYI @sukjingitsit, have you checked literature on means or moments of truncated chi-squared distributions? As said, no need to, if it looks like a too deep rabbit hole.
I will look over this once for the self-case
To characterise chi-square, I assume, as standard practice, we will use the degrees of freedom, right?
Just saw this, overlooked this at first. Yes, that's how I would parameterize it.
@an20805, would you be so kind to open a PR with your partial work on uniform distribution, even if not finished? Would be a shame if it got lost.
Hello @fkiraly, I am sorry for delaying the PR this much. I had an accident and wasn't able to work for more than a week. I am all good now and got back yesterday. I have opened a PR https://github.com/sktime/skpro/pull/223. I am having some issues, would like to get some help.
sorry to hear, @an20805. Good to hear you are back! I'll reply on #223.
Hi @fkiraly đŸ‘‹ I want to start contributing here, I noticed that some of the unchecked tasks are already being assigned/reviewed. Do you have any recommendations on what distributions should I try to implement? I would be more than happy to contribute to this project, Thank you!
@malikrafsan, welcome!
The ones which don't have anyone talking about are still available. The ProbOnto link has more.
We could also look into integer valued, e.g., Binomial, Poisson - these are common for GLM.
Hi @fkiraly I am very interested in implementing Logistic/Weibull Distribution. However, I failed to find the formula for the energy of those two distributions. Can you help me with this? Or is it better if I raise PRs first? Thank you!
However, I failed to find the formula for the energy of those two distributions. Can you help me with this? Or is it better if I raise PRs first?
There is an approximative (Monte Carlo) default if this is not easy to obtain - in any case it can be done later (or not at all).
Thanks for contributing!
Ahh, I see, thank you so much for your guidance @fkiraly ! Then my PRs are ready to be reviewed. I would very much love to hear your feedback. However, if you have any reference on energy formula of those two distribution, then I would still very like to implement it, thank you so much!
if you have any reference on energy formula of those two distributiom
have you checked in the paper above? If not there, one would have to derive it.
Do you mean this paper?
Appendix A.2 of "evaluating forecasts with scoringRules" has a few explicit formulae for the energy, including the log-normal distribution.
Yes, I have checked the paper but I cannot find the formula. I can only find CRPS and CDF formulas. Does CRPS mean the energy? If so, I think I misunderstood your previous statements
Yes, CRPS is closely releated, it is the cross-term minus half the self-term (compare definitions).
The unfortunate bit about the paper is that it only gives CRPS, but not the self-term or cross-term in isolation. However, it should not be too hard to back these out, using that shifting the distribution location by a constant leaves the self-term unchanged, but not the cross-term.
More precisely, a useful formula to use is
$$\lim_{y \rightarrow \infty} \mbox{CRPS}(y) - y = -\mathbb{E}[X] - \frac{1}{2}\mathbb{E}|X - X'|,$$
i.e., you can obtain the cross-term via taking a limit, if you know the expressions for CRPS and the expectation already.
(the equation follows from observing that the absolute value in $\mathbb{E}\left|y - X \right|$ disappears in the limit)
I would like to work on the Pareto distribution if possible
all yours, @sukjingitsit!
It would be great to have a basic set of probability distributions implemented.
Umbrella issue for implementing
sktime
probability distributions.Recipe: use the
extension_templates/distribution.py
extension template. Examples:Normal
, for de-novo implementations or manual interfacesFisk
, for interfacingscipy
distributions - this is much easier than using the full templateHigh priority:
mid priority:
low priority:
lower priority:
list of many more (lowest priority) https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions - can be interfaced via
_ScipyDist
adapter easily! https://en.wikipedia.org/wiki/File:ProbOnto2.5.jpgMirrors https://github.com/sktime/sktime/issues/4518 (for high and mid priority)
Contributions can be made to either repository, and should be copied over to the other once approved/merged, until the modules are merged into one.