statrs-dev / statrs

Statistical computation library for Rust
https://docs.rs/statrs/latest/statrs/
MIT License
549 stars 79 forks source link

Remove degenerate cases #102

Open ghost opened 4 years ago

ghost commented 4 years ago

In some places, like the Gamma function, some effort is made to deal with degenerate parameters like infinity. I think this is too much effort for a minute use case, as I don't think users will insert these values to query the library to see what degenerate distributions look like.

Having to deal with the special cases reduces clarity and complicates the API considerably, as it forces to contend with mathematical details difficult to convey in a numerical library. As the limit α -> inf is approached, the pointwise limit of the pdf and the cdf is zero, so these aren't pdfs and cdfs anymore. As the limit β -> inf is approached, the continuous distribution degenerates to a discrete one: P(X = 0) = 1.

I propose to remove all special cases and document why they aren't handled. This would deal with #57 and #98.

boxtown commented 4 years ago

I think we'd have to evaluate this on a case-by-case basis for distributions but I agree I think at least for the Gamma distribution it does not necessarily make sense to allow INF for α or β. I'll create an issue specifically for the Gamma distribution

dhardy commented 4 years ago

FYI, we've recently had someone interested in supporting a particular degenerate case of Exp.

ghost commented 4 years ago

The use case referenced to justify degenerate parameter values is likely erroneous, an absorbing state can be modelled by having all outgoing paths loop back to the same state. In that case any distribution would work, unless one is interested in statistics about how many times a transition, including transitions to the same state, occurred.

I disagree with the proposed solution and agree with the Wikipedia page that the case is degenerate for a reason. To argue this, mathematical infinity is more often than not not an actual element of the probability space but a statement of what happens when a limit is taken. If a distribution is dependent on a parameter α then the function can be seen as depending on two variables p(α,x). If any IEEE infinities or other degenerate cases are to be allowed, then one would have to either

  1. treat all possible permutations of taking the limit and find a way to query the library for each of them, as they're unequal in general
  2. agree to take the limit in one specific order
  3. some mixture of 1. and 2.

Additionally to these complications, as stated above, either cdf can exist but pdf be a non-classical function as is the case with a discrete probability or both limits exist but aren't cdf's and pdf's anymore. If degenerate cases are allowed where not all the functions exist in a classical sense there's an argument to be made that returning errors is more helpful than approximating a delta function with p(x0) = infinity and for all other x p(x) = 0

ghost commented 4 years ago

Perhaps I got lost in the mathematics on this one, I take some of that back. If there's only one parameter apart from the variable of interest then I agree, More general Exp is a reasonable extension. In the case of the gamma function however there are two parameters, so knowing in which order to take the limit doesn't have an equally reasonable default. What also simplifies the resolution in the rand crate is that there no evaluation of pdf is sought, only a consistent extension of sampling.

dhardy commented 4 years ago

Thanks for the analysis and follow up comment. I agree: the exp case seems unambiguous, but in general that is not the case. It may be that allowing for degenerates in only one variable is also a sensible option (but I haven't thought enough about the gamma case).

saona-raimundo commented 4 years ago

Author of the More general Exp issue here.

What also simplifies the resolution in the rand crate is that there no evaluation of pdf is sought, only a consistent extension of sampling.

Exactly!! This is what I had in mind. In the case of Exp, it would be an expected behaviour to sample infinity.

In the other hand, a Gamma with both parameters infinity has no intuitive behaviour. Having said that, one should either document the chosen behaviour (as it has been done in statrs) or not support it to start with.

I think we'd have to evaluate this on a case-by-case basis for distributions

I totally agree with that, but not only because there can be an intuitive answer for some cases but also because there can be useful (and arbitrary) implementations. Just as an example: assert_eq!(std::f64::INFINITY, 1.0 / 0.0) this behaviour simplified our code, but there is no correct answer to what 1 / 0 is.

ghost commented 4 years ago

Just as an example: assert_eq!(std::f64::INFINITY, 1.0 / 0.0) this behaviour simplified our code, but there is no correct answer to what 1 / 0 is.

As I outlined with the mathematical infinities that get confused with IEEE754 infinities, the semantics are not identical. This starts to get unintuitive because -0.0 == 0.0 which respects the mathematics but 1.0 / -0.0 != 1.0 / 0.0, so in some cases the floating point numbers are supposed to be interpreted as limits while in other contexts this isn't the case. So yes 1 / 0 does not have a sensible solution as a number satisfying 0 * a = 1, but that's not the angle to take. Floating point expressions can either be made sensible as numbers, made sensible as limits or neither of those, which leads to the at first unintuitive property of NaNs not being equal to themselves. That's akin to the situation with functions here.

dhardy commented 4 years ago

Yes, but you missed the bit about convergence: 1/x → +inf as x → 0 from above. "INF" is not a real mathematical value, but a model of what you observe when approaching a degenerate value.

ghost commented 4 years ago

Based on the discussion my current proposal would be

I think an error is preferable to NaN on the last point, but this is at least not immediately desirable due to to being backwards-incompatible.

saona-raimundo commented 4 years ago

I agree with these principles and would add the following:

I am not sure the consequences of following these premises, but could start with something:

  1. Gamma distribution should not accept both parameters (alpha and beta) as infinity. But, it should allow one of them to be infinity.
  2. Normal distribution should not accept mean infinity and std_dev infinity at the same time. But, should allow: mean infinity and std_dev finite (returning infinity always); mean finite and zero std_dev (returning a constant equal to mean always); mean finite and infinity std_dev should not be allowed, as it does not converge in a reasonable way.
  3. ...

Of course, the list can be longer and implementing all these changes might not be what we want(?) The possibility is to implement them "as it is requested".

ghost commented 4 years ago
  • If a multi-parameter case has defined behaviour, there should not be a problem in adding it.

I agree, caveats below, especially since errors are returned when constructing the distributions, it's an immediate feedback on what to expect.

  1. Normal distribution should not accept mean infinity and std_dev infinity at the same time. But, should allow: mean infinity and std_dev finite (returning infinity always); mean finite and zero std_dev (returning a constant equal to mean always); mean finite and infinity std_dev should not be allowed, as it does not converge in a reasonable way.

This is at odds with the solution in the rand crate. Finite mean and infinity std_dev do not make the distribution converge towards any other, but neither did the exponential distribution. This would deny extending some properties of the distribution in a meaningful way, like the mean and the std_dev, which do have sensible limits. More general Exp does not care whether the distribution converges, only if one selected property can be extended consistently. I would even argue that using the logic used to extend the exponential clock one would conclude that the limit of sampling of a normal distribution with mean 0 and infinite variance is infinity with probability 1/2 and -infinity with probability 1/2, so does have a limit.

The possibility is to implement them "as it is requested".

Strongly disagree. We should model the established mathematical practices as closely as possible. Convergence of random variables have formal methodologies for characterising random variables are degenerate cases of each other. I thus reverse my reversal on the issue and think more general Exp is a bad idea due to selectively extending properties derived from a distribution that cannot formally be shown to converge against any distribution. It breaks mathematics in subtle ways. The principled approach would be to pick one of the convergence metrics and extend to degenerate cases if such a limit in the chosen metric exists, and deny deriving any properties from inexistent distribution limits.

Edit: big rewrite for clarification purposes

saona-raimundo commented 4 years ago

I agree, caveats below, especially since errors are returned when constructing the distributions, it's an immediate feedback on what to expect.

Yes, indeed! and any function that can fail should return a Result, that is the idea :)

Finite mean and infinity std_dev do not make the distribution converge

Exactly, that is why I proposed to not accept this case either:

mean finite and infinity std_dev should not be allowed, as it does not converge in a reasonable way.

Convergence

More general Exp does not care whether the distribution converges, only if one selected property can be extended consistently.

Maybe it was not so explicit, but the convergence we considered to solve More general Exp was "the intuitive behaviour a user would expect", which I think is still the correct thing to do.

To translate into a mathematical convergence one needs a sequence of random variables. Sadly, there is no "good sequence" to consider. Let's say we want to claim something like "Exp(0) is the almost sure limit of a sequence of Exp(lambda) when lambda goes to zero". One can prove that the almost sure convergence depends on the sequence of lambdas we consider. Therefore, there is no way we can put a criteria like "let's accept the limit when it corresponds to the almost sure convergence". (see Appendix)

Intuitive convergence

If intuitive convergence is not almost sure convergence, what should it be? As you point out, the convergence of distributions is not a "good" convergence either, since More general Exp would not be right in this case.

I think that, if there is a characterization of Exp(lambda) = Exp(1) / lambda, then it is easy to agree in the limit when lambda goes to zero. Basing the limit cases in characterizations like this is, of course, relative to the particular characterization and the limit might "break" under other characterizations.

Usually, such a characterization is the one used when simulating the variable, so in the context of the rand crate is easy to get an agreement. Maybe, if such a characterization is also part of the documentation in statrs, then the expected behaviour will also be "intuitive".

Practicality

The possibility is to implement them "as it is requested".

Strongly disagree. We should model the established mathematical practices as closely as possible.

This is purely from a practical point of view. Of course one would like the crate to be as complete as possible.

Appendix

Consider a sequence of variables X_n with distribution Exp(lambda_n). We will show that, depending on the sequence (lambda_n)_{n > 0}, we can have that (X_n) converges almost surely to infinity or not.

Proof:

Case 1: not convergent. First, consider a decreasing sequence (alpha_n) that goes to zero. The idea is to take increasingly more copies of alpha_n so that we can ensure that there is no convergence to infinity. Consider alpha_1, then, take m_1 copies of an Exp(alpha_1) so that the probability that the minimum among these variables is less than one is greater than (0.9)^{1 / 2^n}. Then go to alpha_2 and do the same to define a new number m_2, and so on for each alpha_n define m_n.

Consider the sequence (lambda_n) as the decreasing sequence of all alphas considered in the previous step, with the number of repetitions considered. The idea is that the sequence (Exp(lambda_n)) has so many chances to be below one, that the liminf of the sequence is not infinity and therefore its limit either. Indeed, consider n such that lambda_{n+1} < lambda_n, i.e. just after "changing to a lower alpha". Then, the probability that in the last m_n variables the minimum is less than one is at least (0.9)^{1 / 2^n}. Therefore, the liminf is less than one with probability one.

Case 2: convergent.

Consider a sequence of Exponential variables X_n with distribution Exp(1 / n^2). We need to prove that the probability of lim_{n \to \infty} X_n = \infty is one. Equivalently, we can prove that, for all K > 0, we have that the probability of liminf_{n \to \infty} X_n > K is one.

Define the event E_n = {w : X_n(w) <= K}. Note that (limsup E_n)^c is the event liminf X_n > K. Let us prove that the probability of limsup E_n is zero. The probability of E_n is 1 - e^{-K / n^2}, which is less than K / n^2 (linear expansion of exponential around zero). Using Borel-Cantelli's lemma, since the sum over the probability of E_n is finite, we have that the probability of limsup E_n is zero, which means that the probability of liminf X_n > K is one. Since K is arbitrary, we conclude that the liminf of (X_n) is infinity and therefore the sequence converges to infinity almost surely.

dhardy commented 4 years ago

the limit of sampling of a normal distribution with mean 0 and infinite variance is infinity with probability 1/2 and -infinity with probability 1/2, so does have a limit

This is still divergence, so I don't think it's valid to say that it has a limit.

The possibility is to implement them "as it is requested".

In practice, many things don't get implemented until requested. But if we can tackle degenerate cases comprehensively, that would be great. (This also implies documenting why certain degenerate cases should not return a solution.)

Let's say we want to claim something like "Exp(0) is the almost sure limit of a sequence of Exp(lambda) when lambda goes to zero". One can prove that the almost sure convergence depends on the sequence of lambdas we consider.

Example? The distribution is only (originally) defined for on positive parameters. Ah, this is "Case 1: not convergent." This is an (informal) proof that for any λ > 0, there is some ε > 0 such that for X in E(λ), P(X < 1) > ε. This is a very weak statement, especially considering the inherent limitations of computing (maximum number of samples, limits on precision of common data types).

It should be clear that for any fixed ε > 0 and k > 0, there exists some λ > 0 such that for X in E(λ), P(X < k) < ε. Is this not enough to say, for any fixed N, lim{λ→0}[ min(N samples of X from E(λ)) ] > K where K is our largest finite representable number? (Of course if N is unbounded this does not hold, but for any application within computing N must be bounded.)

ghost commented 4 years ago

To translate into a mathematical convergence one needs a sequence of random variables. Sadly, there is no "good sequence" to consider. Let's say we want to claim something like "Exp(0) is the almost sure limit of a sequence of Exp(lambda) when lambda goes to zero". One can prove that the almost sure convergence depends on the sequence of lambdas we consider. Therefore, there is no way we can put a criteria like "let's accept the limit when it corresponds to the almost sure convergence". (see Appendix)

Yes, that was exactly what I was trying to say. I agree that it does not converge almost surely (although I disagree with the proof), which is why there's a mismatch between rand and statrs. It seems to me we can agree that the returning of errors upon calling T::new() for degenerate cases means there is some notion of convergence, ideally I would think a formal one, for which such a limit does not exist. This prevents statrs from calling sample on such a function, as it returns an error for Exponential::new(0.0). Nit-pick: When a parameter is real the definition of convergence is it converges no matter what discrete parameter sequence converging towards the degenerate value is chosen.

It boils down to two questions:

  1. Should statrs and rand be consistent?
  2. Should one care about lim z→z' H(pdfz) or lim z→z' pdfz for degenerate z' values, where H is a function dependent on pdf?

@dhardy

This is still divergence, so I don't think it's valid to say that it has a limit.

I'm aware this is not a formal argument. I was using the intuitive version of convergence to sample how far one is willing to take this.

This is a very weak statement, especially considering the inherent limitations of computing (maximum number of samples, limits on precision of common data types).

I'm not sure where this is heading. Are you suggesting to implement different library functions for f32,f64...?

dhardy commented 4 years ago

Are you suggesting to implement different library functions for f32,f64...?

Actually, yes, but not now. It's also besides the point. The point is that for any situation using one of these libs, there is some limit on precision and some limit on the number of samples. It doesn't matter what those limits are, only that they exist; given that, the argument above is that for Exp(0), all observable samples should be +INF.

I haven't really considered other distributions, yet.

For Norm(μ, σ²), one gets into problems once σ² gets close to the limits of precision. Since a sample is generated as self.mean + self.std_dev * n, FP arithmetic may overflow to ±INF even when all parameters are finite. To prevent that we'd need some arbitrary bound on std_dev much smaller than INF. Note that the situation for norm = 0, std_dev = INF is not that different than for std_dev finite but close to INF: both may generate +INF and -INF values (although the latter case can also generate finite values). Whether or not std_dev = INF is allowed, the user has to be able to handle INF outputs if parameters may approach the limits of FP range. And NaN can be output if (1) mean and std_dev are both non-finite or (2) if std_dev is non-finite and the standard-normal sample n = 0. There is also an issue that if mean and std_dev are both close to the maximum finite value, then -INF values may be generated almost as often as +INF values (since when n × std_dev evaluates to -INF, adding a large but finite mean value does nothing).

The above analysis is to say that however we choose to solve this problem, (a) case-by-case analysis will be needed and (b) one cannot consider this a pure-maths problem.

Perhaps we should start by asking questions like:

saona-raimundo commented 4 years ago

When a parameter is real the definition of convergence is it converges no matter what discrete parameter sequence converging towards the degenerate value is chosen.

The issue with this approach is that we are not dealing with real numbers, we are dealing with floating points (or finite precision numbers), which include infinity. Therefore, one should not focus so much in the math here, and more in the actual numbers on the computer. Note: Even with arbitrary precision, the source of randomness needs to have a finite precision, so there is no such thing as arbitrary precision while simulating continuous random variables.

INF is useful, for example, to convey: "precision limit overflow". Returning NaN would express too little of what is actually happening and, in some situations, +INF or -INF can be used by the user.

(a) case-by-case analysis will be needed and (b) one cannot consider this a pure-maths problem.

Totally agree.

Is it reasonable to generate ±INF values?

Well, you know my answer. It boils down to "it can be more useful to an user than an error", especially since INF can be already obtained (recall finite precision), as @dhardy pointed out.

Is it reasonable to generate NaN values?

I think it would be strange for a user to get a NaN value. I do not know of any "reasonable" case in which it is the right answer, but there might be.

If I get NaN while doing computations, it is because either (1) I went beyond the precision, or (2) some computation can actually fail and I should replace it with another.

For me, NaNs are like errors, and we should try to document them if there is a possibility of returning them.

ghost commented 4 years ago

Since a sample is generated as self.mean + self.std_dev * n, FP arithmetic may overflow to ±INF even when all parameters are finite. To prevent that we'd need some arbitrary bound on std_dev much smaller than INF. [...] There is also an issue that if mean and std_dev are both close to the maximum finite value, then -INF values may be generated almost as often as +INF values (since when n × std_dev evaluates to -INF, adding a large but finite mean value does nothing). [...] one cannot consider this a pure-maths problem

If evaluating an expression in infinite precision and range would yield a value that should not overflow/NaN when the result is expressed in fp, for example (f64::MAX + 1.0) - 1.0, yet the fp calculation yields overflow/NaN, then this is due to intermediate rounding/overflow errors propagating and it is the job of numerical analysis to find an expression that can be accurately evaluated in fp. There's no disagreement in that. If finding different expressions that overflow less for the given range is your goal, like (mean + std_dev / 2.0 * n) + std_dev / 2.0 * n where mean is close to -inf, std_dev is close to inf and n is positive, I would suggest filing a separate issue, as this conversation is getting difficult to follow.

OP was about what to do with point-blank degenerate inputs and to figure out what the sensible result should be, given the expressed interest in expanding the domain of the methods defined in statrs. Established practice in this library is to take the limit whenever mathematical functions have to be evaluated at degenerate values, which is why every pdf evaluates to 0 at inf. Otherwise not taking mathematical limits would lead to NaN values, like with the pdf of gamma: NaN = inf * 0 = beta^alpha / Gamma(alpha) * inf^(alpha - 1) * exp(-beta * inf) = pdf(inf). I was under the impression that this was not controversial, and thus suggested how to extend T::new() to degenerate values for distributions T. Since the object constructed by T::new() is a probability measure, I suggested the extension of T::new() to degenerate values be the analogue of the limit on real numbers, which is a limit in functions. My proposed solution to treat point-blank degenerate values which have no issues of fp precision is to treat them as limits, which has the following properties

  1. it's mathematically consistent
  2. it has a precedent in this library
  3. it agrees with fp evaluations for simple functions (1/inf = 0)
  4. it's sometimes the only way to avoid NaNs when extending the domain (x * ln(x) at x = 0)

This conversation has switched rather abruptly from "what to do with degenerate inputs" to "what inputs to consider degenerate" and "what are reasonable outputs". Since no effort was made to answer my questions about whether rand and statrs should be consistent and what limit to care about, I cannot say whether this is accidental. In any case, fp precision doesn't change what to do with degenerate inputs, only what inputs to consider degenerate. Taking into account the fp precision means any checks x == INFINITY should be changed to x > f64::MAX and any x == a for normal a changed to (x - a).abs() < f64::EPSILON.

My answers to your questions are

Questions of my own are

The issue with this approach is that we are not dealing with real numbers, we are dealing with floating points (or finite precision numbers), which include infinity. Therefore, one should not focus so much in the math here, and more in the actual numbers on the computer.

I'm afraid I don't understand what you're saying. If I give you exactly inf as an input and ask you to make sense of it, something I cautioned in OP against doing, you've left the place where you have a reasonable expectation of fp giving you anything sensible. Occasionally this might happen, like 1/inf = 0, but often you'll get expressions like inf / inf, inf - inf, 0 * inf, where the crude rules in fp will give NaN because there's no limit for such expressions that's universally true in mathematics. If you evaluate ln(x) / x for any sufficiently large x that doesn't overflow, it will be a roughly decreasing function modulo some fp rounding perturbations, until you hit a value such that x overflows and the expression gives NaN. If the limit is not taken, what do you expect this example or the pdf of gamma to yield at inf? Gamma currently gives 0, a perfectly sensible and reasonable answer, motivated by pure math. If you're refusing to accept the answer provided by pure math after you've opted to go beyond what fp is reasonably capable of, yet want some value that's not NaN, I don't know what you're looking for.

dhardy commented 4 years ago

If finding different expressions that overflow less for the given range is your goal

It's not my goal — I was merely pointing out other limitations, but as you say, it's a bit off topic.

My proposed solution to treat point-blank degenerate values which have no issues of fp precision is to treat them as limits

Where a mathematical limit exists, this is fine. The problem that came up was @rasa200's "proof of non-convergence" of Exp(0). Whether this proof is valid depends on the definition of convergence. Going back to these definitions... okay, we should skip to your question about PDFs.

what you think Exponential(0.0).unwrap().pdf(4.0) should be

It should be clear that lim{x→0} Exp(x).PDF is 0 everywhere. In this case the CDF is also zero everywhere, and hence continuous. The inverse CDF cannot be defined, but other than this I believe there is no issue.

Of course, one has to be careful; e.g. Wikipedia gives an example with discontinuous CDF which may be more of an issue.

This thus goes back to the question of which definition of convergence should be used?

whether you think statrs and rand should be consistent

Good question. I don't know. Rand does not implement CDF or PDF and probably won't ever do so, and tends to consider speed quite important (though we still haven't resolved whether to include high-precision alternatives). Even so, I don't believe there is a reason to choose different approaches to degenerate cases. @vks may also have an opinion.


Some reasonable effort should be made to avoid inf as an output if possible, by finding a new expression that does not overflow but accurately represents the mathematical function that is to be approximated. At some point you reach diminishing returns.

Usually only inputs close to FP limits can cause this. I don't believe we should care very much about this case, at least not when the only error is to generate obvious error/degenerate values like NaN and INF.

any checks x == INFINITY should be changed to x > f64::MAX

I believe the two are equivalent (there are only two infinity values).

any x == a for normal a changed to (x - a).abs() < f64::EPSILON

That's wrong (doc). In any case, one does not usually check direct equality, except at degenerate values, where this is not an issue.

ghost commented 4 years ago

It should be clear that lim{x→0} Exp(x).PDF is 0 everywhere. In this case the CDF is also zero everywhere, and hence continuous. The inverse CDF cannot be defined, but other than this I believe there is no issue.

You don't think it's awkward to allow returning a pdf without signaling an error, even though this mechanically extended pdf does not have the properties of a pdf, like integrating to 1?

What do you think should Gamma(1,inf).max() be? The limits do not commute. Taken as a limit in the almost-sure sense (limz->inf Gamma(1,z)).max() = 0, while limz->inf (Gamma(1,z).max()) = inf. The generalisation of exp in the rand crate deals with the latter limit. I argue the former limit should be used in statrs.

I think the edge case needed by @rasa200 can be dealt with by providing a Delta distribution that returns a constant when sampled, and making the constant inf.

If the limit of the distribution does not exist in the a.s. sense, statrs should return errors.

any checks x == INFINITY should be changed to x > f64::MAX

I believe the two are equivalent (there are only two infinity values).

There are some non-obvious things going on with IEEE754 which make me uncertain whether this is the case [0]. I think they're not equivalent. Doing the latter comparison I believe protects against these oddities, hopefully.

any x == a for normal a changed to (x - a).abs() < f64::EPSILON

That's wrong (doc).

Thanks for pointing that out. Absolute difference is of course rubbish. Approx seems to be an implementation of what is needed.

[0] https://arxiv.org/abs/cs/0701192

dhardy commented 4 years ago

You don't think it's awkward to allow returning a pdf without signaling an error

I guess @boxtown should answer that since he's the maintainer of this lib. Yes, it is unexpected that the PDF integrates to 0 and the CDF is never 1, on the other hand it is still possible to implement the API according to its expectations (aside from those implicit ones on the properties of a PDF/CDF).

I would never expect numerical methods to be able to accurately quantify the integral of a PDF or inverse a CDF (without also providing bounds on the region of interest, which here obviously don't exist), hence this may not be a problem in practice. I don't know whether this is too awkward.

YeungOnion commented 2 months ago

I'd like to get the degenerate cases removed and documented.

I hadn't noticed this discussion, but I've got an open PR for ensuring that Uniform is constructed on a bounded interval for this aspect of, "it is not easily clear what the behavior should be" to a user, so we will do our best to not answer such questions as a user can choose what works for them. I also extended the StatsError to have an enum for rejection based on finiteness.

YeungOnion commented 2 months ago

Want to get the release out in hopes to get more feedback on what users expect on these cases.