[ENH] inspectable set-valued domains for distributions

fkiraly commented 7 months ago

It would be useful if distributions had some degree of inspectability with respect to their domains (sets). Use cases include:

- approximation defaults, e.g., integer domain distributions know their domain, so expectations can be computed as sums
- typing for distributions, e.g., discrete, continuous, mixed
- inspecting the support of the discrete (deltas) part for mixed distributions, related to representing continuous and discrete parts separately: https://github.com/sktime/skpro/issues/229
- generating useful defaults in plotting routines
- testing distributions; sets may also be useful in specifying parameter ranges in estimators
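
As a rough illustration of the first use case, a distribution that knows its domain is the non-negative integers could compute its mean as a sum. This is a minimal sketch using only the standard library; `poisson_pmf` and `expectation_over_integers` are hypothetical helper names, not part of skpro.

```python
import math
from itertools import count


def poisson_pmf(k: int, mu: float) -> float:
    """Poisson probability mass at the non-negative integer k."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def expectation_over_integers(pmf, start=0, tol=1e-12, max_terms=10_000):
    """Approximate E[X] by summing k * pmf(k) over the integer domain.

    Summation stops once the accumulated probability mass is within
    ``tol`` of 1, or after ``max_terms`` terms.
    """
    total, mass = 0.0, 0.0
    for k in count(start):
        p = pmf(k)
        total += k * p
        mass += p
        if 1.0 - mass < tol or k - start + 1 >= max_terms:
            break
    return total


# mean of a Poisson distribution with rate 3.5, computed as a sum
mean = expectation_over_integers(lambda k: poisson_pmf(k, mu=3.5))
```

The stopping rule only works because the domain is known to be a countable set that can be enumerated in order, which is the kind of information an inspectable domain would provide.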

Some discussion has already taken place here: https://github.com/VascoSch92/sequentium/issues/46 also regarding possible ways to implement this.

Options discussed:

- using sympy Sets, possibly also stats
- de-novo implementation, following BaseObject
- something similar to sklearn.utils parameter checking

Some issues from skpro architecture which may not be obvious how to cover:

- distributions are "tabular" (matrix distributions with pandas-like row and column index). Domains may vary over entries of the table.
- we may need parametric sets, though that is not certain. Afaik only BaseObject supports parametric objects? Composites are supported by all three options above.

fkiraly commented 7 months ago

FYI @VascoSch92

VascoSch92 commented 6 months ago

I will start to work on a first version of a module for symbolic representation of sets.

The idea is to extend from BaseObject.

I still don't fully understand what the application in skpro.distributions would be, but I will try to mimic the API of set6.

I will open a draft PR as soon as I have something interesting. In this way we can discuss the code.

fkiraly commented 6 months ago

Great!

So you think it's better to inherit from BaseObject than using the existing logic in sympy?

If I may ask, what are your pros/cons and weighting? Just curious.

VascoSch92 commented 6 months ago

Actually I was playing a little bit with the set implementation of sympy, and I have to admit that it is pretty nice.

One could think of using that and extending it to implement the measure of a set and integral computations.

However, adding sympy to the dependencies may be overkill. Do we need so much power for the purpose of the project? Can we just install the specific module which takes care of sets?

Perhaps clarifying the exact API the project needs could lead to a decision. From what I understand, the main purpose of this module is to compute the pdf and pmf of a distribution. To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?

fkiraly commented 6 months ago

> To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?

Basically, yes - that's the key requirement. Also, finite/discrete sets for distributions of arbitrary support, but that's basically already python set.

Yes, the "weighty dependency" argument is convincing. I'd agree it outweighs the "do not reinvent the wheel" one, as it's going to be a small wheel (for now).

VascoSch92 commented 6 months ago

Why don't we try to solve the problem directly using the integration provided by scipy?

fkiraly commented 6 months ago

I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.

fkiraly commented 6 months ago

Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?

VascoSch92 commented 6 months ago

> I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.

yes but we cannot separate the two parts?

> Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?

Yes

fkiraly commented 6 months ago

> yes but we cannot separate the two parts?

What do you mean by that? Do you mean this as a suggestion, or a statement?

VascoSch92 commented 6 months ago

Let's say you have a mixed distribution. Can we separate the two parts (dense and discrete), compute integrals on both and then put them together?

fkiraly commented 6 months ago

yes, that is exactly my thinking. But for that, you'd need to represent pmf and pdf separately. For both, you'd need some representation of domain to set up the integration, which brings us to the topic of this issue.
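
A minimal sketch of that idea, assuming the discrete part is given as a mapping from atoms to masses and the continuous part as a density together with an interval. All names here (`simpson`, `mixed_expectation`) are hypothetical, and a plain Simpson rule stands in for scipy's integration so that the example depends only on the standard library.

```python
from statistics import NormalDist


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3


def mixed_expectation(g, atoms, density, interval):
    """E[g(X)] for a mixed distribution.

    atoms    : dict mapping support points of the discrete part to their mass
    density  : pdf of the absolutely continuous part (not normalised to 1)
    interval : (lower, upper) bounds of the continuous part's support
    """
    discrete_part = sum(g(x) * p for x, p in atoms.items())
    continuous_part = simpson(lambda x: g(x) * density(x), *interval)
    return discrete_part + continuous_part


# Hypothetical example: an atom of mass 1/2 at 0, plus the standard normal
# density restricted to [0, inf). The infinite upper bound is truncated at 10,
# where the density is negligible.
std_normal = NormalDist(0.0, 1.0)
total_mass = mixed_expectation(lambda x: 1.0, {0.0: 0.5}, std_normal.pdf, (0.0, 10.0))
mean = mixed_expectation(lambda x: x, {0.0: 0.5}, std_normal.pdf, (0.0, 10.0))
```

Both the set of atoms and the integration interval are domain information, which is why the integration cannot be set up without some representation of the two supports.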

VascoSch92 commented 6 months ago

Can you give a concrete example of a mixed distribution you would like to implement?

fkiraly commented 6 months ago

Sure, here are two:

- clipped normal, i.e., a random variable of the form $\max (c, X)$ for a normal random variable $X$ and constant $c$.
- mixture of empirical and normal, this can occur when applying Mixture to Normal and Empirical.

VascoSch92 commented 6 months ago

Sorry, but I still have trouble understanding the clipped normal example.

We have the max of two continuous functions, so shouldn't we have a continuous support for pdf and cdf?

fkiraly commented 6 months ago

> We have the max of two continuous functions, so shouldn't we have a continuous support for pdf and cdf?

Yes, the full support is continuous, but the distribution is mixed, so by the Lebesgue decomposition theorem we can decompose it in a non-trivial absolutely continuous part, a pure point part, and there's no singular part because those aren't distributions that we want to look at (😁)

The clipped normal therefore has two supports (taking $c = 0$ and $X$ centered at 0):

- the absolutely continuous part has support $[0, \infty)$, the measure of the absolutely continuous part on this is 1/2
- the pure point part has support $\{ 0 \}$, the mass (measure) of the pure point part on 0 is 1/2
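
To make the two masses explicit, assume $c = 0$ and $X \sim N(0, \sigma^2)$ (these values are an assumption, chosen because they are consistent with the supports and masses listed here). Writing $Y = \max(0, X)$, with $\Phi$ and $\varphi$ the standard normal cdf and pdf:

$$
F_Y(y) = \begin{cases} 0, & y < 0, \\ \Phi(y / \sigma), & y \ge 0, \end{cases}
$$

$$
P(Y = 0) = F_Y(0) - \lim_{y \uparrow 0} F_Y(y) = \Phi(0) = \tfrac{1}{2},
\qquad
\int_0^\infty \frac{1}{\sigma} \varphi(y / \sigma) \, dy = 1 - \Phi(0) = \tfrac{1}{2}.
$$

The jump of $F_Y$ at 0 is the pure point part, and the density on $[0, \infty)$ is the absolutely continuous part; together they carry total mass 1.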

Some confusion may come from the word "continuous", which is overloaded: it can be used as a property or qualifier for

- distributions/measures, as shorthand for "absolutely continuous" (wrt the Lebesgue measure on R)
- distribution defining functions, such as the cdf or pdf. These being continuous bears no relevance here - but due to the overloading it is frequently confused.
- sets, such as a support or domain - sometimes used for Borel open sets or their closure, to differentiate from countable unions of points

VascoSch92 commented 6 months ago

Ah ok, now it is much clearer. Sorry, I'm not an expert in probability theory :-(

In practice, you want to combine the normal distribution and the mass measure at 0 to get the clipped normal. In the instance of the normal distribution you give the support, and the same for the mass measure, right? From that you can compute what you need.

OK 👍 now it is clear... I can start coding something and see if it fits the needs of the package

fkiraly commented 6 months ago

> In the instance of the normal distribution you give the support, and the same for the mass measure, right? From that you can compute what you need.

Yes, exactly.

> I can start coding something and see if it fits the needs of the package

What's your design, if I may ask?

I'd go with modifying BaseDistribution. You may also be interested in this refactor PR: https://github.com/sktime/skpro/pull/265

VascoSch92 commented 6 months ago

Basic idea: a parent class Set which extends BaseObject and is more or less an interface. Then two subclasses: one for intervals (Interval) and one for discrete sets (Discrete (?)), as these are the two kinds of sets we need the most.

Then we have the following questions:

- do we need union? direct product? intersection?
- which methods/properties do we need? boundary? mass/measure? interior?

We can expose in the BaseDistribution class the domain of the probability.

I will try to open a draft/sketch PR for feedback and guidance as soon as possible :-)
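
A minimal sketch of what such a hierarchy could look like. Everything here is hypothetical: the class is called `BaseSet` rather than `Set`, a plain `ABC` stands in for BaseObject to keep the example free of dependencies, and `contains` is just one candidate method.

```python
import math
from abc import ABC, abstractmethod


class BaseSet(ABC):
    """Hypothetical interface for a subset of the real line."""

    @abstractmethod
    def contains(self, x: float) -> bool:
        """Return True if x is an element of the set."""


class Interval(BaseSet):
    """Interval with optionally open ends; bounds may be +/- infinity."""

    def __init__(self, lower, upper, closed_lower=True, closed_upper=True):
        self.lower, self.upper = lower, upper
        # an infinite bound is never attained, so treat it as open
        self.closed_lower = closed_lower and math.isfinite(lower)
        self.closed_upper = closed_upper and math.isfinite(upper)

    def contains(self, x):
        above = x >= self.lower if self.closed_lower else x > self.lower
        below = x <= self.upper if self.closed_upper else x < self.upper
        return above and below


class Discrete(BaseSet):
    """Finite set of points, backed by a frozenset."""

    def __init__(self, points):
        self.points = frozenset(points)

    def contains(self, x):
        return x in self.points


# supports of the two parts of a normal clipped at 0
continuous_support = Interval(0.0, math.inf)
discrete_support = Discrete({0.0})
```

Union, intersection and direct product are deliberately left out; they could be added later as composite classes without changing this interface.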

fkiraly commented 6 months ago

> We can expose in the BaseDistribution class the domain of the probability.

Agreed - we may have to distinguish domains for the discrete and the continuous part as well.

> Then we have the following questions:

Regarding requirements:

- direct product: possibly, in the form of array distributions. Let me explain: the skpro distributions are "array distributions", so we probably need a comparable concept for sets. That makes it unusual, perhaps. E.g., take the Normal or Uniform from the example (Normal.create_test_instance), it has a 2D range. The Normal is supported over the reals, but the Uniform may have different support per entry. Or is this too much for the start, and overdesign?
- union/intersection: I do not see where they would appear, but perhaps we want to think about how to keep the design upwards compatible for this.
- methods: the measure is given by the probability distribution whose domain is the set, so the set itself may not need to have a measure attached to it. I am trying to think where this could be useful, or other things like boundary and interior, but can't see a clear use case.
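
To illustrate the "array" requirement, here is a minimal sketch of a set object that stores one interval per table entry, as would be needed for an array of Uniform distributions with different bounds. The class name `ArrayInterval` and its methods are hypothetical, and nested lists stand in for numpy arrays or pandas objects.

```python
class ArrayInterval:
    """Hypothetical table of closed intervals, one per (row, column) entry."""

    def __init__(self, lower, upper, index, columns):
        # lower and upper are nested lists with shape (len(index), len(columns))
        self.lower, self.upper = lower, upper
        self.index, self.columns = list(index), list(columns)

    def loc(self, row, col):
        """Return the (lower, upper) bounds of the entry at the given labels."""
        i, j = self.index.index(row), self.columns.index(col)
        return self.lower[i][j], self.upper[i][j]

    def contains(self, values):
        """Entry-wise membership test for a nested list of the same shape."""
        return [
            [lo <= v <= up for lo, v, up in zip(lo_row, v_row, up_row)]
            for lo_row, v_row, up_row in zip(self.lower, values, self.upper)
        ]


# support of a 2 x 2 array of uniform distributions with different bounds
support = ArrayInterval(
    lower=[[0.0, 1.0], [-1.0, 2.0]],
    upper=[[1.0, 3.0], [1.0, 5.0]],
    index=["a", "b"],
    columns=["x", "y"],
)
mask = support.contains([[0.5, 0.5], [0.0, 6.0]])
```

Storing the bounds as arrays, rather than an array of scalar set objects, mirrors how the distribution parameters themselves are stored, but whether that is the right trade-off is an open design question.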

VascoSch92 commented 5 months ago

After a very first draft for the module domain (see branch #326 ), we are ready to start introducing domains for distributions.

We will work on the branch #326 until a stable API is found. After that, we will merge into main.

To find a valid and stable API, domains should be introduced for at least one of the following distributions:

[ ] discrete distributions with finite support
[ ] discrete distributions with infinite support
[ ] absolutely continuous distributions supported on a bounded interval
[ ] absolutely continuous distributions supported on intervals of length 2π (directional distributions)
[ ] absolutely continuous distributions supported on semi-infinite intervals ( e.g., [0,∞) )
[ ] absolutely continuous distributions supported on the whole real line
[ ] absolutely continuous distributions with variable support
[ ] mixed discrete/continuous distributions

With domains, we also want to introduce two new methods:

- pdf - probability density function
- pmf - probability mass function

Questions:

- What is the expected API for pdf and pmf?
- Which new _tags should we introduce?
- Do we have at least one example already coded for each of the families of distributions listed above?

fkiraly commented 5 months ago

> What is the expected API for pdf and pmf?

The API is already specified - it has been introduced since 2.2.2, after you branched off. If you update from main, you should see the current specs in the docstring, in the BaseDistribution object.

> Which new _tags should we introduce?

That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.

> Do we have at least one example already coded for each of the families of distributions listed above?

Will reply with a list in the next post.

fkiraly commented 5 months ago

> discrete distributions with finite support

Delta

> discrete distributions with infinite support

Poisson

> absolutely continuous distributions supported on a bounded interval

Beta, QPD_B or Uniform

> absolutely continuous distributions supported on intervals of length 2π (directional distributions)

don't have that

> absolutely continuous distributions supported on semi-infinite intervals ( e.g., [0,∞) )

LogNormal, Exponential

> absolutely continuous distributions supported on the whole real line

Normal

> absolutely continuous distributions with variable support

Uniform, QPD_S, and QPD_B have a support that depends on parameters - entries in an array distribution can have different support.

> mixed discrete/continuous distributions

no "atomic" distribution of this type currently, but you can construct one by applying Mixture to a discrete and a continuous distribution - hope that actually works as expected...

VascoSch92 commented 5 months ago

> That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.

Another question is whether we are interested in the domain of a distribution or in its support. I think the second one is more interesting right?

fkiraly commented 5 months ago

> I think the second one is more interesting right?

Yes, for the moment it is, given that all distributions - even discrete ones - have a support that embeds canonically into the reals. With a distinction on continuous and discrete part.

sktime / skpro

[ENH] inspectable set-valued domains for distributions #244