sktime / skpro

A unified framework for tabular probabilistic regression, time-to-event prediction, and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
247 stars, 45 forks

[ENH] inspectable set-valued domains for distributions #244

Open fkiraly opened 7 months ago

fkiraly commented 7 months ago

It would be useful if distributions had some degree of inspectability with respect to their domains (sets), use cases include:

Some discussion has already taken place here: https://github.com/VascoSch92/sequentium/issues/46 also regarding possible ways to implement this.

Options discussed:

Some issues from skpro architecture which may not be obvious how to cover:

fkiraly commented 7 months ago

FYI @VascoSch92

VascoSch92 commented 6 months ago

I will start to work on a first version of a module for symbolic representation of sets.

The idea is to extend from BaseObject.

I still don't fully understand what the application in skpro.distributions would be, but I will try to mimic the API of set6.

I will open a draft PR as soon as I have something interesting. In this way we can discuss the code.

fkiraly commented 6 months ago

Great!

So you think it's better to inherit from BaseObject than using the existing logic in sympy?

If I may ask, what are your pros/cons and weighting? Just curious.

VascoSch92 commented 6 months ago

Actually, I was playing a little bit with the set implementation of sympy and I have to admit that it is pretty nice.

One could think to use that and extend it to implement the measure of a set and integral computations.

However, adding sympy to the dependencies may be overkill. Do we need so much power for the purpose of the project? Can we just install the specific module which takes care of sets?

Perhaps clarifying the exact API needed for the project could lead to a decision. From what I understand, the main purpose of this module is to compute the pdf and pmf for a distribution. To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?

fkiraly commented 6 months ago

To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?

Basically, yes - that's the key requirement. Also, finite/discrete sets for distributions of arbitrary support, but that's basically already python set.

Yes, the "weighty dependency" argument is convincing. I'd agree it outweighs the "do not reinvent the wheel" one, as it's going to be a small wheel (for now).

VascoSch92 commented 6 months ago

Why don't we try to solve the problem directly using the integration provided by scipy?

fkiraly commented 6 months ago

I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.

fkiraly commented 6 months ago

Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?

VascoSch92 commented 6 months ago

I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.

yes but we cannot separate the two parts?

Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?

Yes

fkiraly commented 6 months ago

yes but we cannot separate the two parts?

What do you mean by that? Do you mean this as a suggestion, or a statement?

VascoSch92 commented 6 months ago

Let's say you have a mixed distribution. Can we separate the two parts (dense and discrete), compute integrals on both and then put them together?

fkiraly commented 6 months ago

yes, that is exactly my thinking. But for that, you'd need to represent pmf and pdf separately. For both, you'd need some representation of domain to set up the integration, which brings us to the topic of this issue.
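The "separate the parts, integrate each, then add" idea could be sketched as follows. This is only an illustration, not skpro code: the function `mixed_interval_prob`, the `atoms` dictionary and the example distribution are hypothetical, and the continuous part is assumed to be given as a sub-density that integrates to less than 1. Note that both parts need a description of where they live (the atom locations, and the integration limits), which is the domain question of this issue.

```python
import math

from scipy.integrate import quad


def mixed_interval_prob(atoms, density, a, b):
    """P(a <= X <= b) for a mixed distribution (hypothetical helper).

    atoms   : dict mapping point -> probability mass (discrete part)
    density : callable, sub-density of the absolutely continuous part
    """
    # discrete part: sum the masses of all atoms that lie in [a, b]
    p_discrete = sum(mass for point, mass in atoms.items() if a <= point <= b)
    # continuous part: numerically integrate the sub-density over [a, b]
    p_continuous, _abs_err = quad(density, a, b)
    return p_discrete + p_continuous


# assumed example: mass 0.3 at zero, plus 0.7 times an Exponential(1) density
atoms = {0.0: 0.3}


def density(x):
    return 0.7 * math.exp(-x) if x >= 0 else 0.0


p = mixed_interval_prob(atoms, density, 0.0, 1.0)
print(p)
```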

VascoSch92 commented 6 months ago

Can you give a concrete example of a mixed distribution you would like to implement?

fkiraly commented 6 months ago

Sure, here are two:

VascoSch92 commented 6 months ago

Sorry, but I still have trouble understanding the clipped normal example.

We have the max of two continuous functions, so shouldn't we have a continuous support for the pdf and cdf?

fkiraly commented 6 months ago

We have the max of two continuous functions, so shouldn't we have a continuous support for the pdf and cdf?

Yes, the full support is continuous, but the distribution is mixed, so by the Lebesgue decomposition theorem we can decompose it into a non-trivial absolutely continuous part and a pure point part, and there's no singular part because those aren't distributions that we want to look at (😁)
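To make the decomposition concrete, here is a worked version under the assumption that the clipped normal is $Y = \max(X, 0)$ with $X \sim \mathcal{N}(\mu, \sigma^2)$ (the clipping point and direction are assumptions for illustration). $\Phi$ and $\varphi$ denote the standard normal cdf and pdf.

```latex
% Assumption: Y = \max(X, 0) with X \sim \mathcal{N}(\mu, \sigma^2)
\begin{aligned}
P(Y = 0) &= P(X \le 0) = \Phi\!\left(-\tfrac{\mu}{\sigma}\right)
  && \text{(pure point part, supported on } \{0\}\text{)} \\
f_{\mathrm{ac}}(y) &= \tfrac{1}{\sigma}\,\varphi\!\left(\tfrac{y-\mu}{\sigma}\right)\mathbf{1}\{y > 0\}
  && \text{(absolutely continuous part, supported on } [0, \infty)\text{)} \\
\int_0^\infty f_{\mathrm{ac}}(y)\,dy &= 1 - \Phi\!\left(-\tfrac{\mu}{\sigma}\right) = \Phi\!\left(\tfrac{\mu}{\sigma}\right) \\
F_Y(y) &= \begin{cases} 0, & y < 0 \\ \Phi\!\left(\tfrac{y-\mu}{\sigma}\right), & y \ge 0 \end{cases}
\end{aligned}
```

The cdf jumps at $0$ by exactly the point mass $\Phi(-\mu/\sigma)$, and the two parts together have total mass 1.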

The clipped normal has two supports, therefore:

Some confusion can be coming from the word "continuous", which is overloaded, as it could be used as a property or qualifier for

VascoSch92 commented 6 months ago

Ah ok, now it is much clearer. Sorry, I'm not an expert in probability theory :-(

In practice, you want to combine the normal distribution and the mass measure at 0 to obtain the clipped normal. In the instance of the normal distribution you give the support and same for the mass measure, right? from that you can compute what you need.

ok 👍 now it is clear... i can start to try coding something and see if it fit the needs of the package

fkiraly commented 6 months ago

In the instance of the normal distribution you give the support and same for the mass measure, right? from that you can compute what you need.

Yes, exactly.

i can start to try coding something and see if it fit the needs of the package

What's your design, if I may ask?

I'd go with modifying BaseDistribution. You may also be interested in this refactor PR: https://github.com/sktime/skpro/pull/265

VascoSch92 commented 6 months ago

Basic idea: a parent class Set which extends BaseObject and is more or less an interface. Then two subclasses: one for intervals (Interval) and one for discrete sets (Discrete (?)), as these are the two kinds of sets we need the most.

Then we have the following questions:

We can expose in the BaseDistribution class the domain of the probability.

I will try to open a draft/sketch PR for feedback and guidance as soon as possible :-)
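A minimal sketch of the Set / Interval / Discrete idea described above. Everything here is an assumption for illustration: the method names `contains` and `measure`, the constructor arguments, and the use of a plain abstract base class as a stand-in for BaseObject so that the snippet runs without extra dependencies. `Discrete` only covers finite sets here, so infinite discrete supports would need something more.

```python
import math
from abc import ABC, abstractmethod


class Set(ABC):
    """Interface for a symbolic set (stand-in for a BaseObject subclass)."""

    @abstractmethod
    def contains(self, x):
        """Return True if x is an element of the set."""

    def __contains__(self, x):
        return self.contains(x)


class Interval(Set):
    """Interval with end points lower <= upper, each open or closed."""

    def __init__(self, lower, upper, closed_lower=True, closed_upper=True):
        if lower > upper:
            raise ValueError("lower must not exceed upper")
        self.lower = lower
        self.upper = upper
        # an infinite end point can never belong to the set
        self.closed_lower = closed_lower and math.isfinite(lower)
        self.closed_upper = closed_upper and math.isfinite(upper)

    def contains(self, x):
        above = x > self.lower or (self.closed_lower and x == self.lower)
        below = x < self.upper or (self.closed_upper and x == self.upper)
        return above and below

    def measure(self):
        """Lebesgue measure, i.e. the length of the interval."""
        return self.upper - self.lower


class Discrete(Set):
    """Finite set of points."""

    def __init__(self, points):
        self.points = frozenset(points)

    def contains(self, x):
        return x in self.points

    def measure(self):
        """The Lebesgue measure of a finite set is zero."""
        return 0.0


half_line = Interval(0.0, math.inf)  # the half line [0, inf)
atom = Discrete({0.0})               # a single atom at zero
```

With these two objects, the support of a distribution clipped at zero could be described as the pair `(atom, half_line)`, one set per part.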

fkiraly commented 6 months ago

We can expose in the BaseDistribution class the domain of the probability.

Agreed - we may have to distinguish domains for the discrete and the continuous part as well.

Then we have the following questions:

Regarding requirements:

VascoSch92 commented 5 months ago

After a very first draft of the domain module (see branch #326), we are ready to start introducing domains for distributions.

We will work on the branch #326 until a stable API is found. After that, we will merge into main.

To find a valid and stable API, domains should be introduced for at least one of the following distributions:

With domains, we also want to introduce two new methods:

Questions:

  1. What is the expected API for pdf and pmf?
  2. Which new _tags should we introduce?
  3. Do we have at least one example already coded for each of the distribution families listed above?

fkiraly commented 5 months ago

  1. What is the expected API for pdf and pmf?

The API is already specified - it has been introduced since 2.2.2, after you branched off. If you update from main, you should see the current specs in the docstring, in the BaseDistribution object.

  2. Which new _tags should we introduce?

That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.

  3. Do we have at least one example already coded for each of the distribution families listed above?

Will reply with a list in the next post.

fkiraly commented 5 months ago

discrete distributions with finite support

Delta

discrete distributions with infinite support

Poisson

absolutely continuous distributions supported on a bounded interval

Beta, QPD_B or Uniform

absolutely continuous distributions supported on intervals of length 2π (directional distributions)

don't have that

absolutely continuous distributions supported on semi-infinite intervals ( e.g., [0,∞) )

LogNormal, Exponential

absolutely continuous distributions supported on the whole real line

Normal

absolutely continuous distributions with variable support

Uniform, QPD_S, and QPD_B have a support that depends on parameters - entries in an array distribution can have different support.

mixed discrete/continuous distributions

no "atomic" distribution of this type currently, but you can construct one using Mixture of a discrete and a continuous distribution - hope that actually works as expected...
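For comparison, scipy.stats distributions already expose the end points of their support via `support()`. The sketch below uses scipy counterparts of some of the families above, not the skpro classes, and the parameter values are arbitrary. It shows the limitation relevant here: only a pair of end points is returned, so a discrete support on the non-negative integers and a continuous support on the half line come back as the same pair.

```python
from scipy import stats

# frozen scipy distributions matching some of the families listed above
examples = {
    "Poisson(3)": stats.poisson(3),
    "Beta(2, 3)": stats.beta(2, 3),
    "Uniform on [1, 3]": stats.uniform(loc=1, scale=2),
    "Exponential(1)": stats.expon(),
    "Normal(0, 1)": stats.norm(),
}

# support() returns the pair (lower end point, upper end point)
supports = {name: dist.support() for name, dist in examples.items()}
for name, (lower, upper) in supports.items():
    print(f"{name}: lower={lower}, upper={upper}")
```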

VascoSch92 commented 5 months ago

That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.

Another question is whether we are interested in the domain of a distribution or in its support. I think the second one is more interesting right?

fkiraly commented 5 months ago

I think the second one is more interesting right?

Yes, for the moment it is, given that all distributions - even discrete ones - have a support that embeds canonically into the reals. With a distinction on continuous and discrete part.