Open fkiraly opened 7 months ago
FYI @VascoSch92
I will start working on a first version of a module for symbolic representation of sets.
The idea is to extend from BaseObject.
I still don't fully understand what the application in skpro.distributions would be, but I will try to mimic the API of set6.
I will open a draft PR as soon as I have something interesting, so that we can discuss the code.
Great!
So you think it's better to inherit from BaseObject than to use the existing logic in sympy?
If I may ask, what are your pros/cons and weighting? Just curious.
Actually, I was playing a bit with the set implementation in sympy, and I have to admit that it is pretty nice.
One could use it and extend it to implement the measure of a set and integral computations.
However, adding sympy to the dependencies may be overkill. Do we need that much power for the purpose of the project? Can we install just the specific module that takes care of sets?
Perhaps clarifying the exact API the project needs could lead to a decision. From what I understand, the main purpose of this module is to compute the pdf and pmf of a distribution. To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?
To this purpose, we just need to be able to represent subsets (discrete or not) of the real line (or of a finite dimensional real euclidean space). Correct?
Basically, yes - that's the key requirement.
Also, finite/discrete sets for distributions of arbitrary support, but that's basically already python set.
Yes, the "weighty dependency" argument is convincing. I'd agree it outweighs the "do not reinvent the wheel" one, as it's going to be a small wheel (for now).
Why don't we try to solve the problem directly using the integration provided by scipy?
I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.
Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?
I don't think that works for mixed distributions, i.e., mix of deltas and (abs) continuous? I'd still think you need some explicit representation of the discrete part.
yes but we cannot separate the two parts?
Should we try with scipy? Plus some custom logic where we get to mixed distr. Might be the option with the best trade-off?
Yes
yes but we cannot separate the two parts?
What do you mean by that? Do you mean this as a suggestion, or a statement?
Let's say you have a mixed distribution. Can we separate the two parts (dense and discrete), compute integrals on both and then put them together?
yes, that is exactly my thinking. But for that, you'd need to represent pmf and pdf separately. For both, you'd need some representation of domain to set up the integration, which brings us to the topic of this issue.
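A minimal sketch of the "separate, integrate, recombine" idea, assuming the clipped normal Y = max(0, X) with X normal as the mixed distribution. All names here (`clipped_normal_prob`, `simpson`, the `atoms` dictionary) are hypothetical and not part of skpro; a hand-written Simpson rule stands in for an integrator such as the one scipy provides, only to keep the example dependency-free.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of Normal(mu, sigma) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def simpson(f, a, b, n=1000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def clipped_normal_prob(a, b, mu=0.0, sigma=1.0):
    """P(a <= Y <= b) for Y = max(0, X), X ~ Normal(mu, sigma)."""
    # Discrete part: a single atom at 0 carrying the mass P(X <= 0).
    atoms = {0.0: normal_cdf(0.0, mu, sigma)}
    discrete = sum(mass for point, mass in atoms.items() if a <= point <= b)

    # Continuous part: the normal density restricted to (0, inf),
    # integrated over the overlap of [a, b] with that half-line.
    lo, hi = max(a, 0.0), max(b, 0.0)
    continuous = 0.0
    if hi > lo:
        continuous = simpson(lambda x: normal_pdf(x, mu, sigma), lo, hi)

    return discrete + continuous
```

Note that both parts need their own domain (the set of atoms, and the half-line carrying the density) before any integration can be set up, which is the point made above.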
Can you give a concrete example of a mixed distribution you would like to implement?
Sure, here are two:
Mixture to Normal and Empirical.
Sorry, but I still have trouble understanding the clipped normal example.
We have the max of two continuous functions, so shouldn't we have a continuous support for the pdf and cdf?
We have the max of two continuous functions, therefore should we have a continuous support for pdf and cdf not?
Yes, the full support is continuous, but the distribution is mixed, so by the Lebesgue decomposition theorem we can decompose it into a non-trivial absolutely continuous part and a pure point part; there is no singular part, because those aren't distributions that we want to look at (😁)
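As a worked instance of this decomposition, assume the clipped normal is Y = max(0, X) with X ~ N(μ, σ²), and write Φ and φ for the standard normal cdf and pdf (this notation is an assumption, it does not appear in the thread). For a Borel set A of the real line:

```latex
P(Y \in A)
  \;=\;
  \underbrace{\Phi\!\left(-\tfrac{\mu}{\sigma}\right)\,\mathbf{1}_A(0)}_{\text{pure point part}}
  \;+\;
  \underbrace{\int_{A \cap (0,\infty)} \frac{1}{\sigma}\,
      \varphi\!\left(\frac{x-\mu}{\sigma}\right)\,dx}_{\text{absolutely continuous part}}
```

The first term is the atom at 0 with mass P(X ≤ 0); the second is the normal density restricted to the positive half-line.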
The clipped normal has two supports, therefore:
Some confusion may come from the word "continuous", which is overloaded, as it could be used as a property or qualifier for
Ah ok, now it is much clearer. Sorry, I'm not an expert in probability theory :-(
In practice, you want to compose the normal distribution and the mass measure at 0 to get the clipped normal. In the instance of the normal distribution you give the support, and the same for the mass measure, right? From that you can compute what you need.
ok 👍 now it is clear... I can start coding something and see if it fits the needs of the package
In the instance of the normal distribution you give the support and same for the mass measure, right? from that you can compute what you need.
Yes, exactly.
i can start to try coding something and see if it fit the needs of the package
What's your design, if I may ask?
I'd go with modifying BaseDistribution. You may also be interested in this refactor PR: https://github.com/sktime/skpro/pull/265
Basic idea: a parent class Set which extends BaseObject and is more or less an interface. Then two subclasses: one for intervals (Interval) and one for discrete sets (Discrete (?)), as these are the two kinds of sets we need the most.
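A rough sketch of what this design could look like. It is an illustration only: the method names `contains` and `measure` are assumptions, and a plain abstract base class stands in for BaseObject so that the example runs without extra dependencies.

```python
import math
from abc import ABC, abstractmethod


class Set(ABC):
    """Interface-like parent class (in the proposal it would extend BaseObject)."""

    @abstractmethod
    def contains(self, x):
        """Return True if x is an element of the set."""

    @abstractmethod
    def measure(self):
        """Return the natural measure of the set."""


class Interval(Set):
    """Interval of the real line with optionally open endpoints."""

    def __init__(self, lower, upper, closed_lower=True, closed_upper=True):
        if lower > upper:
            raise ValueError("lower must not exceed upper")
        self.lower = lower
        self.upper = upper
        self.closed_lower = closed_lower
        self.closed_upper = closed_upper

    def contains(self, x):
        above = x >= self.lower if self.closed_lower else x > self.lower
        below = x <= self.upper if self.closed_upper else x < self.upper
        return above and below

    def measure(self):
        # Lebesgue measure: the length of the interval.
        return self.upper - self.lower


class Discrete(Set):
    """Finite set of points."""

    def __init__(self, points):
        self.points = frozenset(points)

    def contains(self, x):
        return x in self.points

    def measure(self):
        # Counting measure: the number of points.
        return len(self.points)


# Supports of the two parts of a clipped normal, as discussed above.
continuous_support = Interval(0.0, math.inf, closed_lower=False, closed_upper=False)
discrete_support = Discrete({0.0})
```

Here the point 0 belongs to `discrete_support` but not to `continuous_support`, which mirrors the split into a mass at 0 and a density on the positive half-line.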
Then we have the following questions:
We can expose the domain of the probability in the BaseDistribution class.
I will try to open a draft/sketch PR for feedback and guidance as soon as possible :-)
We can expose in the BaseDistribution class the domain of the probability.
Agreed - we may have to distinguish domains for the discrete and the continuous part as well.
Then we have the following questions:
Regarding requirements:
skpro distributions are "array distributions", so we probably need a comparable concept for sets. That makes it unusual, perhaps. E.g., take the Normal or Uniform from the example (Normal.create_test_instance): it has a 2D range. The Normal is supported over the reals, but the Uniform may have different support per entry. Or is this too much for the start, and overdesign?
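To make the "array of sets" requirement concrete, here is a small sketch that stores per-entry interval bounds as numpy arrays of the same 2D shape as the distribution. The helper `in_support` and the array names are hypothetical; nothing here is skpro API.

```python
import numpy as np

# Hypothetical 2x2 array Uniform: each entry has its own support [lower, upper].
uniform_lower = np.array([[0.0, -1.0], [2.0, 0.0]])
uniform_upper = np.array([[1.0, 1.0], [5.0, 0.5]])

# Hypothetical 2x2 array Normal: every entry is supported on the whole real line.
normal_lower = np.full((2, 2), -np.inf)
normal_upper = np.full((2, 2), np.inf)


def in_support(x, lower, upper):
    """Entry-wise check whether x lies in the closed interval [lower, upper]."""
    x = np.asarray(x, dtype=float)
    return (x >= lower) & (x <= upper)


# A scalar broadcasts against the array of supports.
uniform_mask = in_support(0.75, uniform_lower, uniform_upper)
normal_mask = in_support(0.75, normal_lower, normal_upper)
```

With this layout, a fixed support (Normal) and a parameter-dependent support (Uniform) are handled by the same code path, at the cost of only covering intervals.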
After a very first draft of the module domain (see branch #326), we are ready to start introducing domains for distributions.
We will work on the branch #326 until a stable API is found. After that, we will merge into main.
To find a valid and stable API, domains should be introduced for at least one of the following distributions:
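With domains, we also want to introduce two new methods:
- pdf - probability density function
- pmf - probability mass function
Questions:
- What is the expected API for pdf and pmf?
- Which new _tags should we introduce?
- Do we have at least one example already coded for each of the families of distributions listed above?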
- What is the expected API for pdf and pmf?
The API is already specified - it has been available since 2.2.2, after you branched off. If you update from main, you should see the current specs in the docstring of the BaseDistribution object.
- Which new _tags should we introduce?
That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.
- Do we have at least one example already coded for each of the families of distributions listed above?
Will reply with a list in the next post.
- discrete distributions with finite support: Delta
- discrete distributions with infinite support: Poisson
- absolutely continuous distributions supported on a bounded interval: Beta, QPD_B or Uniform
- absolutely continuous distributions supported on intervals of length 2π (directional distributions): don't have that
- absolutely continuous distributions supported on semi-infinite intervals (e.g., [0,∞)): LogNormal, Exponential
- absolutely continuous distributions supported on the whole real line: Normal
- absolutely continuous distributions with variable support: Uniform, QPD_S, and QPD_B have a support that depends on parameters - entries in an array distribution can have different support.
- mixed discrete/continuous distributions: no "atomic" distribution of this type currently, but you can construct one using Mixture of a discrete and a continuous one - hope that actually works as expected...
That's a good question. I was thinking about working with properties and attributes primarily, but now that you mention it, we may consider tags as well. I have no clear answer to this yet, input appreciated.
A question is also whether we are interested in the domain of a distribution or in its support. I think the second one is more interesting, right?
I think the second one is more interesting right?
Yes, for the moment it is, given that all distributions - even discrete ones - have a support that embeds canonically into the reals. With a distinction on continuous and discrete part.
It would be useful if distributions had some degree of inspectability with respect to their domains (sets), use cases include:
Some discussion has already taken place here: https://github.com/VascoSch92/sequentium/issues/46 also regarding possible ways to implement this.
Options discussed:
- scipy Sets, possibly also stats
- BaseObject
- sklearn.utils parameter checking
Some issues from skpro architecture which may not be obvious how to cover:
- BaseObject supports parametric objects? Composites are supported by all three options above.