giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs

Cannot understand what CubicalCover kind='balanced' does #655

Closed bioshot-dotcom closed 1 year ago

bioshot-dotcom commented 1 year ago

I can't figure out what CubicalCover(kind='balanced') does. Any suggestions?

ulupo commented 1 year ago

Hi!

This is the relevant portion of the docstring (not sure whether you had already seen it): https://github.com/giotto-ai/giotto-tda/blob/7b3e47d7debd48730dc96b49d39dce300625d793/gtda/mapper/cover.py#L35-L39

Let us know if you still have questions.

bioshot-dotcom commented 1 year ago

Thank you, yes actually I had already seen it, and I had also seen this:

Notes

In the case of a balanced cover, :meth:`left_limits_` and
:meth:`right_limits_` are computed as follows given a training array `X`:
first, entries in `X` are ranked in ascending order, starting at 1 and
with the same rank repeated in the case of equal values; then, the closed
interval :math:`(0.5, N + 0.5)`, where :math:`N` is the maximum
rank observed, is covered uniformly with parameters `n_intervals` and
`overlap_frac`, yielding intervals :math:`(\\alpha_k, \\beta_k)`;
the final cover is made of intervals :math:`(a_k, b_k)` where, for
:math:`k > 1` (resp. :math:`k < ` `n_intervals`), :math:`a_k` (resp.
:math:`b_k`) is the value of any entry in `X` ranked as the floor (
resp. ceiling) of :math:`\\alpha_k` (resp. :math:`\\beta_k`).

So, for example, if my input is X=[1,1,2,3,3,5,7,7,8,9,9,18,27] and the cover kind is uniform with given n_intervals and overlap_frac, my intervals will be x1=[1,1,2], x2=[2,3,3], x3=[3,5,7], x4=[7,7,8], x5=[8,9,9], x6=[9,18,27]. What are the intervals in the case of kind='balanced'?

ulupo commented 1 year ago

Hi! I apologize for the slow reply. Here is your example and the intervals computed by the cover:

import numpy as np
from gtda.mapper import OneDimensionalCover

n_intervals = 6
overlap_frac = 0.2
cover = OneDimensionalCover(kind='balanced', n_intervals=n_intervals, overlap_frac=overlap_frac)

X = np.array([1, 1, 2, 3, 3, 5, 7, 7, 8, 9, 9, 18, 27])

cover.fit(X)
y = cover.transform(X)
print(f"- Cover:\n{y}")

print(f"- Left limits of each cover interval: {cover.left_limits_}")
print(f"- Right limits of each cover interval: {cover.right_limits_}")
- Cover:
[[ True False False False False False]
 [ True False False False False False]
 [ True  True False False False False]
 [False  True False False False False]
 [False  True False False False False]
 [False False  True False False False]
 [False False  True  True False False]
 [False False  True  True False False]
 [False False False  True False False]
 [False False False False  True False]
 [False False False False  True False]
 [False False False False  True  True]
 [False False False False False  True]]
- Left limits of each cover interval: [-inf   1.   3.   5.   8.   9.]
- Right limits of each cover interval: [ 3.  5.  8.  9. 27. inf]

The left_limits_ and right_limits_ attributes give the endpoints of the open intervals that produce the cover, which is represented as the boolean array y. The ith column of y tells you which elements of X are in the ith cover set ("interval"), as follows:

for i in range(n_intervals):
    print(f"Cover set {i}: {X[y[:, i]]}")
Cover set 0: [1 1 2]
Cover set 1: [2 3 3]
Cover set 2: [5 7 7]
Cover set 3: [7 7 8]
Cover set 4: [ 9  9 18]
Cover set 5: [18 27]
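
For readers without giotto-tda at hand, the relationship between the limits and the boolean array can be checked with plain NumPy. This is only a sketch: the limits are copied by hand from the printed output above, and strict inequalities at both ends are an assumption based on the intervals being described as open. The names `mask`, `cover_sets` and `unique_counts` are illustrative and not part of the library.

```python
import numpy as np

X = np.array([1, 1, 2, 3, 3, 5, 7, 7, 8, 9, 9, 18, 27])

# Limits copied from the printed left_limits_ / right_limits_ above.
left_limits = np.array([-np.inf, 1., 3., 5., 8., 9.])
right_limits = np.array([3., 5., 8., 9., 27., np.inf])

# Broadcasting: mask[j, i] is True when left_limits[i] < X[j] < right_limits[i]
# (assumption: both ends are strict, i.e. the intervals are open).
mask = (X[:, None] > left_limits) & (X[:, None] < right_limits)

# Column i of the mask selects the elements of X in the ith cover set.
cover_sets = [X[mask[:, i]] for i in range(mask.shape[1])]
unique_counts = [len(np.unique(s)) for s in cover_sets]

for i, s in enumerate(cover_sets):
    print(f"Cover set {i}: {s} ({unique_counts[i]} unique values)")
```

Under these assumptions the mask has one row per entry of X and one column per interval, so it can be compared directly against y.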

As you can see, it does what it says on the "cover" (i.e. in the docstring): "approximately the same number of unique values from X is contained in each cover interval." In this case, 2 unique values from X are mapped to each cover interval.
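
To make the Notes quoted earlier in the thread more concrete, here is a rough NumPy sketch of the rank-based procedure they describe. It is not the library's code. The function name `balanced_limits_sketch` is hypothetical, and the formula used to cover the rank interval uniformly (n intervals of equal length, with consecutive intervals overlapping by a fraction `overlap_frac` of that length) is an assumption about what "covered uniformly" means. Edge cases such as repeated limits are not handled.

```python
import numpy as np


def balanced_limits_sketch(X, n_intervals, overlap_frac):
    """Hypothetical helper following the docstring Notes; not giotto-tda code."""
    X = np.asarray(X, dtype=float).ravel()
    # Dense ranks start at 1 and ties share a rank, so uniq[r - 1] is the
    # value whose rank is r, and the maximum rank N is len(uniq).
    uniq = np.unique(X)
    n_ranks = len(uniq)

    # Cover (0.5, N + 0.5) uniformly. Assumed formula: the common interval
    # length solves (n - 1) * length * (1 - overlap_frac) + length = total.
    lo, hi = 0.5, n_ranks + 0.5
    length = (hi - lo) / (n_intervals * (1 - overlap_frac) + overlap_frac)
    alphas = lo + np.arange(n_intervals) * (1 - overlap_frac) * length
    betas = alphas + length

    # a_k is the value ranked floor(alpha_k); b_k the value ranked ceil(beta_k).
    # Clipping only affects the two outermost limits, which are replaced below.
    left_ranks = np.clip(np.floor(alphas).astype(int), 1, n_ranks)
    right_ranks = np.clip(np.ceil(betas).astype(int), 1, n_ranks)
    left = uniq[left_ranks - 1]
    right = uniq[right_ranks - 1]

    # The first interval is unbounded below and the last unbounded above.
    left[0] = -np.inf
    right[-1] = np.inf
    return left, right


X = np.array([1, 1, 2, 3, 3, 5, 7, 7, 8, 9, 9, 18, 27])
left, right = balanced_limits_sketch(X, n_intervals=6, overlap_frac=0.2)
print(left)
print(right)
```

For this example there are 9 distinct values, so the rank interval is (0.5, 9.5) and each rank-space interval has length 1.8. Under the stated assumptions the sketch gives the same left and right limits as the output above, which is why each cover set ends up with the same number of unique values even though the values themselves are unevenly spaced.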