giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs

Add TopologicalVector #493

Closed gtauzin closed 5 months ago

gtauzin commented 4 years ago

Description: Add TopologicalVector.

ulupo commented 4 years ago

@gtauzin I had never seen this before! It reminds me of the fact that I once thought it would be a good idea to have a sorting transformer to sort persistence pairs by persistence, so as to have a canonical order that would make it a bit meaningful to use a neural network directly on the diagrams. But I read somewhere that this does not work so well "in practice" (though I've never tried and am a bit skeptical). Anyway, this one is cool too.
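For illustration, the sorting idea mentioned above could look roughly like the sketch below. It assumes a diagram is given as a list of (birth, death, homology dimension) triples; the function name `sort_by_persistence` is hypothetical and not part of giotto-tda.

```python
def sort_by_persistence(diagram):
    """Return the (birth, death, dim) triples sorted by decreasing persistence.

    Hypothetical helper: persistence is taken to be death - birth.
    """
    return sorted(diagram, key=lambda pair: pair[1] - pair[0], reverse=True)


# Made-up diagram with two points in dimension 0 and one in dimension 1.
diagram = [(0.0, 0.5, 0), (0.2, 1.4, 1), (0.1, 0.3, 0)]
sorted_diagram = sort_by_persistence(diagram)
print(sorted_diagram)
```

A real transformer would probably sort within each homology dimension separately; this sketch sorts all pairs together to keep the idea visible.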

ulupo commented 4 years ago

@gtauzin I'm thinking that this transformer can also be seen as a "representation" and not as a "feature generator". Of course the boundary has always been blurry, but I was under the impression that the unspoken rule is that we call "features" only the scalar features (more precisely, we allow at most one scalar per homology dimension). In that case, it would seem that TopologicalVector better belongs among BettiCurve and the like. What do you think?

gtauzin commented 4 years ago

> @gtauzin I'm thinking that this transformer can also be seen as a "representation" and not as a "feature generator". Of course the boundary has always been blurry, but I was under the impression that the unspoken rule is that we call "features" only the scalar features (more precisely, we allow at most one scalar per homology dimension). In that case, it would seem that TopologicalVector better belongs among BettiCurve and the like. What do you think?

That's an interesting point. One could argue that ComplexPolynomial is in the same situation.

If I were to define what a representation is, I would say that it is an object whose visualization helps in understanding the information contained in a persistence diagram, and from which further interesting features can be extracted. I think both TopologicalVector and ComplexPolynomial would then not qualify as representations. What do you think about this definition?

gtauzin commented 4 years ago

We should allow n_distances to be a list here, as for ComplexPolynomial (#479).

ulupo commented 4 years ago

> If I were to define what a representation is, I would say that it is an object whose visualization helps in understanding the information contained in a persistence diagram, and from which further interesting features can be extracted. I think both TopologicalVector and ComplexPolynomial would then not qualify as representations. What do you think about this definition?

IMO, we say "representations" as a shorthand for "vector representations" which in turn is a perfect synonym for "vectorizations". So the general gist for me is that these are ways for each persistence diagram to be made into high-dimensional vectors. If the vector space structure is relevant, e.g. if one can use the Euclidean distance, cosine distance, Euclidean inner product, etc. to get meaningful quantities, then in my mind we are firmly in the realm of representations. In the end, the boundary is pretty blurred I guess. To avoid getting too philosophical, I thought that having a stronger divide (one feature per hom dim vs multi-dimensional vectors per hom dim) would make our life easier, but it's not that important to me (and the user only sees the difference in the API reference, not in import statements).

My personal preference in general would be to not tie the characterization to visualization, because then one can claim that anything can be visualized, and how useful that really is feels uncomfortably subjective to me.
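To make the "vector space structure" criterion concrete, here is a minimal sketch of the three quantities named above, computed on two made-up vectors standing in for vectorized diagrams. The helper names and the numbers are assumptions for illustration only.

```python
import math


def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity; assumes neither vector is zero."""
    return 1.0 - inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))


# Made-up stand-ins for the vectorizations of two persistence diagrams.
vec_a = [3.0, 0.0, 4.0]
vec_b = [0.0, 5.0, 0.0]

print(inner(vec_a, vec_b), euclidean_distance(vec_a, vec_b), cosine_distance(vec_a, vec_b))
```

Whether such numbers are meaningful for a given vectorization is exactly the question raised in this comment; the code only shows what would be computed.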

ulupo commented 4 years ago

> We should allow n_distances to be a list here, as for ComplexPolynomial (#479).

This is now possible following #502. I'll make the change.
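As a rough sketch of what accepting either an int or a list for n_distances could mean, the hypothetical helper below expands the parameter to one value per homology dimension. It is not taken from #502 or from the giotto-tda code base; the name `expand_per_dimension` and the error message are assumptions.

```python
def expand_per_dimension(n_distances, homology_dimensions):
    """Map each homology dimension to its own n_distances value.

    Hypothetical helper: an int is broadcast to every dimension, while a
    list must provide exactly one value per dimension.
    """
    if isinstance(n_distances, int):
        return {dim: n_distances for dim in homology_dimensions}
    values = list(n_distances)
    if len(values) != len(homology_dimensions):
        raise ValueError(
            "n_distances must be an int or have one entry per homology dimension."
        )
    return dict(zip(homology_dimensions, values))


print(expand_per_dimension(10, (0, 1)))      # same value for every dimension
print(expand_per_dimension([5, 3], (0, 1)))  # one value per dimension
```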

gtauzin commented 4 years ago

> > If I were to define what a representation is, I would say that it is an object whose visualization helps in understanding the information contained in a persistence diagram, and from which further interesting features can be extracted. I think both TopologicalVector and ComplexPolynomial would then not qualify as representations. What do you think about this definition?

> IMO, we say "representations" as a shorthand for "vector representations" which in turn is a perfect synonym for "vectorizations". So the general gist for me is that these are ways for each persistence diagram to be made into high-dimensional vectors. If the vector space structure is relevant, e.g. if one can use the Euclidean distance, cosine distance, Euclidean inner product, etc. to get meaningful quantities, then in my mind we are firmly in the realm of representations. In the end, the boundary is pretty blurred I guess. To avoid getting too philosophical, I thought that having a stronger divide (one feature per hom dim vs multi-dimensional vectors per hom dim) would make our life easier, but it's not that important to me (and the user only sees the difference in the API reference, not in import statements).

> My personal preference in general would be to not tie the characterization to visualization, because then one can claim that anything can be visualized, and how useful that really is feels uncomfortably subjective to me.

From a purely practical perspective, I feel like a transformer is a "feature generator" if what it outputs is a 2D array (n_samples, n_features), and it is a "representation" if it outputs an intermediate data structure (arrays with 3 or more dimensions). If it is a "representation", it means that there are ways to extract interesting, meaningful features from it (and we should provide them) and/or that visualizing it helps to understand the data better.

Moving TopologicalVector and the like to representations.py does not have many consequences anyway (but in that case, we should have a separate version in feature.py for when n_distances=1 xD), so I won't fight for it. But it seems strange to me to think of TopologicalVector as a representation. I would say, just make it output a 3D array with homology dimension as an axis, for consistency with the rest. But this does not make sense, as it is better to be able to specify n_distances per homology dimension.
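The practical criterion can be written down as a small check on the output shape. The sketch below is only an illustration: the function name is hypothetical, the shapes are made up, and it assumes that per-dimension vectors of different lengths are concatenated along the feature axis.

```python
def classify_by_output_shape(shape):
    """Classify a transformer by the shape of the array it outputs.

    Hypothetical rule of thumb: 2D output means "feature generator",
    3 or more dimensions means "representation".
    """
    if len(shape) == 2:
        return "feature generator"
    if len(shape) >= 3:
        return "representation"
    raise ValueError("Expected an output array with at least 2 dimensions.")


n_samples = 100

# Made-up example: 5 distances in dimension 0 and 3 in dimension 1,
# concatenated into a single feature axis of length 8 (an assumption).
n_distances = {0: 5, 1: 3}
vector_shape = (n_samples, sum(n_distances.values()))

# Made-up example: a curve sampled at 50 values for each of 2 homology dimensions.
curve_shape = (n_samples, 2, 50)

print(classify_by_output_shape(vector_shape))
print(classify_by_output_shape(curve_shape))
```

With unequal n_distances per homology dimension there is no rectangular 3D array to return, which is why a 2D output is the natural choice in that case.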

ulupo commented 4 years ago

@gtauzin thanks for the patience and for the suggestions. I'm happy with the practical criterion you mentioned, that anything outputting 2D arrays is a feature generator. So let's keep both transformers here!
