Closed gtauzin closed 5 months ago
@gtauzin I had never seen this before! It reminds me of the fact that I once thought it would be a good idea to have a sorting transformer to sort persistence pairs by persistence, so as to have a canonical order that would make it a bit meaningful to use a neural network directly on the diagrams. But I read somewhere that this does not work so well "in practice" (though I've never tried and am a bit skeptical). Anyway, this one is cool too.
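For readers following along, the sorting idea mentioned above can be sketched in a few lines of NumPy. This is only an illustration, not code from the PR: the helper name `sort_by_persistence` is hypothetical, and the diagram layout (one row per point, columns birth, death, homology dimension) is an assumption.

```python
import numpy as np


def sort_by_persistence(diagram):
    """Return the rows of `diagram` ordered from most to least persistent.

    `diagram` is assumed (for this sketch) to have shape (n_points, 3),
    with columns (birth, death, homology_dimension).
    """
    diagram = np.asarray(diagram, dtype=float)
    persistence = diagram[:, 1] - diagram[:, 0]
    # A stable sort on the negated persistence keeps ties in input order.
    order = np.argsort(-persistence, kind="stable")
    return diagram[order]


diagram = np.array([
    [0.0, 0.5, 0],
    [0.1, 1.1, 0],
    [0.3, 0.4, 1],
])
sorted_diagram = sort_by_persistence(diagram)
```

The result is a canonical ordering of the pairs, which is what would make feeding the raw array to a neural network at least well defined.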
@gtauzin I'm thinking that this transformer can also be seen as a "representation" and not as a "feature generator". Of course the boundary has always been blurry, but I was under the impression that the unspoken rule is that we call "features" only the scalar features (more precisely, we allow at most one scalar per homology dimension). In that case, it would seem that TopologicalVector better belongs among BettiCurve and the like. What do you think?
That's an interesting point. One could argue that ComplexPolynomial is in the same situation.
If I were to define what a representation is, I would say that it is an object whose visualization is useful to understand the information contained in a persistence diagram and from which further interesting features can be extracted. I think both TopologicalVector and ComplexPolynomial would then not qualify as representations. What do you think about this definition?
We should allow n_distances here to be a list, as for ComplexPolynomial #479.
If I were to define what a representation is, I would say that it is an object whose visualization is useful to understand the information contained in a persistence diagram and from which further interesting features can be extracted. I think both TopologicalVector and ComplexPolynomial would then not qualify as representations. What do you think about this definition?
IMO, we say "representations" as a shorthand for "vector representations" which in turn is a perfect synonym for "vectorizations". So the general gist for me is that these are ways for each persistence diagram to be made into high-dimensional vectors. If the vector space structure is relevant, e.g. if one can use the Euclidean distance, cosine distance, Euclidean inner product, etc. to get meaningful quantities, then in my mind we are firmly in the realm of representations. In the end, the boundary is pretty blurred I guess. To avoid getting too philosophical, I thought that having a stronger divide (one feature per hom dim vs multi-dimensional vectors per hom dim) would make our life easier, but it's not that important to me (and the user only sees the difference in the API reference, not in import statements).
My personal preference in general would be to not tie the characterization to visualization, because then one can claim anything can be visualized, and how useful that really is would be uncomfortably subjective for me.
We should allow here n_distances to be a list as for ComplexPolynomial #479.
This is now possible following #502. I'll make the change.
From a purely practical perspective, I feel like a transformer is a "feature generator" if what it outputs is a 2D array (n_samples, n_features), and it is a "representation" if it outputs an intermediate data structure (arrays with 3 or more dimensions). If it is a "representation", it means that there are ways to extract interesting, meaningful features from it (and we should provide them) and/or that visualizing it helps to better understand the data.
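The practical criterion above can be written down as a small check on the output array. This is a sketch only: `classify_output` is a hypothetical helper, not part of the library, and the example shapes are made up for illustration.

```python
import numpy as np


def classify_output(array):
    """Label a transformer output following the shape-based rule of thumb.

    2D arrays (n_samples, n_features) count as "feature generator" output;
    arrays with three or more axes count as "representation" output.
    """
    array = np.asarray(array)
    if array.ndim == 2:
        return "feature generator"
    if array.ndim >= 3:
        return "representation"
    raise ValueError("Expected an array with at least 2 dimensions.")


# Illustrative shapes: one scalar per homology dimension versus
# one sampled curve per homology dimension.
scalar_features = np.zeros((10, 2))
sampled_curves = np.zeros((10, 2, 100))

labels = (classify_output(scalar_features), classify_output(sampled_curves))
```

Under this rule a transformer that concatenates per-dimension vectors into a single (n_samples, n_features) array lands on the "feature generator" side, whatever the length of those vectors.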
Moving TopologicalVector and the like to representations.py does not have many consequences anyway (but in that case, we should have a separate version in feature.py for when n_distances=1 xD), so I won't fight for it. But it is strange to me to think of TopologicalVector as a representation. I would say, just make it output a 3D array with homology dimension as an axis, for consistency with the rest. But this does not make sense, as it is better to be able to specify n_distances per homology dimension.
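To make the last point concrete, here is a rough sketch of why a per-dimension n_distances fits a flat 2D output better than a 3D array: the per-dimension vectors can have different lengths, so they are concatenated rather than stacked. Everything here is an assumption for illustration. The function names are hypothetical, the diagram layout (birth, death, homology dimension) is assumed, and the distance formula follows the construction of Carrière, Oudot and Ovsjanikov, which this thread does not spell out and which may differ from what the PR implements.

```python
import numpy as np


def _vector_one_dimension(points, n_distances):
    """Sorted, zero-padded vector for the (birth, death) points of one dimension."""
    out = np.zeros(n_distances)
    if len(points) < 2:
        return out  # no pairs to compare
    # Sup-norm distance of each point to the diagonal.
    diag_dist = (points[:, 1] - points[:, 0]) / 2
    i, j = np.triu_indices(len(points), k=1)
    # Sup-norm distance between the two points of each pair.
    pair_dist = np.max(np.abs(points[i] - points[j]), axis=1)
    values = np.minimum(pair_dist, np.minimum(diag_dist[i], diag_dist[j]))
    values = np.sort(values)[::-1][:n_distances]
    out[:len(values)] = values
    return out


def topological_vector(diagram, n_distances):
    """Concatenate per-dimension vectors; `n_distances` is an int or a list."""
    diagram = np.asarray(diagram, dtype=float)
    dims = np.unique(diagram[:, 2])
    if np.isscalar(n_distances):
        n_distances = [n_distances] * len(dims)
    if len(n_distances) != len(dims):
        raise ValueError("Need one n_distances entry per homology dimension.")
    parts = [
        _vector_one_dimension(diagram[diagram[:, 2] == dim, :2], n)
        for dim, n in zip(dims, n_distances)
    ]
    return np.concatenate(parts)


diagram = np.array([[0.0, 1.0, 0], [0.0, 2.0, 0], [0.5, 1.0, 1]])
vector = topological_vector(diagram, n_distances=[2, 1])
```

With n_distances=[2, 1] the output has 3 entries per sample, which cannot be reshaped into a regular (n_homology_dimensions, n_distances) block.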
@gtauzin thanks for the patience and for the suggestions. I'm happy with the practical criterion you mentioned, that anything outputting 2D arrays is a feature generator. So let's keep both transformers here!
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.
:white_check_mark: ulupo
:x: Guillaume Tauzin
Reference issues/PRs
Types of changes
Description
Add TopologicalVector.
Screenshots (if appropriate)
Any other comments?
Checklist
flake8 to check my Python changes.
pytest to check this on Python tests.