ulupo closed this 4 years ago.
I think there was a good reason why `CubicalPersistence` was not treating infinite bars the same way as the simplicial persistence transformers.
I would need to think about it a bit more.
@gtauzin thanks! Yes, I too remembered we had a conversation about this once, and forgot to ask you here!
However, when discussing with @miltminz yesterday as he was working on a use case, we saw that the following problem currently emerges: typically, there would be a final bar and it would be the largest of all, so e.g. results from various `Amplitude` computations will have a strong bias because of this (not very informative) feature.
Thanks @gtauzin! The problem with reintroducing the `_pad_diagram` function is that such a function would look completely different from what is currently there. This is because the new logic of `_postprocess_diagrams` has the final 3D array initialised at the beginning (`np.empty` with the right shape, which is known after the maximum number of points per dimension is computed), and the subdiagrams are simply inserted in place (because, again, the start index positions for each homology dimension are known a priori). The rest is filled with the padding triples, again in place. The rationale for this design is that we remove the need for creating copies (via `np.pad`, `np.stack`, etc.) and simply operate in place, which is AFAIK faster and cheaper on memory.
I can forcefully introduce a `_pad_diagram` function that works in the current design, but it would be a bit contrived for use in `_postprocess_diagrams` and I'm not sure how useful you'd find it. In which circumstances do you expect to need this function, and with what specs? Of course, I'm not in principle against keeping the legacy function even if it is not used in `_postprocess_diagrams` but is used elsewhere.
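For the record, the in-place idea can be sketched roughly as follows (a toy illustration with a single homology dimension and zero-padding triples; not the actual `_postprocess_diagrams` code):

```python
import numpy as np

# Two toy 0-dimensional subdiagrams with different numbers of points.
subdiagrams = [np.array([[0.0, 0.5], [0.1, 0.9]]),
               np.array([[0.0, 0.3]])]

n_samples = len(subdiagrams)
max_n_points = max(len(s) for s in subdiagrams)

# Allocate the final 3D array once; triples are (birth, death, homology dim).
diagrams = np.empty((n_samples, max_n_points, 3))
diagrams[:] = 0.                         # padding triples (0., 0., 0.)
for i, sub in enumerate(subdiagrams):
    diagrams[i, :len(sub), :2] = sub     # insert each subdiagram in place
```

The full array is allocated once and every write lands in a precomputed slot, so no intermediate copies via `np.pad` or `np.stack` are needed.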
@gtauzin still in the saga of `CubicalPersistence`, I wonder why we have the current convention in place for the case `infinity_values=None`. Why is it not analogous to the simplicial transformers, where we would say that `None` means `np.inf`, since for `CubicalPersistence` there is no analog of `max_edge_length` and therefore (I presume) we run the persistence algorithm all the way through the pixel values?
Thanks for the explanations @gtauzin !
But when we pass the adjacency matrix of a graph, it may not be that trivial, as we remove the bar (min_persistence_value, 0). In flagser, for example, you can pass the vertex weights on the diagonal, so there can be information in the birth of the final connected component.
That is a fair point, and by the way it is also valid for `ripser`, which treats vertex weights in the same way, and hence for `VietorisRipsPersistence`.
As for cubical, you can't set the infinity value to `np.inf` by default; otherwise, you won't be able to extract any features from diagrams created with the default `CubicalPersistence`.
I'm afraid I don't follow this point. What would go wrong if you did on a typical image? I just tested after changing the default `infinity_values` to `np.inf`, and the following code:
```python
import numpy as np
from gtda.homology import CubicalPersistence

images = np.random.random((1, 5, 5))
CubicalPersistence().fit_transform(images)
```
gives me results identical to using `infinity_values` equal to `np.max(X)` as default (assuming the infinite bar is removed; otherwise, of course, the results differ by the presence of that extra bar).
I'm starting to think that the best thing we can do is give the user the option to avoid killing the infinite bar via a parameter in the transformer constructor. By default, bars should probably still be removed in the simplicial transformers (and I think also in cubical, for the reasons mentioned above), but the user should be able to get all the information they need if they want. However, they then proceed at their own risk for e.g. the transformers in `gtda.diagrams`, which have really not been battle-tested against the presence of infinities.
@gtauzin I have now implemented a proposal for allowing the user to switch off the removal of the H0 infinite bar if they so wish. There is now a `reduced_homology` kwarg in all simplicial and cubical transformers, which defaults to `True` (even for cubical, though I realise we are still debating this). The current kwarg name is just because this is what I think it's doing mathematically.
I still think it's weird that we use `np.max(X)` if `infinity_values=None` in `CubicalTransformer` (see above), but I haven't changed that.
Note to self: the case `reduced_homology=False` now needs to be covered in unit tests.
@ulupo I am leaning towards the other solution, actually:

1. Calculate the diameter of the point cloud, i.e. the biggest distance between two points for simplicial, or the highest pixel value for cubical. Use twice that value as the default `infinity_value`, no matter what `max_edge/weight_length` is.
2. Do not remove any points.

This has the advantage of being:

1. Correct: you always have all the points, and even someone who doesn't read the documentation will understand what the output is.
2. Meaningful: infinite bars are always the most persistent ones (because they are at most 2 diameters). Right now, if an infinite bar is born at `max_edge_length`, it ends up having 0 persistence! This is clearly broken. With my suggestion, it would have a persistence value corresponding to the diameter of the point cloud, i.e. the biggest one in the diagram.
3. Modular: diagrams almost always have to be postprocessed to be used in ML tasks, and this should be done in a modular way in one of the postprocessing transformers, not here (*). Even outside of this `infinity_values` discussion, users should understand that they have to look at diagrams and postprocess them to create TDA pipelines.
(*) I have been meaning to open a PR with an `above` option to the `Filtering` transformer, because in practice you almost always have to filter the most persistent points to be able to extract features from diagrams.
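A minimal sketch of the proposed default (a hypothetical helper, not existing giotto-tda code; twice the point-cloud diameter used as `infinity_values`):

```python
import numpy as np

def default_infinity_value(point_cloud):
    """Twice the diameter (largest pairwise distance) of the point cloud."""
    diffs = point_cloud[:, None, :] - point_cloud[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum(axis=-1)).max()
    return 2 * diameter

rng = np.random.default_rng(0)
cloud = rng.random((10, 3))
inf_val = default_infinity_value(cloud)
# A bar capped at inf_val is at least as persistent as any finite bar,
# since finite lifetimes cannot exceed one diameter.
```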
Also, *reduced homology* is a thing, so it could be confusing for the user to use a term that does not have the right mathematical meaning. I am not against the option; I just believe the default `infinity_values` behavior should be improved.
Oh dear, we have a disagreement here :) Maybe we could involve other collaborators! @ammedmar, @lewtun what do you make of all of this? We can take it offline in a dev discussion if you like, I'm flexible.
@ulupo, I suggest we handle that the mature way!
Arm wrestling xD
Haha, emotions running high!
Can you maybe explain why you disagree? I could be convinced, who knows?! :)
And maybe address the problematic point of the default value of `infinity_values` being `max_edge_length`?
Sure, and it would help others better understand. There are two connected but logically distinct issues here. I'll treat them separately to begin with.
Here my proposal is that, mathematically, we should always return reduced persistent homology to the user by default, and "ordinary" homology only if they want it. Note that this does not mean returning incorrect barcodes, just barcodes for reduced PH! My understanding is that reduced PH corresponds to removing the infinite bar associated (because of the elder rule) with the first 0-simplex which enters the filtration. And again my understanding is that `ripser` always returns this bar in the last position, while GUDHI always returns it first among the bars in dimension 0 (this to explain how we remove it, implementationally speaking). I would love for others to double-check this claim, of course.
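A minimal sketch of that removal, assuming the backend conventions described above hold (infinite H0 bar last in ripser-style output, first in GUDHI-style output):

```python
import numpy as np

def remove_infinite_h0_bar(h0_bars, fmt):
    """Drop the single infinite H0 bar, located by backend convention.

    fmt: "ripser" (infinite bar last) or "gudhi" (infinite bar first),
    per the conventions discussed in this thread.
    """
    if fmt == "ripser":
        return h0_bars[:-1]
    return h0_bars[1:]

bars = np.array([[0.0, 0.4], [0.0, np.inf]])   # ripser-style: infinity last
reduced = remove_infinite_h0_bar(bars, "ripser")
```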
Why this? Here are my reasons, in ascending order of strength according to me:

1. `CubicalPersistence` currently does something which is neither removing the infinite bar, nor keeping it. Instead it changes the death time for that bar to the highest filtration value seen in the training set. This is the "logically distinct issue" below, so I won't elaborate on it here.
2. Notice that, unfortunately, none of our feature- or representation-generation transformers have been extensively tested for the presence of infinite values in the diagrams, so at the very least there would be a substantial development overhead in making this breaking change for the next release.
3. How many users turn to giotto-tda because they want to make a custom transformer, but every time have to deal with the presence of infinities? My intuition is that the balance is in favour of those who won't need non-zero vertex weights when using `VietorisRipsPersistence`, `SparseRipsPersistence`, or `FlagserPersistence`. In the case of `CubicalPersistence`, it is true that reduced homology will always throw away a potentially non-trivial piece of information in the lowest pixel value, and this is true even for basic users.

In my proposal, power users will still be able to compute barcodes for non-reduced PH if they so wish (an enhancement over the previous versions, without a breaking change). They just need to pass `reduced_homology=False` -- suggestions for better names welcome! If they want that minimum pixel value in the case of `CubicalPersistence`, or the minimum vertex weight in `VietorisRipsPersistence` or `FlagserPersistence`, they can still get it, but the trade-off is getting an infinity unless they pass something for `infinity_values` (sub-proposal: we could allow them to pass the string `'max'`, meaning the maximum edge value seen, or maximum pixel value seen...).
This leads me to the second issue...
### `CubicalPersistence` consistent with the simplicial transformers

At the moment, we have the following logic in all simplicial transformers:

- If you pass a finite `max_edge_length` (or `max_edge_weight` in the flagser case), the filtration stops there; otherwise the default value is `numpy.inf`.
- If you pass a finite `infinity_values`, any death which is reported as being at infinity by the C++ backends -- either because they went all the way to infinity and the features did not die, or because they stopped earlier and the death time cannot be known -- is brought to `infinity_values`. If this causes some bars to have birth > death, these bars are simply thrown away. If you pass `None`, the death is brought to `max_edge_length`, whatever that was.

Hence, the default behaviour is to go all the way to infinity, and to leave any infinite death times -- whether they are in H0 or not -- at infinity.
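A minimal sketch of that postprocessing step (a toy illustration, not the actual `_postprocess_diagrams` code):

```python
import numpy as np

def postprocess(bars, infinity_values):
    """Replace infinite deaths by `infinity_values`, then drop any bar
    whose lifetime has become non-positive (birth >= death)."""
    bars = bars.copy()
    bars[np.isinf(bars[:, 1]), 1] = infinity_values
    return bars[bars[:, 1] > bars[:, 0]]

bars = np.array([[0.0, 0.7], [0.5, np.inf], [0.9, np.inf]])
out = postprocess(bars, infinity_values=0.8)
# The bar born at 0.9 gets death 0.8 < birth, so it is thrown away.
```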
My proposal here is to make `CubicalPersistence` follow this logic as closely as possible. Now:

- `CubicalPersistence` does not offer a `max_pixel_value` kwarg analogous to `max_edge_length`. The filtration just goes all the way. This corresponds to `max_edge_length=np.inf` in the simplicial case.
- What should `infinity_values=None` mean? I argue that, by strict analogy with the simplicial case, it should mean "leave any infinities as they are".

My understanding of the different philosophies is: I would prefer to have exact reduced homology computed by default, unless explicitly stated otherwise, in all cases (simplicial or not). I would prefer not to replace death values by finite values (even if they are dataset-dependent), so we can simply say: the default behaviour is computing reduced homology, but otherwise in the same sense as GUDHI and ripser. On the other hand, @gtauzin is arguing in favour of such replacements, by default.
One practical argument I brought forward against keeping the H0 feature, but with infinity replaced by the max pixel value, was the observation made by @miltminz, when using `CubicalPersistence` on his image data, that this H0 feature tends to skew all scalar features (entropy, amplitudes, etc.) strongly and makes it more difficult to discriminate effectively between different training data points.
Thanks for watching
Hi! The name reduced homology is the right term for what the proposal is doing: a well-known concept that has some algebraic advantages. I see why in the cubical case one loses the smallest non-zero pixel value, but that seems to be information that can be easily retrieved from a non-topological analysis of the image set. I see the use of reduced homology as not conflicting with Guillaume's "renormalization" procedure, transforming infinite bars into finite ones in a meaningful way. Although, renormalizing seems to be at odds with the principle of "least postprocessing" as stated in Guillaume's point 3.
In summary, from my point of view: 1) I prefer reduced homology and 2) I am impartial to renormalization.
Let me write a reply with big titles too, but instead of suggesting anything I will only raise issues about behavioral problems that I have raised in my previous comments, though probably not in the right way, seeing your reply. Sorry about that.
I do agree that they are almost always a nuisance.
My first comment was about whether there was any information contained exclusively in the birth of that infinite bar that could be harmful to remove. However, you convinced me: I love the `reduced` option and I am also for a default behavior in which reduced homology is computed. Especially as it has a mathematical meaning!
Those points can have an overwhelming impact on diagram features, and there should be a way to filter them. I suggest adding an `above` option to the diagram `Filtering` transformer.
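A minimal sketch of what such an `above` option could do (hypothetical; the existing `Filtering` transformer in `gtda.diagrams` cuts points whose persistence falls below a threshold):

```python
import numpy as np

def filter_bars(bars, epsilon, above=False):
    """Keep bars by persistence: cut low-persistence bars by default or,
    with above=True, cut the most persistent bars instead."""
    persistence = bars[:, 1] - bars[:, 0]
    mask = persistence <= epsilon if above else persistence > epsilon
    return bars[mask]

bars = np.array([[0.0, 0.05], [0.1, 0.9], [0.2, 0.35]])
kept = filter_bars(bars, epsilon=0.5, above=True)   # drops the (0.1, 0.9) bar
```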
### `CubicalPersistence` consistent with the rest

First, I completely agree it should be consistent! I actually never disagreed, I just question what this common ground should be. Indeed, I believe that:

- While `infinity_values = None` meaning "leave all infinite bars" is no problem in itself, it should not be the default behavior.
- When `max_edge/weight_length = max` is fixed, an infinite bar whose birth happens right before `max` should have a death value that really characterizes its large persistence relative to other points (hence my proposal of having as default "twice the diameter / max pixel value of all point clouds / images in the collection", which would ensure that).

### The `max_edge_length` and `infinity_values` parameters

I believe there is currently the issue that we really don't know how to choose those parameters, as they strongly depend on the type of data you are dealing with. This is very similar to the problem we had when we wanted to ask the user to give a threshold for filtering diagrams. I would like to suggest the same solution, which is to have a scaler for point clouds / graphs that ensures that the "diameter" of the point cloud is rescaled to 1.
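A minimal sketch of such a scaler (a hypothetical helper, not an existing giotto-tda transformer):

```python
import numpy as np

def rescale_to_unit_diameter(point_cloud):
    """Rescale a point cloud so that its diameter (largest pairwise
    distance) becomes 1, as suggested above."""
    diffs = point_cloud[:, None, :] - point_cloud[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum(axis=-1)).max()
    return point_cloud / diameter

rng = np.random.default_rng(42)
cloud = rescale_to_unit_diameter(rng.random((20, 2)))
```

After this rescaling, thresholds such as a filtering epsilon have the same meaning across datasets, since all lifetimes live in [0, 1].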
I wrote a comment yesterday after Guillaume's but I do not see it today, so here it is again: I also agree that "A TDA pipeline initialized with default values should always run". This is probably clear to everyone else but, what breaks if infinities are left in? Anyway, if something unavoidably breaks, I like the idea of using a principled renormalization, and twice the largest possible length as the right endpoint seems pretty good. About rescaling the data itself, I am less sure.
Hello everybody! Since I'm often in contact with @ulupo for testing and suggesting enhancements, he asked me to leave some comments as well. So here is my opinion, as far as I could understand:

- I like the `reduced` option as default for computing diagrams, and I like the idea that we can change this if we want. However, I wonder what happens right after if one wants to compute, for example, persistence entropy. Suppose you have a diagram with only two H0 points, one with infinite death and another one far from the diagonal. If the new default behaviour kills the infinity point, the other one is left in the diagram and we will have 0 entropy. Is this consistent with the diagram? I would expect the surviving point to count in the various metrics that can be computed from the diagrams.
- Regarding `infinity_values`, I would prefer as default behaviour to have `numpy.inf` associated to infinite bars, which is conceptually right. Now the question is what to do with these infinite values if one wants to compute metrics for the diagrams. How are these infinity values treated in the mathematical computation of entropy, amplitude, persistence images, etc.? One remark about images: it would definitely be nice to have a `max_pixel_value` instead of infinity in case the images have a finite pixel value range, say [0, 1]. But again I raise the question of how this will influence the various metrics if it is not treated as infinite and not dropped from the diagram, when not in the reduced homology situation.

Other remarks that may or may not be connected with those above:

- I tried `CubicalPersistence` and various different metrics (`PersistenceEntropy`, `Amplitude`) on images with values scaled to [0, 1] and not. The correlations of the topological features obtained in the output are different in the two cases, which is confusing. I wonder whether, for the sake of the principle of scale-invariance of diagrams, we should get similar values, meaning finally scaling amplitudes in order to get the same metrics from essentially two images which are the same up to scaling.
- I also tried adding `Scaler()` and `Filtering(0.01)` to the pipeline. Then again, the same pipeline applied to scaled images produces a completely different result than the one applied to non-scaled images. In this case the epsilon value in the filtering should somehow be adapted to whether the image is scaled or not, to get the same output. So finally my question is: should you bother about unifying diagrams and metrics up to scale-invariance, or should that be treated separately by the user (i.e. scaling images or not and dealing with different outputs)?

I hope it helps!
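To illustrate the entropy point above with the standard persistence-entropy formula (normalise lifetimes to probabilities p_i, then E = -sum p_i log p_i): a diagram left with a single bar always has entropy 0, whatever the bar's persistence.

```python
import numpy as np

def persistence_entropy(bars):
    """Shannon entropy of the normalised bar lifetimes."""
    lifetimes = bars[:, 1] - bars[:, 0]
    p = lifetimes / lifetimes.sum()
    return -(p * np.log(p)).sum()

# Two H0 bars, one infinite; removing the infinite one leaves a single bar.
finite_bar = np.array([[0.0, 0.8]])
entropy = persistence_entropy(finite_bar)   # p = [1.], so entropy is 0
```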
@gtauzin the new `check_collection` introduced in #460 is now employed in `CubicalPersistence`, thus allowing list input. If I understood how we intend to move forward, this PR is ready for a final review and merge.
@ulupo I will do that now!
@miltminz Thank you very much for your feedback :)
Something went wrong at GitHub's end and for some reason this was not squashed when merged (despite me trying to do that).
Actually, worse still, it seems to have been both merged and squashed-and-merged. Pfff what a nightmare. Hopefully the end result is what we want anyway.
Reference issues/PRs

Documentation-wise, this is related to #92 but applies to `CubicalPersistence`.

Types of changes

Description

At the user level, this PR achieves three things:

- Adds a `reduced_homology` kwarg to all simplicial and cubical transformers. When `True`, we remove one infinite bar in H0 for the user automatically. The default is always `True` and corresponds to what we've been doing so far in all simplicial transformers. However, in the cubical case this causes a breaking change (see below).
- Introduces in `CubicalPersistence` the removal of potential negative-lifetime bars, which can happen if `infinity_values` is set too low.
- Introduces in `CubicalPersistence` the removal of one infinite bar in dimension 0 by default.

At the code level, the postprocessing of diagrams is unified between the simplicial and cubical transformers, and between gudhi-style and ripser/flagser-style output formats. This is achieved by refactoring `_postprocess_diagrams` and adding a new argument `"format"` to it.

Checklist

- Ran `flake8` to check my Python changes.
- Added/modified tests to cover my changes.
- Ran `pytest` to check this on Python tests.