giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs

Add reduced_homology kwarg to control dropping of infinite bar in H0, fix treatment of infinity_values in CubicalPersistence, uniformize postprocessing of diagrams in gtda.homology #467

Closed ulupo closed 4 years ago

ulupo commented 4 years ago

Reference issues/PRs Documentation-wise, this is related to #92 but applies to CubicalPersistence.

Types of changes

Description At the user level, this PR achieves three things:

At the code level, the postprocessing of diagrams is unified between the simplicial and cubical transformers and between gudhi-style and ripser/flagser-style output formats. This is achieved by refactoring _postprocess_diagrams and adding a new argument "format" to it.

Checklist

gtauzin commented 4 years ago

I think there was a good reason why CubicalPersistence was not treating infinite bars the same way as the simplicial persistence transformers do.

I would need to think about it a bit more.

ulupo commented 4 years ago

@gtauzin thanks! Yes, I too remembered we had a conversation about this once, and forgot to ask you here!

However, while discussing with @miltminz yesterday as he was working on a use case, we saw the following problem emerge: typically, there is a final bar and it is the largest of all, so results from e.g. various Amplitude computations end up strongly biased by this (not very informative) feature.

ulupo commented 4 years ago

Thanks @gtauzin! The problem with reintroducing the _pad_diagram function is that such a function would look completely different from what is currently there. This is because the new logic of _postprocess_diagrams initialises the final 3D array at the beginning (np.empty with the right shape, which is known once the maximum number of points per dimension has been computed), and the subdiagrams are simply inserted in place (because, again, the start index positions for each homology dimension are known a priori). The rest is filled with the padding triples, again in place. The rationale for this design is to remove the need for creating copies (via np.pad, np.stack, etc.) and simply operate in place, which is AFAIK faster and cheaper on memory.
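The design described above can be illustrated roughly as follows (hypothetical simplified code for this discussion, not the actual implementation):

```python
import numpy as np

def pack_diagrams(subdiagrams_per_sample, homology_dimensions):
    # Hypothetical simplified version of the in-place design: preallocate
    # the final (n_samples, n_points_total, 3) array once, then write each
    # subdiagram into its known slot instead of using np.pad / np.stack.
    max_per_dim = {
        dim: max(len(d[dim]) for d in subdiagrams_per_sample)
        for dim in homology_dimensions
    }
    n_total = sum(max_per_dim.values())
    out = np.empty((len(subdiagrams_per_sample), n_total, 3))
    start = 0
    for dim in homology_dimensions:
        end = start + max_per_dim[dim]
        for i, d in enumerate(subdiagrams_per_sample):
            pts = np.asarray(d[dim], dtype=float).reshape(-1, 2)
            out[i, start:start + len(pts), :2] = pts  # real points, in place
            out[i, start + len(pts):end, :2] = 0.     # padding triples
            out[i, start:end, 2] = dim                # homology dimension
        start = end
    return out
```

The key point is that `out` is allocated exactly once and every write is a slice assignment into it, so no intermediate copies are created.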

I can forcefully introduce a _pad_diagram function that works in the current design, but it would be a bit contrived for use in _postprocess_diagrams and I'm not sure how useful you'd find it. In which circumstances do you expect to need this function, and with what specs? Of course, I'm not in principle against keeping the legacy function even if it is not used in _postprocess_diagrams but is used elsewhere.

ulupo commented 4 years ago

@gtauzin still in the saga of CubicalPersistence, I wonder why we have the current convention in place for the case infinity_values=None. Why is it not analogous to the simplicial transformers, where we would say that None means np.inf -- since for CubicalPersistence there is no analog of max_edge_length and therefore (I presume) we run the persistence algorithm all the way through pixel values?

ulupo commented 4 years ago

Thanks for the explanations @gtauzin !

> But when we pass the adjacency matrix of a graph, it may not be that trivial, as we remove the bar (min_persistence_value, 0). In flagser, for example, you can pass the vertex weights on the diagonal, so there can be information in the birth of the final connected component.

That is a fair point, and by the way it is also valid for ripser which treats vertex weights in the same way, and hence for VietorisRipsPersistence.

> As for cubical, you can't set the infinity value to be np.inf by default; otherwise, you won't be able to extract any features from diagrams created with the default CubicalPersistence.

I'm afraid I don't follow this point. What would go wrong if you did this on a typical image? I just tested after changing the default infinity_values to np.inf, and the following code:

import numpy as np
from gtda.homology import CubicalPersistence

images = np.random.random((1, 5, 5))
CubicalPersistence().fit_transform(images)

gives me identical results to using infinity_values equal to np.max(X) as the default (assuming the infinite bar is removed; otherwise, of course, the results differ by the presence of that extra bar).

I'm starting to think that the best thing we can do is give the user the option to avoid killing the infinite bar via a parameter in the transformer constructor. By default, bars should probably still be removed in the simplicial transformers (and I think also in cubical for the reasons mentioned above), but the user should be able to get all the information they need if they want. However, they then proceed at their own risk for e.g. the transformers in gtda.diagrams which have really not been battle-tested against the presence of infinities.

ulupo commented 4 years ago

@gtauzin I have now implemented a proposal for allowing the user to switch off the removal of the H0 infinite bar if they so wish. There is now a reduced_homology kwarg in all simplicial and cubical transformers, which defaults to True (even for cubical, though I realise we are still debating this). The current kwarg name reflects what I think it is doing mathematically.

I still think it's weird that we use np.max(X) if infinity_values=None in CubicalPersistence (see above), but I haven't changed that.

Note to self. The case reduced_homology=False needs to now be covered in unit tests.

gtauzin commented 4 years ago

@ulupo I am leaning towards the other solution, actually:

1. Calculate the diameter of the point cloud (i.e. the biggest distance between two points) for simplicial, or the highest pixel value for cubical. Use twice that value as the default infinity_values, no matter what max_edge/weight_length is.
2. Do not remove any points.

This has the advantage of being:

1. Correct: you always have all the points, and even someone who doesn't read the documentation will understand what the output is.
2. Meaningful: infinite bars are always the most persistent ones (because they are capped at 2 diameters). Right now, if an infinite bar is born at max_edge_length, it ends up having 0 persistence! This is clearly broken. With my suggestion, it would have a persistence value corresponding to the diameter of the point cloud, i.e. the biggest one in the diagram.
3. Modular: diagrams almost always have to be postprocessed to be used in ML tasks, and this should be done in a modular way in one of the postprocessing transformers, not here (*). Even outside of this infinity_values discussion, users should understand that they have to look at diagrams and postprocess them to create TDA pipelines.

(*) I have been meaning to open a PR adding an "above" option to the Filtering transformer, because in practice you almost always have to filter the most persistent points to be able to extract features from diagrams.

Also, reduced homology is a thing, so it could be confusing for the user to use a term that does not have the right mathematical meaning. I am not against the option. I just believe the default infinity_values behavior should be improved.
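A minimal sketch of the capping in point 1 above, assuming Euclidean point cloud input (`cap_infinities` is a hypothetical helper name, not part of the library):

```python
import numpy as np

# Hypothetical sketch of the proposed default (not the library's current
# behaviour): replace infinite death values by twice the point cloud's
# diameter, so infinite bars become the most persistent finite ones.
def cap_infinities(diagram, point_cloud):
    pts = np.asarray(point_cloud, dtype=float)
    diameter = np.linalg.norm(pts[:, None] - pts[None], axis=-1).max()
    capped = np.asarray(diagram, dtype=float).copy()
    capped[np.isinf(capped[:, 1]), 1] = 2 * diameter
    return capped

points = [[0., 0.], [1., 0.], [0., 1.]]        # diameter = sqrt(2)
diagram = [[0., float("inf")], [0., 0.8]]      # one infinite H0 bar
print(cap_infinities(diagram, points))
```

With this convention, no point is ever dropped and an infinite bar's persistence is always at least the diameter, so it dominates every finite bar by construction.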

ulupo commented 4 years ago

Oh dear, we have a disagreement here :) Maybe we could involve other collaborators! @ammedmar, @lewtun what do you make of all of this? We can take it offline in a dev discussion if you like, I'm flexible.

gtauzin commented 4 years ago

@ulupo, I suggest we handle that the mature way!

Arm wrestling xD

ulupo commented 4 years ago

Haha, emotions running high!

gtauzin commented 4 years ago

Can you maybe explain why you disagree? I could be convinced, who knows?! :)

And maybe address the problematic point of the default value of infinity_values being max_edge_length?

ulupo commented 4 years ago

Sure, and it would help others better understand. There are two connected but logically distinct issues here. I'll treat them separately to begin with.

On removing one infinite bar in degree 0 by default

Here my proposal is that, mathematically, we should always return reduced persistent homology to the user by default, and "ordinary" homology only if they want it. Note that this does not mean returning incorrect barcodes, just barcodes for reduced PH! My understanding is that reduced PH corresponds to removing the infinite bar associated (by the elder rule) with the first 0-simplex which enters the filtration. And again, my understanding is that ripser always returns this bar in the last position, while GUDHI always returns it first among the bars in dimension 0 (this explains how we remove it, implementationally speaking). I would love for others to double-check this claim, of course.
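A toy numpy illustration of the two output conventions just described (the diagrams are made up; only the position of the infinite H0 bar matters):

```python
import numpy as np

# Hypothetical H0 diagrams for the same filtration, as each backend
# is understood to order them: ripser puts the infinite bar last,
# GUDHI puts it first.
ripser_h0 = np.array([[0.0, 0.3], [0.1, 0.5], [0.0, np.inf]])  # last row
gudhi_h0 = np.array([[0.0, np.inf], [0.0, 0.3], [0.1, 0.5]])   # first row

# "Reduced PH" then just means dropping that one row.
reduced_from_ripser = ripser_h0[:-1]
reduced_from_gudhi = gudhi_h0[1:]

# Both conventions leave the same finite bars (up to ordering)
assert np.array_equal(np.sort(reduced_from_ripser, axis=0),
                      np.sort(reduced_from_gudhi, axis=0))
```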

Why this? Here are my reasons in ascending order of strength according to me:

In my proposal, power users will still be able to compute barcodes for non-reduced PH if they so wish (an enhancement over the previous versions, without a breaking change). They just need to pass reduced_homology=False -- suggestions for better names welcome! If they want the minimum pixel value in the case of CubicalPersistence, or the minimum vertex weight in VietorisRipsPersistence or FlagserPersistence, they can still get it, but the trade-off is getting an infinity unless they pass something for infinity_values (sub-proposal: we could allow them to pass the string 'max', meaning the maximum edge value seen, or the maximum pixel value seen...).

This leads me to the second issue...

On making the behaviour of CubicalPersistence consistent with the simplicial transformers

At the moment, we have the following logic in all simplicial transformers:

Hence, the default behaviour is to go all the way to infinity, and to leave any infinite death times -- whether they are in H_0 or not -- at infinity.

My proposal here is to make CubicalPersistence follow this logic as closely as possible. Now:

On the agenda

My understanding of the different philosophies is as follows. I would prefer to have exact reduced homology computed by default, unless explicitly stated otherwise, in all cases (simplicial or not). I would prefer not to replace infinite death values by finite values (even if they are dataset-dependent), so we can simply say: the default behaviour is computing reduced homology, but otherwise in the same sense as GUDHI and ripser. On the other hand, @gtauzin is arguing in favour of such replacements, by default.

One practical argument I brought forward against keeping the H0 feature but with infinity replaced with the max pixel value was the observation made by @miltminz, when using CubicalPersistence in his image data, that this H0 feature tends to skew all scalar features (entropy, amplitudes, etc) strongly and makes it more difficult to discriminate effectively between different training data points.

Thanks for watching

ammedmar commented 4 years ago

Hi! The name reduced homology is the right term for what the proposal is doing: a well-known concept that has some algebraic advantages. I see why in the cubical case one loses the smallest non-zero pixel value, but that seems to be information that can easily be retrieved from a non-topological analysis of the image set. I see the use of reduced homology as not conflicting with Guillaume's "renormalization" procedure, transforming infinite bars into finite ones in a meaningful way. Although renormalizing seems to be at odds with the principle of "least postprocessing" as stated in Guillaume's point 3.

In summary, from my point of view: 1) I prefer reduced homology and 2) I am impartial to renormalization.

gtauzin commented 4 years ago

Let me write a reply with big titles too. Instead of suggesting anything, I will only raise the behavioural issues that I raised in my previous comments, but probably not in the right way, judging by your reply. Sorry about that.

On removing one infinite bar for H0

I do agree that they are almost always a nuisance. My first comment was about whether there is any information contained exclusively in the birth of that infinite bar that would be harmful to lose if we removed it. However, you convinced me: I love the reduced option, and I am also for a default behavior in which reduced homology is computed, especially as it has a mathematical meaning!

On filtering points with large persistence but present in all diagrams

Those points can have an overwhelming impact on diagram features, and there should be a way to filter them. I suggest adding an "above" option to the diagram Filtering transformer.

On making CubicalPersistence consistent with the rest

First, I completely agree it should be consistent! I actually never disagreed, I just question what this common ground should be. Indeed, I believe that:

On setting max_edge_length and infinite_values parameters

I believe there is currently the issue that we really don't know how to choose those parameters, as they strongly depend on the type of data you are dealing with. This is very similar to the problem we had when we wanted to ask the user to give a threshold for filtering diagrams. I would like to suggest the same solution, which is to have a scaler for point clouds / graphs that ensures the "diameter" of the point cloud is rescaled to 1.
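A minimal sketch of such a scaler for point clouds, under the assumption of Euclidean input (`rescale_to_unit_diameter` is a hypothetical name, not a library function):

```python
import numpy as np

# Hypothetical sketch of the proposed scaler: divide a point cloud by
# its diameter so the largest pairwise distance becomes exactly 1,
# making max_edge_length / infinity_values choices data-independent.
def rescale_to_unit_diameter(points):
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
    return pts / dists.max()

cloud = np.array([[0., 0.], [3., 4.], [1., 1.]])   # diameter = 5
rescaled = rescale_to_unit_diameter(cloud)
new_dists = np.linalg.norm(rescaled[:, None] - rescaled[None], axis=-1)
assert np.isclose(new_dists.max(), 1.0)
```

After such a rescaling, thresholds like max_edge_length or a default infinity value can be chosen as fixed fractions of 1 regardless of the dataset's original scale.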

ammedmar commented 4 years ago

I wrote a comment yesterday after Guillaume's, but I do not see it today. So here it is again: I also agree that "A TDA pipeline initialized with default values should always run". This is probably clear to everyone else, but what breaks if infinities are left in? Anyway, if something unavoidably breaks, I like the idea of using a principled renormalization, and twice the largest possible length as the right endpoint seems pretty good. About rescaling the data itself, I am less sure.

miltminz commented 4 years ago

Hello everybody, since I'm often in contact with @ulupo for testing and suggesting enhancements, he asked me to leave some comments as well. So here is my opinion, as far as I could understand:

Other remarks that may or may not be connected with those above:

I hope it helps!

ulupo commented 4 years ago

@gtauzin the new check_collection introduced in #460 is now employed in CubicalPersistence, thus allowing list input. If I have understood how we intend to move forward, this PR is ready for a final review and merge.

gtauzin commented 4 years ago

@ulupo I will do that now!

@miltminz Thank you very much for your feedback :)

ulupo commented 4 years ago

Something went wrong at GitHub's end and for some reason this was not squashed when merged (despite me trying to do that).

ulupo commented 4 years ago

Actually, worse still, it seems to have been both merged and squashed-and-merged. Pfff what a nightmare. Hopefully the end result is what we want anyway.