Closed: ulupo closed this 3 years ago
@lewtun @wreise just asking for thoughts on this idea. The documentation is not yet updated and neither are tests, so don't expect the CI to pass yet.
This is a great idea @ulupo! I have a few tight deadlines right now, but will be able to look at this over the weekend.
It took me a while to remind myself of the logic and check what you propose. I think I agree, though. As you mention, tests will need to be updated.
@wreise @lewtun I have now updated the docstrings and fixed the tests. @lewtun yes, I have checked that the outputs are the same, but independent checking should be done too.
One absolutely insane problem I met was joblib-related and caused second runs of `fit_transform` in `ParallelClustering` with the default backend to be much, much slower than the first. It seems to be caused by some obscure reference-count problem (because one needs to overwrite the `labels_` attribute). This is now patched by https://github.com/ulupo/giotto-tda/blob/439cbc64132a76b5f548977ffd522371a1234576/gtda/mapper/cluster.py#L146 and was also there before, but less noticeable for some reason. It seems to point to a bug in joblib, and we should probably open an issue there...
@lewtun @wreise I'm thinking that this is ready to merge (we currently have issues with the `manylinux` CI, but they are unrelated to this PR and are being solved in parallel). Do you have any objections? I think I addressed all comments in the reviews.
There is another breaking change that I forgot to mention; I have now added it to the PR description as "Another breaking change". Let me know if you agree/disagree.
@lewtun let me know if you still want to review or are happy with @wreise's pass.
Types of changes
Description

The `ParallelClustering` transformer (technically, a metaestimator) is used in Mapper pipelines as a second-to-last step, just before `Nerve`. As things currently stand, the outputs of `ParallelClustering.fit_transform` are in a rather exotic form, namely lists of lists of triples. These outputs perhaps make the graph construction a little simpler, because they neatly present the final Mapper nodes. However, they are very arcane if one wants to use Mapper without the final `Nerve` step, as is made possible by passing `graph_step=False` to `make_mapper_pipeline`.

In this PR I propose that the output of `ParallelClustering` should be more like the output of any clustering algorithm in `scikit-learn`. In particular, it should be an array of the same length as the number of samples in the input, with each entry denoting something closely corresponding to a "cluster label". Of course, for Mapper there is in general more than one cluster per sample, so I propose that the output should be a 1D object-type array in which each entry is the tuple of cluster identifiers corresponding to that sample. Since this is Mapper, it makes sense to identify a single cluster via a pair `(pullback_set_label, relative_cluster_label)`. Hence, each entry of the final 1D object array would be a tuple of such pairs.

This is a major breaking change to `ParallelClustering` (output), `Nerve` (input), and `make_mapper_pipeline` when `graph_step=False`. Another consequence is that the global node IDs in the final graphs are no longer ordered "lexicographically", i.e. with a global node ID less than or equal to another if and only if the first pullback set label is less than or equal to the second.

Another breaking change. The fitted single clusterers are no longer stored in the `clusterers_` attribute of the fitted `ParallelClustering` object. `clusterers_` was initially designed thinking ahead to a time when a `transform` method for new data would be available. However, as we are today as close to that target as we were then, I propose we remove it until the need for it, and the design for its use, become clearer.

Checklist
- I have run `flake8` to check my Python changes.
- I have run `pytest` to check this on Python tests.
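Purely as an illustration of the format proposed in the description above (the labels below are made up and this is not code from the PR), here is a minimal numpy sketch of what such a per-sample object array might look like, together with the simple inversion a `Nerve`-like step could perform to recover the Mapper nodes:

```python
import numpy as np
from collections import defaultdict

# Hypothetical illustration of the proposed output format (labels made up).
# Each entry is a tuple of (pullback_set_label, relative_cluster_label) pairs,
# one pair per Mapper cluster containing that sample.
labels = np.empty(4, dtype=object)
labels[0] = ((0, 0),)             # sample 0: cluster 0 of pullback set 0
labels[1] = ((0, 0), (1, 1))      # sample 1: lies in an overlap, two clusters
labels[2] = ((1, 0),)
labels[3] = ((1, 1), (2, 0))

# One entry per sample, unlike the old lists-of-lists-of-triples output.
assert labels.shape == (4,)
assert all(isinstance(entry, tuple) for entry in labels)

# Recovering the Mapper nodes (cluster -> member samples) is a simple inversion.
nodes = defaultdict(list)
for i, entry in enumerate(labels):
    for pair in entry:
        nodes[pair].append(i)
print(dict(nodes))  # {(0, 0): [0, 1], (1, 1): [1, 3], (1, 0): [2], (2, 0): [3]}
```

Note how sample 1's membership in two clusters is expressed directly in its entry, rather than being spread across nested lists.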