Change ParallelClustering to output an array of labels per sample, closer to scikit-learn convention

ulupo commented 4 years ago

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)

Description The ParallelClustering transformer (technically, a metaestimator) is used in mapper pipelines as a second-to-last step, just before Nerve. As things are currently, the outputs of ParallelClustering.fit_transform are in a rather exotic form, i.e. lists of lists of triples, as in e.g.

[
    [(pullback_set_label_0, relative_cluster_label_00, node_elements_00),
     (pullback_set_label_0, relative_cluster_label_01, node_elements_01), ...],
    [(pullback_set_label_1, relative_cluster_label_10, node_elements_10),
     (pullback_set_label_1, relative_cluster_label_11, node_elements_11), ...],
    ...
]

These outputs make the graph construction a little simpler perhaps because they neatly present the final Mapper nodes. However, they seem to be very arcane if one is to use Mapper without the final Nerve step, as is made possible by passing graph_step=False in make_mapper_pipeline.

In this PR I propose that the output of ParallelClustering should be more like the output of any clustering algorithm in scikit-learn. In particular, it should return an array of the same length as the number of samples in the input, with each entry in the array denoting something closely corresponding to a "cluster label". Of course, for Mapper there is more than one cluster per sample in general, so I propose that the output should be a 1D object-type array where each entry is the tuple of cluster identifiers corresponding to that sample. Since this is Mapper, it makes sense to identify a single cluster via a pair (pullback_set_label, relative_cluster label). Hence, the final 1D object array would look like

array([
    ((pullback_set_label_i, relative_cluster_label_ia), (pullback_set_label_j, relative_cluster_label_jb)),
    ((pullback_set_label_k, relative_cluster_label_kc), (pullback_set_label_l, relative_cluster_label_ld)),
    ...
])

This is a major breaking change to ParallelClustering (output), Nerve (input), and make_mapper_pipeline when graph_step=False. Another consequence is that the global node IDs in the final graphs are no longer ordered "lexicographically" with a global node ID being less than or equal to another if and only if the first pullback set label is less than or equal to the second one.

Another breaking change. The fitted single clusterers are no longer stored in the clusterers_ attribute of the fitted ParallelClustering object. clusterers_ was initially designed thinking ahead to a time when a transform method for new data would be available. However, as we are today as close to that target as we were then, I propose we remove this until the need for it and the design for its use becomes clearer.

Checklist

[x] I have read the guidelines for contributing.
[x] My code follows the code style of this project. I used flake8 to check my Python changes.
[x] My change requires a change to the documentation.
[x] I have updated the documentation accordingly.
[x] I have added tests to cover my changes.
[x] All new and existing tests passed. I used pytest to check this on Python tests.

ulupo commented 4 years ago

@lewtun @wreise just asking for thoughts on this idea. The documentation is not yet updated and neither are tests, so don't expect the CI to pass yet.

lewtun commented 3 years ago

this is a great idea @ulupo ! i have a few tight deadlines right now, but will be able to look at this on the weekend

wreise commented 3 years ago

It took me a while to remind myself the logic and check what you propose. I think i agree though. As you mention, tests will need to be updated.

ulupo commented 3 years ago

@wreise @lewtun I have now update the docstrings and fixed the tests. @lewtun yes I have checked that outputs are the same, but independent checking should be done too.

One absolutely insane problem I met was joblib related and caused second runs of fit_transform in ParallelClustering with the default backend to be much much slower than the first. It seems to be caused by some obscure reference count problem (because one needs to overwrite the labels_ attribute). This is now patched by https://github.com/ulupo/giotto-tda/blob/439cbc64132a76b5f548977ffd522371a1234576/gtda/mapper/cluster.py#L146 and was also there before, but less noticeable for some reason. It seems to point to a bug in joblib and we should probably open an issue there...

ulupo commented 3 years ago

@lewtun @wreise I'm thinking that this is ready to merge (we currently have issues with the manylinux CI but they are unrelated to this PR and being solved in parallel). Do you have any objections? I think I addressed all comments in the reviews.

There is another breaking change that I forgot to mention, I have now added it to the PR description as "Another breaking change". Let me know if you agree/disagree.

ulupo commented 3 years ago

@lewtun let me know if you still want to review or are happy with @wreise's pass.

giotto-ai / giotto-tda

Change ParallelClustering to output an array of labels per sample, closer to scikit-learn convention #535