giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs
Other
845 stars 173 forks source link

Add bindings for collapser #469

Closed reds-heig closed 4 years ago

reds-heig commented 4 years ago

Reference issues/PRs

Types of changes

Description

Add bindings for collapser from Gudhi new feature.

Screenshots (if appropriate)

Any other comments?

Currently work in progress, there's a bug on the size of the matrix returns by the C++ in python.

Checklist

CLAassistant commented 4 years ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

:white_check_mark: ulupo
:x: MonkeyBreaker
You have signed the CLA already but the status is still pending? Let us recheck it.

ulupo commented 4 years ago

@MonkeyBreaker I have hacked your code to enable edge collapsing with the ripser python interface. I'm sure the C++ is totally stupid, I just hacked a solution which compiled and seems to give the right results.

You can try this in python:

import numpy as np
from gtda.externals import ripser

X = np.random.random((100, 100))
np.fill_diagonal(X, 0)  # Without this line, there might be different results because GUDHI assumes zeros in the diagonal

maxdim = 2

diags_collapsed = ripser(X, metric='precomputed', maxdim=maxdim, collapse_edges=True)['dgms']
diags_not_collapsed = ripser(X, metric='precomputed', maxdim=maxdim, collapse_edges=False)['dgms']

thresh = 0.1
diags_collapsed_thresh = ripser(X, metric='precomputed', maxdim=maxdim, thresh=thresh, collapse_edges=True)['dgms']
diags_not_collapsed_thresh = ripser(X, metric='precomputed', maxdim=maxdim, thresh=thresh, collapse_edges=False)['dgms']

for i in range(2):
    assert np.array_equal(diags_collapsed[i], diags_not_collapsed[i])
    assert np.array_equal(diags_collapsed_thresh[i], diags_not_collapsed_thresh[i])

For now, I do not see substantial speedups and in fact I only see slow-downs. It would be good to see if you can reproduce the performance gains reported at the end of the paper https://hal.inria.fr/hal-02395227/document.

Another point: The thresholding is done in python at the moment, but I guess ideally it would be done in C++. I think this means that we would have to call the edge collapse C++ function inside ripser.cpp, but I'm not sure. Notice that to reproduce the results from the paper you will need to use the thresh parameter.

ulupo commented 4 years ago

@MonkeyBreaker to test the bindings you had originally written (and which I "fixed") on the simple GUDHI example, you can do:

from gtda.externals.modules.gtda_collapser import flag_complex_collapse_edges_sparse, flag_complex_collapse_edges_dense
from scipy.sparse import coo_matrix

# Sparse case
X = coo_matrix(([1., 1., 1., 1., 2., 2.], ([0, 1, 2, 3, 0, 1], [1, 2, 3, 0, 2, 3])))
collapsed_edges = flag_complex_collapse_edges_sparse(X)
assert collapsed_edges[1, 3] == 0.

# Dense case
X = X.toarray()
collapsed_edges = flag_complex_collapse_edges_dense(X)
assert collapsed_edges[1, 3] == 0.
MonkeyBreaker commented 4 years ago

@ulupo bindings updated and test added

MonkeyBreaker commented 4 years ago

@ulupo seems good to me, on my machine everything runs smoothly :)

I don't know if you're planning on adding more on this PR. But I was thinking on maybe disabling the computation of cocycles is we don't return them at the moment, it will make computation faster. Or maybe we plan to use them in the future ?

ulupo commented 4 years ago

@MonkeyBreaker I just added a regression test for #465. For me this would be enough to close the PR, but since you raise a good point...

But I was thinking on maybe disabling the computation of cocycles is we don't return them at the moment, it will make computation faster.

We don't really have a good plan for cocycles yet but I'd say this is more because of lack of dedicated research time than because we thought about it and decided they're not useful. Would there be a middle way? Such as producing two separate bindings, and deciding which one to call at runtime?

In any case, do you think it would be OK to do this in another PR to get less confused?

Notice that I have added checks and warnings for when any diagonal entry is non-zero (in either sparse or dense input). I ran some tests and I was getting inconsistent results when collapsing edges, so I think that the edge collapser can give incorrect results if run on input with non-zero diagonal entries. This is why I had to introduce extra checks and changed a little the way in which the lexsort flags are used. Welcome to double check!

MonkeyBreaker commented 4 years ago

Sure, for me if all test pass, we can let the PR as if and merge it :)

About the cocycles, the "issue" is not related to the compilation it's an input runtime parameter do_cocycles. by default it's enable => meaning that cocycles are computed, but in DRFDM and DRFDMSparse they are not even retrieved after calling ripser.

Which currently means that because by default the we enable the computation of cocycles we waste a bit of runtime on it. The change I was thinking about is until we decide what to do with the cocycles, update as follow:

res = DRFDM(DParam, maxdim, thresh, coeff, do_cocycles=0)
// and
res = DRFDMSparse(
             row.astype(dtype=np.int32, order="C"),
             ...
             coeff,
             do_cocycles=0
             )

When all test passes on the pipeline I'll check locally if it's good for you ?

ulupo commented 4 years ago

@MonkeyBreaker oh ok, it's really simple then, just put a zero when the functions are called inside ripser. We can slot this into the current PR then I think!