Refactor GraphGeodesicDistance

ulupo commented 4 years ago

Types of changes

[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)

Description

There are a couple of major problems with the current implementation of GraphGeodesicDistance:

The return type of transform is always ndarray, even when the distance matrices have different shapes and the resulting ndarray is therefore 1D. In this case, feeding the output to the homology transformers fails because of the current implementation of check_point_cloud; see https://github.com/giotto-ai/giotto-tda/blob/188b6755b7f567a49d0a15cb63492de29a81ab45/gtda/utils/validation.py#L260
The handling of zero entries, infinite entries and non-stored entries in the sparse case was quite opaque, and the code took some non-generic shortcuts that addressed most, but not all, cases.
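
To make the first problem concrete, here is a minimal NumPy-only sketch (it does not use giotto-tda, and the variable names are illustrative assumptions): distance matrices of different sizes can only be packed into a 1D ndarray of objects, whereas matrices of equal size stack into a regular 3D ndarray.

```python
import numpy as np

# Two distance matrices of different sizes (2 and 3 vertices).
matrices = [np.zeros((2, 2)), np.zeros((3, 3))]

# Packing them into a single ndarray is only possible as a 1D array
# with dtype=object, one matrix per entry.
ragged = np.empty(len(matrices), dtype=object)
for i, matrix in enumerate(matrices):
    ragged[i] = matrix

# Matrices of equal size, by contrast, stack into a regular 3D array.
stacked = np.stack([np.zeros((3, 3)), np.zeros((3, 3))])

print(ragged.ndim, ragged.dtype)
print(stacked.shape)
```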

This PR fixes both problems and introduces additional changes. In particular:

The return type of transform is only ndarray if it can be turned into a 3D ndarray, else it is list.
SciPy's shortest_path replaces scikit-learn's graph_shortest_path, because it supports a wider range of algorithms, supports masked arrays, is better maintained, and has "better" behaviour in the sparse case (see below). I have created a gist to exhibit the difference in behaviour and the new behaviour of GraphGeodesicDistance: https://gist.github.com/ulupo/83cc82ce83379ebda8fdfe846d0c06a5.
A particular convention is clearly established:
- if the input arrays are dense, then absent edges must be indicated by numpy.inf;
- zero edges in int or float arrays do not denote absent edges, but edges of length 0;
- on the other hand, False edges in Boolean arrays do denote absent edges;
- in the sparse case, non-stored values are interpreted as absent edges.
The additional parameters directed, unweighted and method are made available, with the same meanings as the corresponding parameters of SciPy's shortest_path.
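
The convention above can be sketched directly with SciPy. This is an illustration under stated assumptions, not the code in this PR: `mask_absent_edges` is a hypothetical helper, and Dijkstra (`method="D"`) is chosen explicitly so that zero-length edges are respected.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path


def mask_absent_edges(X):
    """Hypothetical helper: mask the entries that denote absent edges."""
    X = np.asarray(X)
    if X.dtype == bool:
        # Boolean input: False means "no edge", True is an edge of length 1.
        return np.ma.masked_array(X.astype(float), mask=~X)
    # int/float input: only inf (or NaN) means "no edge"; zeros are kept.
    return np.ma.masked_invalid(X.astype(float))


inf = np.inf
# Dense float input: edge 0-1 has length 0, edge 1-2 has length 2,
# and there is no edge between 0 and 2.
dense = np.array([[0., 0., inf],
                  [0., 0., 2.],
                  [inf, 2., 0.]])
dist_dense = shortest_path(mask_absent_edges(dense), method="D",
                           directed=False)

# Boolean input: a path graph 0-1-2.
adjacency = np.array([[False, True, False],
                      [True, False, True],
                      [False, True, False]])
dist_bool = shortest_path(mask_absent_edges(adjacency), method="D",
                          directed=False)

# Sparse input: stored zeros are zero-length edges, and entries that are
# not stored are absent edges.
rows, cols = [0, 1, 1, 2], [1, 0, 2, 1]
weights = [0., 0., 2., 2.]
sparse = csr_matrix((weights, (rows, cols)), shape=(3, 3))
dist_sparse = shortest_path(sparse, method="D", directed=False)
```

In this sketch the dense and sparse inputs describe the same graph, so they should produce the same distance matrix.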

Checklist

[x] I have read the guidelines for contributing.
[x] My code follows the code style of this project. I used flake8 to check my Python changes.
[x] My change requires a change to the documentation.
[x] I have updated the documentation accordingly.
[x] I have added tests to cover my changes.
[x] All new and existing tests passed. I used pytest to check this on Python tests.

CLAassistant commented 4 years ago

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

:white_check_mark: ulupo
:x: Umberto

Umberto seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

ulupo commented 4 years ago

@wreise as per our discussion, there is a SciPy inconsistency which I have now reported in https://github.com/scipy/scipy/issues/12424. It is basically impossible to obtain the desired results using the Floyd-Warshall algorithm (option 'FW', which could also be selected when method='auto') when some edges have zero weight. In 50b6c1b, I introduced a check for this which overrides the user's selection if necessary and warns the user of the situation.
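
A check of this kind could look roughly like the sketch below. It is not the code from 50b6c1b: the function name `choose_method` and the fallback rule (Johnson if there are negative weights, Dijkstra otherwise) are assumptions made for illustration. Only the method codes 'auto', 'FW', 'D' and 'J' come from SciPy's shortest_path.

```python
import warnings

import numpy as np


def choose_method(X, method="auto"):
    """Hypothetical check: avoid Floyd-Warshall when zero-weight edges exist.

    X is a dense square array in which absent edges are marked by inf.
    """
    off_diagonal = ~np.eye(X.shape[0], dtype=bool)
    has_zero_edge = np.any((X == 0) & off_diagonal)
    if has_zero_edge and method in ("auto", "FW"):
        # Assumed fallback: Johnson copes with negative weights,
        # Dijkstra is enough otherwise.
        fallback = "J" if np.any(X < 0) else "D"
        warnings.warn(
            f"Zero-weight edges found; using method={fallback!r} "
            f"instead of {method!r}."
        )
        return fallback
    return method


with_zero_edge = np.array([[0., 0., np.inf],
                           [0., 0., 2.],
                           [np.inf, 2., 0.]])
without_zero_edge = np.array([[0., 1.],
                              [1., 0.]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    overridden = choose_method(with_zero_edge, method="FW")
    untouched = choose_method(without_zero_edge, method="FW")
```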

Note that the test ground truths were incorrect (!): if one node has zero distance from every other node, then all nodes have zero distance from all other nodes. I have fixed this.
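
The claim about the ground truths follows from the triangle inequality: d(i, j) ≤ d(i, 0) + d(0, j) = 0 when node 0 is at distance zero from every other node. A small sketch of this in the undirected case, using an illustrative star graph that is not taken from the test suite:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

n = 4
# Star graph: node 0 is joined to every other node by a zero-length edge,
# and no other edges exist (inf marks an absent edge).
X = np.full((n, n), np.inf)
X[0, :] = 0.0
X[:, 0] = 0.0
np.fill_diagonal(X, 0.0)

# Mask the absent edges so that the zeros are read as zero-length edges,
# and use Dijkstra to sidestep the Floyd-Warshall issue mentioned above.
dist = shortest_path(np.ma.masked_invalid(X), method="D", directed=False)
```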

ulupo commented 4 years ago

@wreise I've implemented essentially all the tests in the above gist as unit tests. All algorithms are also tested to give the same results.
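
For illustration, an agreement check between algorithms might be sketched as follows. This is not one of the unit tests in the PR; the graph is an assumption chosen to have strictly positive weights, so the zero-weight issue does not arise and Floyd-Warshall can be included.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

inf = np.inf
# Undirected graph: a path 0-1-2-3 with weights 1, 2, 3, plus a direct
# edge 0-3 of weight 10. Absent edges are marked by inf.
X = np.array([[0., 1., inf, 10.],
              [1., 0., 2., inf],
              [inf, 2., 0., 3.],
              [10., inf, 3., 0.]])

# Run every algorithm offered by shortest_path on the same input.
results = {
    method: shortest_path(X, method=method, directed=False)
    for method in ("FW", "D", "BF", "J")
}
reference = results["FW"]
all_agree = all(np.allclose(reference, dist) for dist in results.values())
```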

giotto-ai / giotto-tda

Refactor GraphGeodesicDistance #422