If I am not wrong, the benefit of sorting the array was to be able to free as much memory as possible. The diagrams data structure is already extremely heavy, as it is padded and includes the dimension for each point.

You may want to try to `argsort` by persistence instead and only keep the necessary points, so that you get a minimally padded array.
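For concreteness, a minimal sketch of this idea (the function name and details, such as the descending `argsort` on lifetimes and the zeroing-out of sub-threshold points, are illustrative assumptions, not code from this repository):

```python
import numpy as np

def filter_by_argsort(X, epsilon):
    # Hypothetical sketch of the argsort-by-persistence idea; X has shape
    # (n_samples, n_points, 3) with triples (birth, death, homology dimension)
    lifetimes = X[:, :, 1] - X[:, :, 0]
    # Sort each diagram so that the most persistent points come first
    order = np.argsort(-lifetimes, axis=1)
    X_sorted = np.take_along_axis(X, order[:, :, None], axis=1)
    lifetimes_sorted = np.take_along_axis(lifetimes, order, axis=1)
    # Keep only as many columns as the largest number of surviving points,
    # so that the output array is minimally padded
    n_keep = int((lifetimes_sorted > epsilon).sum(axis=1).max())
    X_out = X_sorted[:, :n_keep].copy()
    # Turn any remaining sub-threshold points into (0., 0., dim) padding
    X_out[lifetimes_sorted[:, :n_keep] <= epsilon, :2] = 0.
    return X_out
```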
Thanks @gtauzin! Indeed I was hasty and did not think carefully enough about it before pushing. I have now pushed an alternative method which achieves the same reduction in size as before, but performs fewer calculations than the `argsort`-based approach, resulting in better performance relative to the current implementation. For instance, generating fake input of shape `(10000, 2000, 3)` where, for each entry, 1000 persistence pairs are in dimension 0 and 1000 are in dimension 1, I observe a 5x speedup when averaging over various values of the cutoff.
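Fake input of that shape can be generated along these lines (the random lifetimes, seed and cutoff values below are illustrative assumptions, not the exact benchmark script):

```python
import time
import numpy as np
from gtda.diagrams import Filtering

rng = np.random.default_rng(42)
n_samples, n_points = 10000, 2000  # ~0.5 GB array; shrink if memory is tight

births = rng.random((n_samples, n_points))
deaths = births + rng.random((n_samples, n_points))  # guarantees death >= birth
# First 1000 points of each diagram in dimension 0, last 1000 in dimension 1
dims = np.repeat([[0., 1.]], n_samples, axis=0).repeat(n_points // 2, axis=1)
X = np.stack([births, deaths, dims], axis=2)  # shape (10000, 2000, 3)

for epsilon in [0.1, 0.3, 0.5]:  # illustrative cutoff values
    start = time.perf_counter()
    Filtering(epsilon=epsilon).fit_transform(X)
    print(f"epsilon={epsilon}: {time.perf_counter() - start:.2f}s")
```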
UPDATE: The performance figures above are incorrect; the savings are there but are much more modest (roughly a 40% speedup in the previous example). Profiling the code shows that `argsort` does indeed become more and more burdensome as the size of the data grows larger. The reason the net gains are not more spectacular is that `argsort` allows the use of simple array slices downstream, which I believe is faster than the indexing made necessary by the current approach.
Still, I advocate for the new approach at least for the following reason: a persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array. This makes debugging/visual comparison easier and makes the output look just like the output of a modified (imaginary) version of the PH algorithm which simply did not record pairs without sufficiently large persistence.
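In code, the point is simply that boolean-mask filtering preserves relative order (toy data, columns `(birth, death)`):

```python
import numpy as np

diagram = np.array([[0., 3.], [1., 1.5], [0.5, 4.]])
lifetimes = diagram[:, 1] - diagram[:, 0]
# Surviving pairs appear in the same relative order as in the input
print(diagram[lifetimes > 1.])  # [[0.  3. ], [0.5 4. ]]
```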
Thanks @gtauzin!
Concerning #91, the offending line says: "all diagrams in the output array will have as many points as the maximum number of points in any original array whose death is not equal to zero". I believe this line has a strong bias towards homology dimension 0 and the interplay with `VietorisRipsPersistence`. That's because, in that case, we pretty much know that padding will consist of entries of the form `(0., 0., 0.)`. However, this is clearly inefficient already in e.g. homology dimension 1: there, one does not expect padding by `(0., 0., 1.)`, because typically no persistence pair in dimension 1 is born at 0. To illustrate this, take the following example:
```python
import numpy as np
from gtda.diagrams import Filtering

X = np.array([[[0.5, 0.5, 1.], [0.5, 1., 1.]]])
Filtering(epsilon=0.4).fit_transform(X)
```
Clearly one would like the output to be

```
array([[[0.5, 1., 1.]]])
```
which is the case in the proposed implementation. But the current implementation returns `X` unchanged, because the padding pair is not recognised as such by the code.

You also asked about cases in which we introduce more problematic errors: take something like
```python
import numpy as np
from gtda.diagrams import Filtering

X = np.array([[[-0.5, 0., 0.], [0., 1., 0.]]])
Filtering(epsilon=0.4).fit_transform(X)
```
The current code incorrectly returns

```
array([[[0., 1., 0.]]])
```

because the first persistence triple has death equal to zero, and hence is erased. The proposed solution returns `X` unchanged, as it should.
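To make the difference between the two heuristics concrete, here is a paraphrase in code of the two behaviours on this example (my reading of the logic, not the actual code of either implementation):

```python
import numpy as np

X = np.array([[[-0.5, 0., 0.], [0., 1., 0.]]])
epsilon = 0.4

# Current heuristic (paraphrased): any triple with death == 0 is treated as
# padding and dropped, so the genuine pair (-0.5, 0., 0.), whose lifetime
# 0.5 exceeds epsilon, is wrongly discarded
is_padding = X[0, :, 1] == 0.
print(X[0][~is_padding])           # [[0. 1. 0.]] -- wrong

# Proposed logic (paraphrased): filter by lifetime alone
lifetimes = X[0, :, 1] - X[0, :, 0]
print(X[0][lifetimes > epsilon])   # both pairs kept, X unchanged
```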
`_multirange` has a very simple interpretation and, as I write in its docstring, it could have been implemented by an easy-to-read one-liner.
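Assuming `_multirange(counts)` produces the concatenation of `np.arange(c)` over the entries `c` of `counts` (my reading of the docstring), such a one-liner might look like:

```python
import numpy as np

def _multirange_oneliner(counts):
    # Easy-to-read (but slower, more memory-hungry) equivalent of _multirange,
    # assuming it concatenates one range per count
    return np.concatenate([np.arange(c) for c in counts])

print(_multirange_oneliner([3, 1, 2]))  # [0 1 2 0 0 1]
```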
Concerning the main function (which has grown by 5 lines), let me have a go at adding inline comments to guide the reader through the logic in a transparent way.

Let me, however, stress what I now think is actually my main reason for suggesting these changes (beyond the need to fix #91 and the non-earth-shattering performance improvements). I already articulated this point above, but let me quote myself for convenience:
> Still, I advocate for the new approach at least for the following reason: a persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array. This makes debugging/visual comparison easier and makes the output look just like the output of a modified (imaginary) version of the PH algorithm which simply did not record pairs without sufficiently large persistence.
@gtauzin I have now added extensive inline comments to guide the reader through the new logic. Let me know what you think!
**Reference issues/PRs**

Fixes #91. No assumptions are made on the nature of birth/death values beyond the fact that death >= birth.
**Types of changes**
**Description**

Arrays are no longer sorted by lifetime using the `_sort` utility function before calling `_filter` in `Filtering`'s `transform`. Since `_sort` was not needed anywhere else, it has been removed completely from the codebase. The fact that no sorting is made before filtering means that the outputs of `Filtering` are now closer to the inputs in the following sense:

> A persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array.
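A sketch of this behaviour, with a hypothetical name and zero-padding only for brevity (the actual `_filter` differs in its details):

```python
import numpy as np

def filter_preserving_order(X, epsilon):
    # Hypothetical sketch, not the code in this PR: keep pairs with
    # lifetime > epsilon in their original order, then right-pad each
    # diagram to the minimal common width (zeros used here for brevity)
    lifetimes = X[:, :, 1] - X[:, :, 0]
    mask = lifetimes > epsilon
    n_keep = int(mask.sum(axis=1).max())
    X_out = np.zeros((X.shape[0], n_keep, X.shape[2]))
    for i, (diagram, keep) in enumerate(zip(X, mask)):
        kept = diagram[keep]  # boolean masking preserves relative order
        X_out[i, :len(kept)] = kept
    return X_out
```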
**Checklist**

- I have run `flake8` to check my Python changes.
- I have run `pytest` to check this on the Python tests.