If I am not wrong, the benefit of sorting the array was to be able to free as much memory as possible. The diagrams data structure is already extremely heavy, as it is padded and includes the dimension for each point.

You may want to try to `argsort` by persistence instead and only keep the necessary points, so that you get a minimally padded array.
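For concreteness, a minimal sketch of this idea (the function name and details, such as the descending `argsort` on lifetimes and the zeroing-out of sub-threshold points, are illustrative assumptions, not code from this repository):

```python
import numpy as np

def filter_by_argsort(X, epsilon):
    # Hypothetical sketch of the argsort-by-persistence idea; X has shape
    # (n_samples, n_points, 3) with triples (birth, death, homology dimension)
    lifetimes = X[:, :, 1] - X[:, :, 0]
    # Sort each diagram so that the most persistent points come first
    order = np.argsort(-lifetimes, axis=1)
    X_sorted = np.take_along_axis(X, order[:, :, None], axis=1)
    lifetimes_sorted = np.take_along_axis(lifetimes, order, axis=1)
    # Keep only as many columns as the largest number of surviving points,
    # so that the output array is minimally padded
    n_keep = int((lifetimes_sorted > epsilon).sum(axis=1).max())
    X_out = X_sorted[:, :n_keep].copy()
    # Turn any remaining sub-threshold points into (0., 0., dim) padding
    X_out[lifetimes_sorted[:, :n_keep] <= epsilon, :2] = 0.
    return X_out
```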
Thanks @gtauzin! Indeed I was hasty and did not think carefully enough about it before pushing. I have now pushed an alternative method which achieves the same reduction in size as before, but performs fewer calculations than the `argsort`-based approach, resulting in better performance relative to the current implementation. For instance, generating fake input of shape `(10000, 2000, 3)` where, for each entry, 1000 persistence pairs are in dimension 0 and 1000 are in dimension 1, I observe a 5x speedup when averaging over various values of the cutoff.
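Fake input of that shape can be generated along these lines (the random lifetimes, seed and cutoff values below are illustrative assumptions, not the exact benchmark script):

```python
import time
import numpy as np
from gtda.diagrams import Filtering

rng = np.random.default_rng(42)
n_samples, n_points = 10000, 2000  # ~0.5 GB array; shrink if memory is tight

births = rng.random((n_samples, n_points))
deaths = births + rng.random((n_samples, n_points))  # guarantees death >= birth
# First 1000 points of each diagram in dimension 0, last 1000 in dimension 1
dims = np.repeat([[0., 1.]], n_samples, axis=0).repeat(n_points // 2, axis=1)
X = np.stack([births, deaths, dims], axis=2)  # shape (10000, 2000, 3)

for epsilon in [0.1, 0.3, 0.5]:  # illustrative cutoff values
    start = time.perf_counter()
    Filtering(epsilon=epsilon).fit_transform(X)
    print(f"epsilon={epsilon}: {time.perf_counter() - start:.2f}s")
```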
UPDATE: The performance figures above are incorrect; the savings are there but are much more modest (roughly a 40% speedup in the previous example). Profiling the code shows that `argsort` does indeed become more and more burdensome as the size of the data grows larger. The reason the net gains are not more spectacular is that `argsort` allows the use of simple array slices downstream, which I believe is faster than the indexing made necessary by the current approach.
Still, I advocate for the new approach at least for the following reason: a persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array. This makes debugging/visual comparison easier and makes the output look just like the output of a modified (imaginary) version of the PH algorithm which simply did not record pairs without sufficiently large persistence.
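In code, the point is simply that boolean-mask filtering preserves relative order (toy data, columns `(birth, death)`):

```python
import numpy as np

diagram = np.array([[0., 3.], [1., 1.5], [0.5, 4.]])
lifetimes = diagram[:, 1] - diagram[:, 0]
# Surviving pairs appear in the same relative order as in the input
print(diagram[lifetimes > 1.])  # [[0.  3. ], [0.5 4. ]]
```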
Thanks @gtauzin!
Concerning #91, the offending line says: "all diagrams in the output array will have as many points as the maximum number of points in any original array whose death is not equal to zero". I believe this line has a strong bias towards homology dimension 0 and the interplay with `VietorisRipsPersistence`. That's because, in that case, we pretty much know that padding will consist of entries of the form `(0., 0., 0.)`. However, this is clearly inefficient already in e.g. homology dimension 1: there, one does not expect padding by `(0., 0., 1.)`, because typically no persistence pair in dimension 1 is born at 0. To illustrate this, take the following example:
```python
import numpy as np
from gtda.diagrams import Filtering

X = np.array([[[0.5, 0.5, 1.], [0.5, 1., 1.]]])
Filtering(epsilon=0.4).fit_transform(X)
```
Clearly one would like the output to be

```
array([[[0.5, 1., 1.]]])
```
which is the case in the proposed implementation. But the current implementation returns `X` unchanged, because the padding pair is not recognised as such by the code.

You also asked about cases in which we introduce more problematic errors: take something like
```python
import numpy as np
from gtda.diagrams import Filtering

X = np.array([[[-0.5, 0., 0.], [0., 1., 0.]]])
Filtering(epsilon=0.4).fit_transform(X)
```
The current code incorrectly returns

```
array([[[0., 1., 0.]]])
```

because the first persistence triple has death equal to zero, and hence is erased. The proposed solution returns `X` unchanged, as it should.
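To make the difference between the two heuristics concrete, here is a paraphrase in code of the two behaviours on this example (my reading of the logic, not the actual code of either implementation):

```python
import numpy as np

X = np.array([[[-0.5, 0., 0.], [0., 1., 0.]]])
epsilon = 0.4

# Current heuristic (paraphrased): any triple with death == 0 is treated as
# padding and dropped, so the genuine pair (-0.5, 0., 0.), whose lifetime
# 0.5 exceeds epsilon, is wrongly discarded
is_padding = X[0, :, 1] == 0.
print(X[0][~is_padding])           # [[0. 1. 0.]] -- wrong

# Proposed logic (paraphrased): filter by lifetime alone
lifetimes = X[0, :, 1] - X[0, :, 0]
print(X[0][lifetimes > epsilon])   # both pairs kept, X unchanged
```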
`_multirange` has a very simple interpretation and, as I write in its docstring, it could have been implemented by an easy-to-read one-liner.
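Assuming `_multirange(counts)` produces the concatenation of `np.arange(c)` over the entries `c` of `counts` (my reading of the docstring), such a one-liner might look like:

```python
import numpy as np

def _multirange_oneliner(counts):
    # Easy-to-read (but slower, more memory-hungry) equivalent of _multirange,
    # assuming it concatenates one range per count
    return np.concatenate([np.arange(c) for c in counts])

print(_multirange_oneliner([3, 1, 2]))  # [0 1 2 0 0 1]
```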
Concerning the main function (which has grown by 5 lines), let me have a go at adding inline comments to guide the reader through the logic in a transparent way.

Let me, however, stress what I now think is actually my main reason for suggesting these changes (beyond the need to fix #91 and the non-earth-shattering performance improvements). I already articulated this point above, but let me quote myself for convenience:
> Still, I advocate for the new approach at least for the following reason: a persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array. This makes debugging/visual comparison easier and makes the output look just like the output of a modified (imaginary) version of the PH algorithm which simply did not record pairs without sufficiently large persistence.
@gtauzin I have now added extensive inline comments to guide the reader through the new logic. Let me know what you think!
**Reference issues/PRs**

Fixes #91. No assumptions are made on the nature of birth/death values beyond the fact that death >= birth.
**Types of changes**
**Description**

Arrays are no longer sorted by lifetime using the `_sort` utility function before calling `_filter` in `Filtering`'s `transform`. Since `_sort` was not needed anywhere else, it has been removed completely from the codebase. The fact that no sorting is made before filtering means that the outputs of `Filtering` are now closer to the inputs in the following sense:

> A persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array.
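A sketch of this behaviour, with a hypothetical name and zero-padding only for brevity (the actual `_filter` differs in its details):

```python
import numpy as np

def filter_preserving_order(X, epsilon):
    # Hypothetical sketch, not the code in this PR: keep pairs with
    # lifetime > epsilon in their original order, then right-pad each
    # diagram to the minimal common width (zeros used here for brevity)
    lifetimes = X[:, :, 1] - X[:, :, 0]
    mask = lifetimes > epsilon
    n_keep = int(mask.sum(axis=1).max())
    X_out = np.zeros((X.shape[0], n_keep, X.shape[2]))
    for i, (diagram, keep) in enumerate(zip(X, mask)):
        kept = diagram[keep]  # boolean masking preserves relative order
        X_out[i, :len(kept)] = kept
    return X_out
```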
**Checklist**

- I have run `flake8` to check my Python changes.
- I have run `pytest` to check this on the Python tests.