giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs
Other
858 stars 175 forks

Remove incorrect assumptions in Filtering #436

Closed ulupo closed 4 years ago

ulupo commented 4 years ago

Reference issues/PRs

Fixes #91. No assumptions are made on the nature of birth/death values beyond the fact that birth <= death.

Types of changes

Description

Arrays are no longer sorted by lifetime using the _sort utility function before calling _filter in Filtering's transform. Since _sort was not needed anywhere else, it has been removed from the codebase entirely. Because no sorting is performed before filtering, the outputs of Filtering are now closer to the inputs in the following sense:

Checklist

gtauzin commented 4 years ago

If I am not wrong, the benefit of sorting the array was to be able to free as much memory as possible. The diagrams data structure is already extremely heavy as it is padded and includes the dimensions for each point.

You may want to try to argsort by persistence instead and keep only the necessary points, so that you get a minimally padded array.
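For readers following along, the argsort idea can be sketched roughly as below. This is only an illustration, not the giotto-tda code: the helper name filter_by_argsort is hypothetical, homology dimensions are ignored (every point is treated alike), and padding with zeros is a simplifying assumption.

```python
import numpy as np


def filter_by_argsort(X, epsilon):
    """Hypothetical sketch: sort points by persistence, then slice.

    X has shape (n_samples, n_points, 3) with columns (birth, death, dim).
    Homology dimensions are ignored here, which is a simplification.
    """
    lifetimes = X[:, :, 1] - X[:, :, 0]
    # Indices that sort each diagram by decreasing persistence.
    order = np.argsort(-lifetimes, axis=1, kind="stable")
    X_sorted = np.take_along_axis(X, order[:, :, None], axis=1)
    # Largest number of surviving points in any single diagram.
    n_keep = int((lifetimes > epsilon).sum(axis=1).max())
    # A plain slice is enough once the points are sorted.
    Xf = X_sorted[:, :n_keep].copy()
    # Points at or below the cutoff become padding (zeros, by assumption).
    short = (Xf[:, :, 1] - Xf[:, :, 0]) <= epsilon
    Xf[short, :2] = 0.0
    return Xf
```

The output has only as many points per diagram as the fullest diagram needs, but the points come out ordered by persistence rather than in their input order.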

ulupo commented 4 years ago

Thanks @gtauzin! Indeed, I was hasty and did not think carefully enough about it before pushing. I have now pushed an alternative method which achieves the same reduction in size as before but performs fewer computations than argsort, resulting in better performance relative to the current implementation. For instance, on fake input of shape (10000, 2000, 3) where, for each entry, 1000 persistence pairs are in dimension 0 and 1000 are in dimension 1, I observe a 5x speedup when averaging over various values of the cutoff.

UPDATE: The performance figures above are incorrect. The savings are there but are much more modest (roughly a 40% speedup in the previous example). Profiling the code shows that argsort does indeed become more and more burdensome as the data grows. The reason the net gains are not more spectacular is that argsort allows simple array slices to be used downstream, which I believe is faster than the indexing made necessary by the current approach.

Still, I advocate for the new approach at least for the following reason: a persistence pair appears later than another in the output filtered array if and only if it appeared later in the input array. This makes debugging/visual comparison easier and makes the output look just like the output of a modified (imaginary) version of the PH algorithm which simply did not record pairs without sufficiently large persistence.
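To make the ordering property concrete, here is a rough sketch of an argsort-free filter that keeps surviving pairs in their input order. It is not the code in this PR: the helper name filter_preserving_order is hypothetical, homology dimensions are ignored, and zero padding is assumed for simplicity.

```python
import numpy as np


def filter_preserving_order(X, epsilon):
    """Hypothetical sketch: drop short-lived pairs without sorting.

    X has shape (n_samples, n_points, 3) with columns (birth, death, dim).
    Homology dimensions are ignored here, which is a simplification.
    """
    keep = (X[:, :, 1] - X[:, :, 0]) > epsilon
    # Largest number of surviving points in any single diagram.
    n_keep = int(keep.sum(axis=1).max())
    # Output starts as all padding (zeros, by assumption).
    Xf = np.zeros((X.shape[0], n_keep, X.shape[2]), dtype=X.dtype)
    # Row-major order of nonzero keeps each diagram's input order.
    sample_idx, point_idx = np.nonzero(keep)
    # Running count of survivors gives each one its output slot.
    target = (np.cumsum(keep, axis=1) - 1)[keep]
    # Fancy indexing replaces the simple slice used after argsort.
    Xf[sample_idx, target] = X[sample_idx, point_idx]
    return Xf
```

The scatter via index arrays is the kind of indexing that a sorted array would avoid, which is consistent with the modest net speedup reported above.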

ulupo commented 4 years ago

Thanks @gtauzin!

ulupo commented 4 years ago

@gtauzin I have now added extensive inline comments to guide through the new logic. Let me know what you think!