I hope I addressed all the issues!
I think we should edit the code here and anywhere else we use entropy (in particular `gtda/mapper/filter.py` for `Entropy`, and `gtda/time_series/features.py` for `PersistenceEntropy`) to use `scipy.stats.entropy`. It is suitable for 2D arrays and arbitrary bases, handles zeros well (so we don't have to silence warnings), and is a one-liner!
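A minimal sketch of what that one-liner could look like (array contents and variable names are illustrative):

```python
import numpy as np
from scipy.stats import entropy

# Lifespans of the points of two toy diagrams, one diagram per row.
# Rows need not be pre-normalized, and zero lifespans contribute
# nothing, so no warning-silencing is needed for them.
lifespans = np.array([[1.0, 2.0, 0.0],
                      [0.5, 0.5, 1.0]])

# entropy normalizes each *column* to a probability distribution, so
# transposing yields one base-2 entropy value per diagram.
entropies = entropy(lifespans.T, base=2)
```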
There is a side effect in the particular case of `PersistenceEntropy`, because we then have to calculate the quantity now called `lifespan_sums` again (and this is technically computed inside the code for `scipy.stats.entropy`).
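Continuing the sketch above: the sums have to be recomputed because `scipy.stats.entropy` does not return the quantities it divides by internally (the normalization step shown here is an assumption on my part):

```python
# Per-diagram sums of lifespans, recomputed outside entropy.
lifespan_sums = lifespans.sum(axis=1)

# Assumed form of the normalization (see the eq. 4 discussion below).
normalized_entropies = entropies / np.log2(lifespan_sums)
```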
Oops, there is one failing test: `gtda/tests/test_pipeline.py::test_grid_search_time_series`. Not ready yet.
@wreise PR #451 resolves the current build issues on the pipeline. You are encountering issues because the pipeline downloads the latest master version of `pybind11`, and their latest changes broke compilation even on my local machine. I'm currently looking for a solution, but in the meantime the compilation pipelines work again :)
@wreise I think the failure in the tests exposes the following issue: when there is no non-trivial point, `scipy.stats.entropy` computes the entropy as `np.nan` (and throws a `RuntimeWarning`). This is quite meaningful in my opinion, but it will typically lead to errors if the outputs are fed to downstream estimators/transformers (one reason is that these typically catch arrays with NaNs via `check_array`).
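For concreteness, a tiny illustration of that edge case:

```python
import numpy as np
from scipy.stats import entropy

# An all-zero lifespan vector (no non-trivial points): normalizing it
# involves a 0/0 division, so entropy returns np.nan and NumPy emits a
# RuntimeWarning ("invalid value encountered in ...").
print(entropy(np.zeros(3), base=2))  # nan
```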
This is making me think again about what we've been doing in the current implementation: `np.nan_to_num` has been taking care of these scenarios by silently changing `np.nan`s to zeros. While this avoids later exceptions, it is potentially quite dangerous for downstream estimators built around these features. There is a huge difference between saying, e.g., that a persistence diagram in degree 1 consists of only one non-trivial feature, and saying that it has only trivial features. But a classifier (for instance) built on top of a persistence entropy feature simply wouldn't have a way of telling these apart in the current implementation. So I think the current approach needs revisiting.
One idea would be to introduce an extra parameter, say e.g. `nan` (to copy `np.nan_to_num`). This could default to `None`, meaning no NaN filling (at the user's risk), but could also allow arbitrary floats.
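A hypothetical sketch of what that could look like (the name `nan` and the exact semantics are assumptions from this discussion, not a final API):

```python
import numpy as np

# Hypothetical helper: nan=None leaves NaNs untouched (at the user's
# risk); a float value replaces them, mirroring np.nan_to_num.
def fill_nan(entropies, nan=None):
    if nan is None:
        return entropies
    return np.where(np.isnan(entropies), float(nan), entropies)
```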
I wonder how we should handle warnings in these cases. If we opt for always silencing warnings, the documentation should be very clear about the default behaviour (whichever it is).
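If we do opt for silencing, one possible shape for it (an assumption, reusing `lifespans` from the sketch above):

```python
import warnings

from scipy.stats import entropy

# Silence only the RuntimeWarning from the 0/0 normalization of
# all-zero rows; other warnings still reach the user.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    entropies = entropy(lifespans.T, base=2)
```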
What do you think?
Reference issues/PRs
Types of changes
Description
Add the option to normalize the entropy returned by `PersistenceEntropy`, according to the heuristic proposed by Myers et al. (2019) in eq. 4. Up to now, the calculated entropy was the one in eq. 3 (a sketch of both quantities is included below).
Screenshots (if appropriate)
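For reference, my reading of the two quantities from the description (the eq. 4 normalization below is an assumption and should be checked against Myers et al. (2019)):

$$E(D) = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad p_i = \frac{\ell_i}{\sum_{j=1}^{n} \ell_j} \quad \text{(eq. 3)}$$

$$E'(D) = \frac{E(D)}{\log_2\!\left(\sum_{j=1}^{n} \ell_j\right)} \quad \text{(eq. 4, assumed)}$$

where $\ell_1, \dots, \ell_n$ are the lifespans of the points of the diagram $D$.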
Any other comments?
Fix indentation in mapper visualization test.
Checklist
I have used `flake8` to check my Python changes.
I have used `pytest` to check this on Python tests.