I hope I addressed all the issues!
I think we should edit the code here and anywhere else we use entropy (in particular `gtda/mapper/filter.py` for `Entropy`, and `gtda/time_series/features.py` for `PersistenceEntropy`) to use `scipy.stats.entropy`. It is suitable for 2D arrays and arbitrary bases, handles zeros well (so we don't have to silence warnings), and is a one-liner!
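A minimal sketch of what that one-liner could look like (array contents and variable names are illustrative):

```python
import numpy as np
from scipy.stats import entropy

# Lifespans of the points of two toy diagrams, one diagram per row.
# Rows need not be pre-normalized, and zero lifespans contribute
# nothing, so no warning-silencing is needed for them.
lifespans = np.array([[1.0, 2.0, 0.0],
                      [0.5, 0.5, 1.0]])

# entropy normalizes each *column* to a probability distribution, so
# transposing yields one base-2 entropy value per diagram.
entropies = entropy(lifespans.T, base=2)
```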
There is a side effect in the particular case of `PersistenceEntropy`, because we then have to calculate the quantity now called `lifespan_sums` again (and this is technically computed inside the code for `scipy.stats.entropy`).
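Continuing the sketch above: the sums have to be recomputed because `scipy.stats.entropy` does not return the quantities it divides by internally (the normalization step shown here is an assumption on my part):

```python
# Per-diagram sums of lifespans, recomputed outside entropy.
lifespan_sums = lifespans.sum(axis=1)

# Assumed form of the normalization (see the eq. 4 discussion below).
normalized_entropies = entropies / np.log2(lifespan_sums)
```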
Oops, there is one failing test: `gtda/tests/test_pipeline.py::test_grid_search_time_series`. Not ready yet.
@wreise PR #451 resolves the current build issues on the pipeline. You are encountering issues because the pipeline downloads the latest master version of `pybind11`, and their latest changes broke compilation even on my local machine. I'm currently looking for a solution, but in the meantime the compilation pipelines work again :)
@wreise I think the failure in the tests exposes the following issue: when there is no non-trivial point, `scipy.stats.entropy` computes the entropy as `np.nan` (and throws a `RuntimeWarning`). This is quite meaningful in my opinion, but it will typically lead to errors if the outputs are fed to downstream estimators/transformers (one reason is that these typically catch arrays with NaNs via `check_array`).
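For concreteness, a tiny illustration of that edge case:

```python
import numpy as np
from scipy.stats import entropy

# An all-zero lifespan vector (no non-trivial points): normalizing it
# involves a 0/0 division, so entropy returns np.nan and NumPy emits a
# RuntimeWarning ("invalid value encountered in ...").
print(entropy(np.zeros(3), base=2))  # nan
```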
This is making me think again about what we've been doing in the current implementation: `np.nan_to_num` has been taking care of these scenarios by silently changing `np.nan`s to zeros. While this avoids later exceptions, it is potentially quite dangerous for downstream estimators built around these features. There is a huge difference between saying, e.g., that a persistence diagram in degree 1 consists of only one non-trivial feature, and saying that it has only trivial features. But a classifier (for instance) built on top of a persistence entropy feature simply wouldn't have a way of telling these apart in the current implementation. So I think the current approach needs revisiting.
One idea would be to introduce an extra parameter, say e.g. `nan` (to copy `np.nan_to_num`). This could default to `None`, meaning no NaN filling (at the user's risk), but could also allow arbitrary floats.
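A hypothetical sketch of what that could look like (the name `nan` and the exact semantics are assumptions from this discussion, not a final API):

```python
import numpy as np

# Hypothetical helper: nan=None leaves NaNs untouched (at the user's
# risk); a float value replaces them, mirroring np.nan_to_num.
def fill_nan(entropies, nan=None):
    if nan is None:
        return entropies
    return np.where(np.isnan(entropies), float(nan), entropies)
```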
I wonder how we should handle warnings in these cases. If we opt for always silencing warnings, the documentation should be very clear about the default behaviour (whichever it is).
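If we do opt for silencing, one possible shape for it (an assumption, reusing `lifespans` from the sketch above):

```python
import warnings

from scipy.stats import entropy

# Silence only the RuntimeWarning from the 0/0 normalization of
# all-zero rows; other warnings still reach the user.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    entropies = entropy(lifespans.T, base=2)
```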
What do you think?
Reference issues/PRs
Types of changes
Description
Add the option to normalize the entropy returned by `PersistenceEntropy`, according to the heuristic proposed by Myers et al. (2019) in eq. 4. Up to now, the calculated entropy was the one in eq. 3 (a sketch of both quantities is included below).
Screenshots (if appropriate)
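For reference, my reading of the two quantities from the description (the eq. 4 normalization below is an assumption and should be checked against Myers et al. (2019)):

$$E(D) = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad p_i = \frac{\ell_i}{\sum_{j=1}^{n} \ell_j} \quad \text{(eq. 3)}$$

$$E'(D) = \frac{E(D)}{\log_2\!\left(\sum_{j=1}^{n} \ell_j\right)} \quad \text{(eq. 4, assumed)}$$

where $\ell_1, \dots, \ell_n$ are the lifespans of the points of the diagram $D$.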
Any other comments?
Fix indentation in mapper visualization test.
Checklist
I have used `flake8` to check my Python changes.
I have used `pytest` to check this on Python tests.