giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs

Fix hypothesis for mapper nerve tests #638

Closed. wreise closed this pull request 2 years ago.

wreise commented 2 years ago

Reference issues/PRs

Fixes the tests from #636.

Types of changes

Description

I reduced the sizes of the examples generated by hypothesis in the nerve tests, by lowering the min_side parameter and adding a max_side parameter. I also introduced a safeguard against Gaussian smoothing with sigma_pixel too large.

Checklist
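The effect of bounding example sizes can be illustrated with a toy hypothesis strategy. This is a sketch, not the actual nerve-test code: the dtype, element bounds, and side limits below are made up for illustration.

```python
import numpy as np
from hypothesis import given, settings
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays, array_shapes

# Illustrative strategy: bounding min_side AND max_side stops hypothesis from
# generating degenerate or pathologically large arrays.
bounded_arrays = arrays(
    dtype=np.float64,
    shape=array_shapes(min_dims=2, max_dims=2, min_side=2, max_side=8),
    elements=st.floats(-1e3, 1e3),
)

@settings(max_examples=20, deadline=None)
@given(bounded_arrays)
def test_sizes_are_bounded(x):
    # Every generated example respects both side bounds.
    assert all(2 <= s <= 8 for s in x.shape)

test_sizes_are_bounded()  # raises if any generated example violates the bounds
```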

wreise commented 2 years ago

I have trouble reproducing the errors from the failing tests hk_shape and hk_positive. Both tests do fail, but running their examples line by line executes normally.

wreise commented 2 years ago

The HeatKernel tests fail because there is a dimension in which the diagram contains a single point, almost on the diagonal. Then, in heat_, step_size is very small and sampling covers a very narrow range. The resulting image is still correct and as expected. However, since sigma is set uniformly over dimensions, it can be orders of magnitude larger than that range, and the call to gaussian_filter fails.
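The mismatch can be made concrete with a small sketch. The names step_size and sigma_pixel mirror the discussion, but the diagram and the numbers are invented for illustration, not taken from the failing test:

```python
import numpy as np

# One homology dimension whose diagram contains a single point almost on the
# diagonal: the sampled filtration range is then extremely narrow.
births = np.array([0.5])
deaths = np.array([0.5 + 1e-8])
n_bins = 100
step_size = (deaths.max() - births.min()) / n_bins  # ~1e-10

# sigma is chosen once, in data units, across all dimensions; a value that is
# reasonable elsewhere is enormous relative to this dimension's range.
sigma = 0.1
sigma_pixel = sigma / step_size  # Gaussian width measured in pixels

# sigma_pixel is many orders of magnitude larger than the 100-pixel image,
# which is what breaks the subsequent gaussian_filter call.
print(sigma_pixel)
```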

It seems to be a bug in the logic: the bins are calculated per dimension, but a single sigma is used. @ulupo, what if we modify sampling to force its range to be at least of the order of sigma? For example, we could make sure that we set the range to [min(x) - c*sigma, max(x) + c*sigma], for some c?
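One way to realize this suggestion is to pad the sampled range by a few multiples of sigma, so the bin width can never be tiny relative to sigma. The helper below is hypothetical (padded_sampling and the default c are not giotto-tda code):

```python
import numpy as np

def padded_sampling(values, n_bins, sigma, c=3.0):
    """Sample n_bins points over a range widened by c*sigma on each side,
    so that step_size is never negligible compared to sigma.
    Hypothetical helper illustrating the suggestion, not giotto-tda code."""
    lo = np.min(values) - c * sigma
    hi = np.max(values) + c * sigma
    sampling, step_size = np.linspace(lo, hi, n_bins, retstep=True)
    return sampling, step_size

# The degenerate near-diagonal case from before is now harmless:
sampling, step = padded_sampling(np.array([0.5, 0.5 + 1e-8]), 100, sigma=0.1)
# sigma / step is now moderate, so the later Gaussian smoothing stays well-posed.
```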

ulupo commented 2 years ago

Thanks for putting so much energy into chasing this bug @wreise! Your explanation is very clear, thanks!

I see what you suggest, and it makes sense. An alternative would be to express sigma in units of the bin size, so that, in particular, the absolute value of sigma would automatically differ by dimension. As a user, I would then set the number of bins (easy to understand, since it is equivalent to setting the number of pixels) as well as the width of my Gaussians in those units, which should be fairly intuitive to grasp. What do you think?
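A minimal sketch of this alternative API, assuming sigma is specified directly in pixel units (smooth_in_pixel_units is a hypothetical name, not giotto-tda's actual interface):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_in_pixel_units(image, sigma_pixels):
    """Smooth with sigma given in units of the bin (pixel) size. Because the
    pixel size differs per homology dimension, the effective data-units sigma
    adapts to each dimension's range automatically.
    Hypothetical sketch of the alternative, not giotto-tda's actual API."""
    return gaussian_filter(image, sigma=sigma_pixels)

# A unit impulse is spread over a few pixels regardless of the data range
# that the 100x100 grid happens to represent.
image = np.zeros((100, 100))
image[50, 50] = 1.0
smoothed = smooth_in_pixel_units(image, sigma_pixels=2.0)
```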

wreise commented 2 years ago

That's an alternative I hadn't considered! It amounts to thinking of sigma in units of "number of pixels", no?

I have one example in which it is convenient to have sigma in the units of the data. Consider classification of noisy point clouds, using persistent homology in various dimensions: I think you would want to choose sigma of the order of the noise. Although the persistence images paper says that the variance doesn't matter...

I have a third option, which involves the fewest changes. Going back to the example I described in the previous comment, with sigma huge and the range small, what you would expect as a result of gaussian_filter is just a constant vector. I think I can come up with a numerically safe gaussian_filter method which does just that. It amounts to choosing a threshold for sigma_pixel above which gaussian_filter(x, sigma) is approximated by a constant vector. This threshold would be purely for numerical precision and, in my opinion, should be hard-coded. What would you say?
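The proposed safeguard might look roughly like the following. Both the wrapper name and the threshold value are hypothetical, chosen here only to illustrate the idea, not taken from the merged implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Hypothetical hard-coded threshold: above it, the filtered signal is
# numerically indistinguishable from a constant, and building the (enormous)
# Gaussian kernel would be wasteful at best.
MAX_SIGMA_PIXEL = 1e4

def safe_gaussian_filter(x, sigma_pixel):
    """gaussian_filter that degrades gracefully for huge sigma_pixel by
    returning the constant (mean) vector the smoothing would converge to.
    Sketch of the proposed safeguard, not the actual giotto-tda code."""
    if sigma_pixel > MAX_SIGMA_PIXEL:
        return np.full_like(x, np.mean(x))
    return gaussian_filter(x, sigma=sigma_pixel)

x = np.arange(10.0)
flat = safe_gaussian_filter(x, 1e6)   # constant vector, no giant kernel built
mild = safe_gaussian_filter(x, 1.0)   # ordinary smoothing below the threshold
```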

ulupo commented 2 years ago

Thanks again! Some good thoughts there. I like the idea of making the least changes and your suggestion seems smart to me. Please feel free to try it!

ulupo commented 2 years ago

There's still an HK test that fails (I think this is one of those flaky tests that used to fail in the past already). Could it be easy to solve with your latest technology? https://dev.azure.com/maintainers/Giotto/_build/results?buildId=3619&view=logs&j=f1a0a6a1-f9c7-5c0b-02a6-90f84662480b&t=bf059a84-475e-57cb-5aef-ee9999aa7f8f&l=478

ulupo commented 2 years ago

@wreise an issue with the wheels for macOS was revealed by this PR, since we were not able to get to the install-wheels step in recent Azure runs! I hope my latest commit in this PR fixes things.