giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs

Add datasets module to load and generate toy datasets #345

Open lewtun opened 4 years ago

lewtun commented 4 years ago

Description

scikit-learn has a datasets module that provides handy utility functions to load and generate toy datasets. These functions feature prominently in the scikit-learn examples, and it would be nice to have similar functionality in giotto-tda.

Suggestions for synthetic datasets include:

Suggestions for point cloud and graph datasets could take inspiration from PyTorch Geometric's datasets module.

ammedmar commented 4 years ago

This would be good. @gtauzin and I started doing something along the lines of the make_point_clouds methods you are envisioning and managed to get a few nice spaces and constructions on spaces. The reason this was not completed was the lack of uniformity of the sampling. To do this well, the probability density has to be modified by a Hessian term associated with the parametrization of the curved space. Maybe we can revisit this point sometime.
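
To make the uniformity issue concrete, here is a minimal sketch for the torus, which is not the code referred to above. It assumes the standard parametrization of a torus in R^3 with major radius R and minor radius r (R > r), whose area element is proportional to R + r cos(theta). Sampling both angles uniformly therefore over-represents the inner ring, and one way to correct this is rejection sampling on theta. The function name `sample_torus` and its parameters are hypothetical and not part of giotto-tda.

```python
import numpy as np


def sample_torus(n_points, R=2.0, r=1.0, uniform=True, seed=None):
    """Sample points on a torus in R^3 (hypothetical helper, assumes R > r).

    With uniform=True, theta is drawn with density proportional to
    R + r * cos(theta), which makes the points uniform with respect to
    surface area. With uniform=False, both angles are drawn uniformly.
    """
    rng = np.random.default_rng(seed)
    # The angle around the main axis is uniform in both cases.
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    if uniform:
        theta = np.empty(0)
        while theta.size < n_points:
            # Propose uniformly, accept with probability (R + r cos) / (R + r).
            cand = rng.uniform(0.0, 2.0 * np.pi, size=2 * n_points)
            height = rng.uniform(0.0, R + r, size=cand.size)
            theta = np.concatenate([theta, cand[height < R + r * np.cos(cand)]])
        theta = theta[:n_points]
    else:
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    rho = R + r * np.cos(theta)  # distance from the main axis
    return np.column_stack([rho * np.cos(phi), rho * np.sin(phi), r * np.sin(theta)])
```

Calling `sample_torus(1000, uniform=False)` gives the naive sampling for comparison; the difference shows up as a denser band of points near the hole of the torus.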

lewtun commented 4 years ago

Cool, it seems you guys went for the hardcore version :) All I had in mind were spheres and tori with Gaussian noise added, but perhaps this is too limiting.
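
As a rough illustration of the simple version, the sketch below generates a noisy sphere and a noisy torus with NumPy. The names `make_noisy_sphere` and `make_noisy_torus` and their parameters are hypothetical and do not exist in giotto-tda. The sphere is sampled uniformly by normalizing Gaussian vectors; the torus angles are drawn uniformly, so the torus points are not uniform in surface area, which is the uniformity concern raised earlier in this thread.

```python
import numpy as np


def make_noisy_sphere(n_points=100, radius=1.0, noise=0.05, seed=None):
    """Points on a 2-sphere in R^3 plus isotropic Gaussian noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Normalized Gaussian vectors are uniformly distributed on the sphere.
    pts = rng.standard_normal((n_points, 3))
    pts *= radius / np.linalg.norm(pts, axis=1, keepdims=True)
    return pts + noise * rng.standard_normal(pts.shape)


def make_noisy_torus(n_points=100, R=2.0, r=1.0, noise=0.05, seed=None):
    """Points on a torus in R^3 plus isotropic Gaussian noise (hypothetical helper).

    Both angles are uniform, so the sampling is NOT uniform in surface area.
    """
    rng = np.random.default_rng(seed)
    theta, phi = rng.uniform(0.0, 2.0 * np.pi, size=(2, n_points))
    rho = R + r * np.cos(theta)  # distance from the main axis
    pts = np.column_stack([rho * np.cos(phi), rho * np.sin(phi), r * np.sin(theta)])
    return pts + noise * rng.standard_normal(pts.shape)
```

Both functions return an array of shape `(n_points, 3)`, which is a single point cloud; stacking several such arrays would give a collection of point clouds.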

If you have some Python code lying around, you could make a GitHub gist and link it in these comments.

ammedmar commented 4 years ago

The code is not so important, especially since it doesn't do what one would really like it to do, but since you asked, I am sending code that samples a point cloud near the real projective plane embedded in R^4.

To get this done properly, what we need is a method that can sample an interval according to a custom, not necessarily uniform, probability density function. Any leads on something like this?

The first part of this notebook has the sampling functions for S2 and RP2. I just ran it and the plotting still works.

wreise commented 4 years ago

I wanted to have a look at the notebook, but I do not have access rights; you should receive an email requesting them.

For sampling from arbitrary densities, something like Metropolis-Hastings? Or, if the density is represented as a discretized array, maybe inverse transform sampling?
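
A minimal sketch of both ideas, using only NumPy, might look like the following. The function names and parameters are hypothetical. The first function assumes the density is tabulated on an increasing grid and is strictly positive there, and it inverts a piecewise-linear approximation of the CDF. The second is the random-walk special case of Metropolis-Hastings (symmetric Gaussian proposal), taking an unnormalized log-density.

```python
import numpy as np


def inverse_transform_sample(grid, density, n_samples, seed=None):
    """Sample from a density tabulated on an increasing 1D grid.

    Assumes density > 0 on the grid; it does not need to be normalized.
    """
    rng = np.random.default_rng(seed)
    grid = np.asarray(grid, dtype=float)
    density = np.asarray(density, dtype=float)
    # Trapezoid rule gives the (unnormalized) CDF at the grid points.
    increments = 0.5 * (density[1:] + density[:-1]) * np.diff(grid)
    cdf = np.concatenate([[0.0], np.cumsum(increments)])
    cdf /= cdf[-1]
    # Push uniform draws through the interpolated inverse CDF.
    return np.interp(rng.uniform(size=n_samples), cdf, grid)


def random_walk_metropolis(log_density, x0, n_samples, step=1.0, burn_in=1000, seed=None):
    """Random-walk Metropolis sampler for a 1D unnormalized log-density.

    log_density should return -inf outside the support, and x0 must lie
    inside the support.
    """
    rng = np.random.default_rng(seed)
    x, logp = float(x0), log_density(x0)
    samples = np.empty(n_samples)
    for i in range(burn_in + n_samples):
        proposal = x + step * rng.standard_normal()
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log1p(-rng.uniform()) < logp_new - logp:
            x, logp = proposal, logp_new
        if i >= burn_in:
            samples[i - burn_in] = x
    return samples


# Example target on [0, 2*pi]: density proportional to 2 + cos(theta).
grid = np.linspace(0.0, 2.0 * np.pi, 512)
its_samples = inverse_transform_sample(grid, 2.0 + np.cos(grid), 20000, seed=0)
```

Inverse transform sampling is cheap and gives independent draws, but needs the density on a grid. The Metropolis variant only needs pointwise evaluations of the density, at the cost of correlated samples and a step size to tune.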