scikit-tda / kepler-mapper

Kepler Mapper: A flexible Python implementation of the Mapper algorithm.
https://kepler-mapper.scikit-tda.org
MIT License

max hypercube sample size #118

Open MLWave opened 5 years ago

MLWave commented 5 years ago

One problem is a large concentration of samples in a certain projection range. Graphs may look good for the non-concentrated samples, but lack structure for the large concentrations. It can also slow clustering algorithms down drastically when 25,000+ samples land in a single hypercube.

I would like to be able to set a max_clustering_size. When this size is exceeded, that hypercube should be subdivided into smaller hypercubes. Preferably, this subdivision could be set in a custom manner.

The only ways to overcome this currently are to set a finer resolution or to add another dimension to the projection, but these settings drastically alter the output graphs. Another partial workaround is to rank-order the projection (which puts an even number of samples in each hypercube), but this method also erases legitimate clusters by stretching them out.
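
For reference, the rank-order trick can be done with something like this (a sketch only; scipy's rankdata is just one way to do it, and the variable names are mine):

import numpy as np
from scipy.stats import rankdata

# Replace each lens value by its rank, so an evenly spaced cover
# puts roughly the same number of samples in every hypercube.
# Dense regions get stretched out in the process, which is exactly
# the downside mentioned above.
lens = np.random.lognormal(size=1000)            # heavily concentrated lens
ranked_lens = rankdata(lens, method="ordinal").astype(float).reshape(-1, 1)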

The reason I'd like a custom manner is that you may use an overlap percentage of 100% or more, and recursing on that makes my brain hurt too much. Though I can think of ways to implement this, I can't think of a solid API for adding custom mapping on hypercubes. At least, not until we make covers proper classes, like kmapper.cover.MultiScaleCubical(). Then max_sample_cluster and subcover=kmapper.cover.MultiScaleCubical() could be parameters of MultiScaleCubical.

sauln commented 5 years ago

I like this idea and believe making a separate cover class would be a good approach.

One thought I had was to adapt something like a Barnes-Hut tree to have an overlap parameter.
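
Roughly what I have in mind, in one dimension (a sketch only; nothing here is an existing kmapper API): keep bisecting an interval while it holds more than some capacity, then widen every leaf by the overlap fraction.

import numpy as np

def adaptive_intervals(lens, lo, hi, capacity, perc_overlap, depth=0, max_depth=12):
    # Recursively bisect [lo, hi) until each leaf holds <= capacity
    # samples (or max_depth is hit), then pad every leaf by perc_overlap.
    inside = lens[(lens >= lo) & (lens < hi)]
    if len(inside) <= capacity or depth >= max_depth:
        pad = (hi - lo) * perc_overlap / 2.0
        return [(lo - pad, hi + pad)]
    mid = (lo + hi) / 2.0
    return (adaptive_intervals(lens, lo, mid, capacity, perc_overlap, depth + 1, max_depth)
            + adaptive_intervals(lens, mid, hi, capacity, perc_overlap, depth + 1, max_depth))

lens = np.random.lognormal(size=10000)
intervals = adaptive_intervals(lens, lens.min(), lens.max() + 1e-9,
                               capacity=1000, perc_overlap=0.1)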

What do you mean by "not until we make those proper classes"?

MLWave commented 5 years ago

Adding it to mapper with parameters feels cluttered:

mapper.map(
    n_cubes=4, 
    perc_overlap=0.1,
    max_sample_cluster=1000, 
    sub_n_cubes=2, 
    sub_perc_overlap=0.2)

Instead, I am thinking of (but haven't fleshed out yet) something like:

mapper.map(cover=km.cover.MultiScaleCubical(n_cubes=[4,10], perc_overlap=[0.5, 0.5]))

and either adding subcover parameters to .MultiScaleCubical or to .map. Or perhaps separate classes RecursiveCubical and RecursiveMultiScaleCubical?

I really like the Barnes-Hut idea. If I understand correctly, the subdivision is driven automatically by the number of samples in the splits. That seems smarter than simple sub-mapping with evenly spaced, fat partitioning.

sauln commented 5 years ago

Have you seen the newly supported interface for covers?

mapper.map(lens, X,
    cover=CubicalCover(n_cubes=10, perc_overlap=0.1)
)

I'd like to phase out the n_cubes and perc_overlap parameters in map as much as possible so we can adopt other more interesting covers, like the one you've suggested.

You're right about the Barnes-Hut. We would probably have to build a custom implementation so we can specify how many samples we want per hypercube.

Also, in the example code you posted, are you setting a different number of cubes and overlap for each dimension of the lens? If so, that feature should be supported in the normal cover.

mapper.map(lens, X,
    cover=cover.CubicalCover(n_cubes=[5, 10], perc_overlap=[0.2, 0.4])
)

MLWave commented 5 years ago

Thanks, I totally missed that KeplerMapper already supports this! Such a subtle implementation ;).

So I made a km.cover.RecursiveCubicalCover(), and it works (kinda). The problem is the overlap: it quickly becomes overlap on top of overlap, and the edges explode (entire hypercubes may end up inside the overlap between two larger hypercubes).
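
For the record, the core of the idea looks roughly like this (a simplified 1-D sketch, not the actual class), and you can see where the padding compounds:

import numpy as np

def recursive_bins(lens, n_cubes, perc_overlap, max_samples, depth=0, max_depth=3):
    # Cover the range of lens with n_cubes overlapping intervals and
    # recurse into any interval that holds more than max_samples points.
    # The recursion runs on the *padded* members, so sub-intervals inherit
    # the parent's widened range: this is the overlap-of-overlap problem.
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_cubes
    pad = width * perc_overlap
    bins = []
    for i in range(n_cubes):
        a, b = lo + i * width - pad, lo + (i + 1) * width + pad
        members = lens[(lens >= a) & (lens <= b)]
        if len(members) > max_samples and depth < max_depth:
            bins.extend(recursive_bins(members, n_cubes, perc_overlap,
                                       max_samples, depth + 1, max_depth))
        else:
            bins.append((a, b, len(members)))
    return bins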

I think a solution may be to create subgraphs: either attach them to the node that exceeded max_samples_cluster, or show them separately as their own graphs. This would require hooking deeper into .map(), though.
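
Although the attach-to-node version would need hooks inside .map(), the separate-graphs version could maybe be approximated as a post-processing step over the returned graph dict (a sketch; the max_samples_cluster threshold and the toy data are mine):

import numpy as np
import kmapper as km

X = np.random.rand(5000, 3)                      # toy data for illustration
mapper = km.KeplerMapper()
lens = mapper.fit_transform(X, projection=[0])
graph = mapper.map(lens, X, cover=km.cover.CubicalCover(n_cubes=4, perc_overlap=0.1))

max_samples_cluster = 1000                       # hypothetical threshold
subgraphs = {}
for node_id, members in graph["nodes"].items():
    if len(members) > max_samples_cluster:
        # Re-run Mapper on just the members of the oversized node with a
        # finer cover, and keep the result as its own graph.
        members = np.asarray(members)
        subgraphs[node_id] = mapper.map(
            lens[members], X[members],
            cover=km.cover.CubicalCover(n_cubes=10, perc_overlap=0.1),
        )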
