scikit-tda / kepler-mapper

Kepler Mapper: A flexible Python implementation of the Mapper algorithm.
https://kepler-mapper.scikit-tda.org
MIT License

Better visualizations #25

Closed MLWave closed 6 years ago

MLWave commented 6 years ago

Currently:

Implement/Research:

Re-visit:

MLWave commented 6 years ago

"mock"

MLWave commented 6 years ago

Enabled:

MLWave commented 6 years ago

Interesting, no clue if TDA legal:

If you use nr_cubes (which should perhaps be n_cubes, more in line with the sklearn API's n_ prefix) of 1, you can also just use KeplerMapper for cluster-algorithm analysis. Next to building maps we could also support building dendrograms (unrooted trees), or have these dendrograms built from node members and expand on node click.

sauln commented 6 years ago

very cool. those edits look really great. I look forward to exploring them all more.

These are awesome ideas for application of mapper.

Visualizing dendrograms alongside the map could be very cool. I think that information would greatly help in tuning the parameters quickly.

Also, it would be very cool to be able to export the print friendly version from the browser version. That way you could rearrange the map to look how you want. Maybe the print version already has the ability to do this?

yuzuhikorunrun commented 6 years ago

these new updates are so cool. thanks guys. can't wait to try them out. btw I think we are almost 80% similar to the ayasdi platform.

MLWave commented 6 years ago

I'm not sure what this means. Would anything change besides reordering?

I mean rank scaling (with something like ECDF or scipy.stats.rankdata). This makes the number of points that fall inside each interval much more uniform.

Lens = age [1,87,88,89]. Rankscaled Lens = [0, 1, 2, 3]. n_cubes = 2. Buckets with normal lens: [1] [87, 88, 89], buckets with rankscaled lens: [1, 87] [88, 89].
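The bucket example above can be reproduced with a small sketch; `bucket` is a toy helper for illustration, not the library's binning code:

```python
import numpy as np
from scipy.stats import rankdata

lens = np.array([1, 87, 88, 89])                 # toy "age" lens from the example
ranked = rankdata(lens, method="ordinal") - 1    # -> [0., 1., 2., 3.]

def bucket(values, members, n_cubes):
    # Split the range of `values` into n_cubes equal-width, non-overlapping
    # bins and return the original `members` that land in each bin.
    lo, hi = values.min(), values.max()
    width = (hi - lo) / n_cubes
    idx = np.minimum(((values - lo) / width).astype(int), n_cubes - 1)
    return [list(members[idx == i]) for i in range(n_cubes)]

print(bucket(lens, lens, 2))     # raw lens:         [[1], [87, 88, 89]]
print(bucket(ranked, lens, 2))   # rank-scaled lens: [[1, 87], [88, 89]]
```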

It sounds like we could use the mapping of a NN as a lens.

Yes, if we use FaceNet for instance, this creates 128-dimensional embeddings of faces. Now if we somehow turn this into good clusters, we can say for a certain cluster (old men with beards): these embedding dimensions are most active. So we could semi-automatically assign categories/clusters to neural/embedding activity (this (subset of) neuron(s) fires mostly on beards). This comes close to an old research project of mine: Zero-Shot Communication with ConvNets. Getting a bit too detailed, but if you feed it well-labeled images to predict (Google Image Search: "men with beards"), the net can learn to explain/describe its predictions in natural language. You can also communicate with it in natural language: "A woman was robbed inside the mall; she described her attacker as an older man with a beard and a ponytail. Show me all old men with beards and ponytails inside the mall right now." Combine this with LIME and you have a solid explainable/justifiable black box (https://github.com/marcotcr/lime). In essence, the same would work for XGBoost, and you could even use the explanations themselves as a lens.

Also, it would be very cool to be able to export the print friendly version from the browser version.

For now, I suppose this is a screenshot (you can change from Display to Print mode inside the visualization). Later we could have the option for mapper.map to output a NetworkX object instead of a Dict. Then all graph/network visualization scripts a researcher already has would work.
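Such a conversion could be sketched like this; the dict layout ({"nodes": ..., "links": ...}) is assumed from the current output format and may not match exactly:

```python
import networkx as nx

def to_networkx(graph):
    # Convert a KeplerMapper-style graph dict into a NetworkX graph.
    # Assumed layout: {"nodes": {id: [member indices]}, "links": {id: [ids]}}.
    g = nx.Graph()
    for node_id, members in graph["nodes"].items():
        g.add_node(node_id, membership=members)
    for source, targets in graph.get("links", {}).items():
        for target in targets:
            g.add_edge(source, target)
    return g

toy = {"nodes": {"a": [0, 1], "b": [1, 2], "c": [5]},
       "links": {"a": ["b"]}}
g = to_networkx(toy)   # three nodes, one edge a-b
```

From there, any NetworkX layout or export (GraphML, GEXF, matplotlib drawing) a researcher already uses should just work.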

Shall we open a single issue to brainstorm about the future? Or perhaps a Google Groups is better for this?

MLWave commented 6 years ago

Findings:

I propose I finish current revamped visualizations, and revisit templates and WebGL in the future, maybe when we switch to a dynamic front-end with local web server.

sauln commented 6 years ago

It sounds like the rank scaling would be equivalent to density-based variable-size bins. Do you know offhand any ways to accomplish this in multiple dimensions?

It sounds like a Google Group could be a better forum for this discussion. There are a lot of ideas floating around over multiple issue threads. It would be nice to give each one their own home.

Do you have any suggested material I could use to get caught up on these zero-shot communication ideas?

MLWave commented 6 years ago

Paper: https://ai2-s2-pdfs.s3.amazonaws.com/4b18/303edf701e41a288da36f8f1ba129da67eb7.pdf

Given the simplicity of the approach, there are many different research lines that can be pursued. In this work we focus on semantically meaningful attributes, but the development of similar ideas applied to word embeddings as in (Frome et al., 2013), is both promising and straightforward within this framework. Another interesting research line is to study the addition of non-linearities and more layers into the model, leading to a deep neural network where the top layer is fixed and interchangeable, and all the remaining layers are learned.

Code: https://github.com/MLWave/extremely-simple-one-shot-learning (Imagine the V matrix as a lens and the W matrix as the final layer of a Deep Net).

MLWave commented 6 years ago

It sounds like the rank scaling would be equivalent to density-based variable-size bins. Do you know offhand any ways to accomplish this in multiple dimensions?

https://github.com/MLWave/CloudToGrid and https://github.com/MLWave/RasterFairy

img

(I myself have experimented with 2-D square SOMs; this extends to multiple dimensions: during fitting, each bin is a neuron that fires when its content is closest to the learning sample, then all the data is mapped to the SOM in a first-come, first-served manner.)

E.g.: http://mlwave.github.io/som/

I realize this further lossy-compresses the projection and destroys shape, but it is useful for using plain cubical coverings for business analysis. The projection may be square [age, spend], cover = Cubical(n_cubes=[10,2]), clusterer = density-based on inverse_X. Now you have clusters of high and low spenders, for 10 equidistant age bins, across all your customers, with insight into what structures these clusters.
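A cover with a per-dimension number of cubes could be sketched as below; `Cubical(n_cubes=[10,2])` above is a proposal, so `cubical_cover` here is a made-up stand-in, not the library's Cover API:

```python
import numpy as np
from itertools import product

def cubical_cover(lens, n_cubes, perc_overlap):
    # For each hypercube in the cover, return the indices of the lens
    # points it contains. `n_cubes` is a list like [10, 2], one entry
    # per lens dimension; overlap widens every cube symmetrically.
    lens = np.asarray(lens, dtype=float)
    lo, hi = lens.min(axis=0), lens.max(axis=0)
    spacing = (hi - lo) / np.asarray(n_cubes)
    width = spacing * (1 + perc_overlap)
    cubes = []
    for idx in product(*(range(k) for k in n_cubes)):
        center = lo + (np.array(idx) + 0.5) * spacing
        inside = np.all(np.abs(lens - center) <= width / 2, axis=1)
        cubes.append(np.flatnonzero(inside))
    return cubes

# toy 2-D lens: [age, spend]
lens = [[20, 1], [30, 9], [60, 2], [80, 8]]
cubes = cubical_cover(lens, n_cubes=[2, 2], perc_overlap=0.5)
# four cubes, each catching one of the four corner points
```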

MLWave commented 6 years ago

Zero-Gravity Mode for easier manipulation of graphs/flares:

img

sauln commented 6 years ago

These visualizations look great. Do you have a schedule for integrating them into master?

I like that name ‘cubical’ that you’re using for the default cover. I’ll change the name and try to get some documentation incorporated over the next few days.

What do you think about branch schemes? It seems like it would be important to have master head match whatever is on pypi, and dev on a different branch.

MLWave commented 6 years ago

I wanted to make a pull request tonight. Tying things up now.

What do you think about branch schemes? It seems like it would be important to have master head match whatever is on pypi, and dev on a different branch.

This sounds like a good idea. I am not versed in working with open source software like this, so please take a lead on these things.

Right now I have my changes in a branch, limited to .visualize(). I think I need to check out the latest from GitHub, branch it again, make the changes, and open a pull request? I am not familiar with branch schemes, but if I understand correctly we can have kepler-mapper and kepler-mapper-dev, and just release from there on a schedule. That sounds optimal!

sauln commented 6 years ago

I am not versed in working with open source software like this,

I am in the same boat. I'll look around at some popular open source projects and see how they do it. Please don't wait on me to integrate your changes. I imagine it would work as you said. We could set dev to be the default branch (so pull requests automatically go to it) and have master be protected, only pushing to master right before we release.

I think to integrate master into your branch you should be able to get away with

git fetch origin
git merge origin/master

and that should be enough to merge in all the new changes.

MLWave commented 6 years ago

1500 nodes and 10k edges with dropshadow: img "Freezing gravity" animates to: img

MLWave commented 6 years ago

Generative Adversarial Networks can turn horses into art and zebras, but they can't do this: img 7500 nodes, 40000 edges

MLWave commented 6 years ago

If I understand correctly a nerve is a certain method to connect the clusters. We currently support "Graph Link" which connects clusters when their members show overlap.

I've experimented with "stacking" a hierarchical tree on top of the clusters found with TDA. From the mean of the features in the clusters, a distance matrix is built as input for hierarchical clustering:

img http://mlwave.github.io/tda/quartettree-28digits.html

Would a hierarchical cluster connection count as a "nerve"? Could one do both: first Graph Link, then a hierarchical nerve? It may make flares more prominent.

Finally: Is an overlap percentage > 100% legal? For some Mapper settings it also seems to promote interesting flares.

sauln commented 6 years ago

If I understand correctly a nerve is a certain method to connect the clusters. We currently support "Graph Link" which connects clusters when their members show overlap.

Essentially, yes. The nerve is a way to construct a simplicial complex from a cover: for each intersection of n cover elements, you add an (n-1)-simplex. Mapper estimates the elements of a cover by clusters. We only support building the 1-skeleton via pairwise intersections, which produces a graph.
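The 1-skeleton construction can be sketched in a few lines; this is a toy illustration of the "Graph Link" behaviour described above, not the library's implementation:

```python
from itertools import combinations

def one_skeleton(clusters):
    # One vertex per cluster, one edge per pair of clusters whose
    # member sets intersect (a non-empty pairwise intersection).
    nodes = list(clusters)
    edges = [(a, b) for a, b in combinations(nodes, 2)
             if set(clusters[a]) & set(clusters[b])]
    return nodes, edges

clusters = {"c1": [0, 1, 2], "c2": [2, 3], "c3": [7, 8]}
nodes, edges = one_skeleton(clusters)
# c1 and c2 share member 2, so the only edge is ("c1", "c2")
```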

Would an hierarchical cluster connection count as a "nerve"?

I'm not following how you build the hierarchy? The multimapper is constructed from a hierarchy of codomain covers, but the nerve is built from the pullback of that hierarchy of covers. Are you building a hierarchy in the domain?

Is an overlap percentage > 100% legal? For some Mapper settings it also seems to promote interesting flares.

If two bins have a 100% overlap, then they should be equivalent. It might be a bug in the way we create the bins that is causing this. I'm probably confusing myself...

overlap_comparison

I've put together a brief diagram of what I think is going on (it is not exact). I haven't actually looked at the code, but I suspect 100% means distance between bin centers = width of overlap. If this is true, there should be no problem going much higher than 100%.

Edit: the final arrow should only go to the 3rd grid line.
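The arithmetic can be made concrete with a small sketch. The convention below (bin width = spacing * (1 + perc_overlap)) is an assumption about the binning, not a reading of the actual code, but under it an overlap above 100% is perfectly legal: the bins simply get wider than twice the spacing.

```python
def bin_intervals(lo, hi, n_bins, perc_overlap):
    # Endpoints of n_bins equally spaced, overlapping intervals on [lo, hi].
    # Assumed convention: width = spacing * (1 + perc_overlap).
    spacing = (hi - lo) / n_bins
    width = spacing * (1 + perc_overlap)
    centers = [lo + (i + 0.5) * spacing for i in range(n_bins)]
    return [(c - width / 2, c + width / 2) for c in centers]

# 100% overlap: each bin is twice the spacing wide
ivals = bin_intervals(0.0, 10.0, n_bins=2, perc_overlap=1.0)
# 150% overlap still yields valid, even wider bins
wide = bin_intervals(0.0, 10.0, n_bins=2, perc_overlap=1.5)
```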

MLWave commented 6 years ago

Thanks for staying patient with my high school maths :). I certainly need to get up to speed with the correct terminology and formal notation. Your article on TDA/Mapper and clear explanations help with this.

My idea was something like this:

img

We map the data with 2 cubes per dimension (X,Y axis), overlap of 10%.

A) 1-skeleton on pairwise intersections
B) Quartet tree/Cayley tree based on the cluster quartet distance matrix (introduces two empty hierarchy nodes)
C) Combining both A and B

So: Global hierarchical clustering on a local density-based clustering guided by a filter function?

If anything, we can make the quartet tree edges completely transparent, so they are just there to provide some extra structure for the layout. Resulting graphs with this method are always fully connected, no floating nodes (resulting in each graph having its own "gravity", so you can move-drag the entire graph, which is useful for drawing multiple graphs on screen).

sauln commented 6 years ago

I'm glad someone has read them :man_dancing:. In a few weeks I'm hoping to push out a few more.

It sounds to me like this is similar to the multimapper, but maybe views the higher levels with more transparency. I like being able to view the fully connected version at the same time as the high-resolution version. It would be very helpful to know where a disconnected component would connect if you decreased resolution.

sauln commented 6 years ago

Still having trouble with pep8speaks?

We should plan a release 1.0.2 with the updated visualization, an official docs setup, and new cover and nerve builder changes.

I can sit with the dev branch for a bit this weekend and try to get everything polished.

MLWave commented 6 years ago

Just figured out how to do small changes to the dev branch without Travis-CI running every time. Tonight, I'll add the old parameters to visualize, and increase the test coverage.

PEP8Speaks is a necessary "evil". I need a way to get these hints while coding, not after I've written 500+ lines of PEP8-invalid code :). I'll look at installing a plug-in to have this closer to my IDE. Increasing the line width to 100 already muted a lot of violations.

It would be very helpful to know where a disconnected component would connect if you decreased resolution.

Cool, I'll start with the simplest version of hierarchical clustering (single-linkage) and, depending on usefulness, we can look at more advanced hierarchical clusterings, like the Minimum Quartet Tree Cost.
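The single-linkage starting point could look something like this with SciPy; `cluster_means` is made-up data standing in for the per-node feature means discussed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical per-cluster feature means (one row per Mapper node)
cluster_means = np.array([[0.0, 0.0],
                          [0.1, 0.0],
                          [5.0, 5.0]])

# Single-linkage tree over the condensed distance matrix of the means.
# Each row of `tree` is a merge: (node_a, node_b, distance, new cluster size).
tree = linkage(pdist(cluster_means), method="single")
# The two nearby nodes (0 and 1) merge first, at distance ~0.1
```

The resulting merge order could then drive extra (possibly transparent) hierarchy edges in the visualization.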

sauln commented 6 years ago

I'll look at installing a plug-in and having this closer to my IDE.

I use Atom, and the atom-beautify plug-in works great. I'd be surprised if most editors didn't have something similar. Which do you use?

PEP8Speaks supposedly can issue a pull request with fixes to violations. That would be very cool if it worked properly. I haven't tested it yet.

I'm interested in what you come up with. I've been playing around with the multimapper, which essentially does the same thing, just over multiple levels. It sounds like what you're talking about would be a multimapper, but with just two steps (the last step being set with 1 bin).

sauln commented 6 years ago

I'm closing this issue as the updated UI is on dev branch now.