leifeld / dna

Discourse Network Analyzer (DNA)
Disentangle functionalities of dna_cluster #178

Open leifeld opened 5 years ago

leifeld commented 5 years ago

dna_cluster does many things at once and is not very user-friendly. The different things should probably be accommodated in separate functions. We should think about a clever way to design the overall user interface for these functionalities, also in relation to the other plotting functions.

leifeld commented 5 years ago

@JBGruber I have added a new dna_multiclust function. It can serve as a replacement for both the dna_timeWindow and the dna_cluster function.

It takes a DNA connection and many of the same arguments as dna_network and applies a range of different cluster methods to the network. It returns the cluster memberships, modularity, and optionally the original hclust (or similar) object.

We can now start replacing the cluster functionality in dna_plotNetwork, dna_cluster etc. by using this function internally. It is possible to switch each cluster method on or off separately and to set a specific k or leave k flexible. The return object contains everything we need.

The function also replaces the time window functionality by accepting the same timewindow and windowsize arguments as the dna_network function. When these are set, the cluster methods are applied at every time point, and the modularity, cluster memberships, and cluster objects are returned for all of those time points (in most cases as data frames in long format). It also returns a data frame that indicates, for each time step, which cluster method maximizes modularity and the corresponding modularity value.
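To illustrate the last point, here is a minimal sketch of picking the modularity-maximizing method per time step from a long-format table. It is plain Python rather than R. The field names (time, method, modularity), the method names, and the numbers are invented for the example and are not the actual dna_multiclust output.

```python
# Hypothetical long-format records: one row per (time step, cluster method).
# Field names and values are assumptions made up for this illustration.
records = [
    {"time": "2020-01", "method": "hclust_ward", "modularity": 0.31},
    {"time": "2020-01", "method": "walktrap", "modularity": 0.35},
    {"time": "2020-02", "method": "hclust_ward", "modularity": 0.40},
    {"time": "2020-02", "method": "walktrap", "modularity": 0.28},
]


def best_per_time(rows):
    """Return {time: row}, keeping the row with the highest modularity per time step."""
    best = {}
    for row in rows:
        current = best.get(row["time"])
        if current is None or row["modularity"] > current["modularity"]:
            best[row["time"]] = row
    return best


best = best_per_time(records)
for time in sorted(best):
    print(time, best[time]["method"], best[time]["modularity"])
```

Ties are resolved in favor of the first row encountered, which is an arbitrary choice in this sketch.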

If time windows are used, the resulting objects can be plotted as a time series using the new dna_plotModularity function. The plots look very similar to the plots returned by dna_timeWindow before, except it's now possible to separate analysis from plotting with these two functions and customize exactly what we want.

The dna_plotModularity function also has a pretty cool anomaly detection feature, based on the anomalize package, which highlights seasonality- and trend-corrected anomalies in the time series in red and then draws an anomaly-corrected smoothed time series curve.
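The general idea of flagging trend-corrected anomalies can be sketched as follows. This is a deliberately simplified stand-in, not what anomalize or dna_plotModularity actually does: it removes a rolling-median trend (no seasonality correction) and flags remainders outside an interquartile-range fence. The series, the window width, and the fence factor are assumptions chosen for the example.

```python
import statistics

# Made-up modularity time series with one obvious spike at index 5.
values = [0.30, 0.32, 0.31, 0.33, 0.32, 0.80,
          0.33, 0.34, 0.32, 0.35, 0.34, 0.33]


def rolling_median(series, half_width=2):
    """Centered rolling median; the window is truncated at both ends of the series."""
    trend = []
    for i in range(len(series)):
        window = series[max(0, i - half_width): i + half_width + 1]
        trend.append(statistics.median(window))
    return trend


def flag_anomalies(series, half_width=2, factor=3.0):
    """Return (indices of anomalies, trend) using an IQR fence on the detrended series."""
    trend = rolling_median(series, half_width)
    remainder = [v - t for v, t in zip(series, trend)]
    q1, _, q3 = statistics.quantiles(remainder, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    flagged = [i for i, r in enumerate(remainder) if r < lower or r > upper]
    return flagged, trend


anomalies, trend = flag_anomalies(values)

# "Anomaly-corrected" series: replace flagged points by the local trend value.
corrected = [trend[i] if i in anomalies else v for i, v in enumerate(values)]
print(anomalies)
```

A smoothed curve could then be fitted to the corrected series instead of the raw one, so that single spikes do not distort it.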

How do we go about updating the dna_cluster and dna_plotNetwork etc. functions?

leifeld commented 5 years ago

I should add that the primary advantage of the dna_multiclust function is that it can automatically select the best cluster method, similar to the nicely option for layouts, just for clusters. So I guess this should be the default way to cluster discourse networks in the other functions, and the user could switch to one of the other methods.
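As a rough sketch of what "automatically select the best" means here, the following plain-Python example scores a few candidate partitions of a toy network by Newman modularity and keeps the highest-scoring one. The graph, the candidate names, and the helper function are hypothetical and only illustrate the selection principle. They do not reproduce the dna_multiclust implementation, which works on DNA networks and real cluster algorithms.

```python
def modularity(edges, membership):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs without self-loops (assumed non-empty).
    membership: dict mapping each node to a cluster label.
    """
    m = len(edges)
    internal = {}  # number of edges that fall inside each cluster
    degree = {}    # sum of node degrees per cluster
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)


# Toy network: two triangles joined by a single bridge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

# Hypothetical candidate solutions, e.g. from different methods or values of k.
candidates = {
    "two_groups": {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2},
    "one_group": {n: 1 for n in "abcdef"},
    "three_pairs": {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3, "f": 3},
}

scores = {name: modularity(edges, part) for name, part in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

The user-facing default would then simply be the candidate with the highest score, with manual selection of method and k as an override.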

leifeld commented 5 years ago

I have taken a first stab at creating a new dna_dendrogram function in commit 3bf79944bf9c60cb3b5092036ec9ab2509ed3af1. This is a first step towards removing the relatively convoluted dna_cluster architecture, which tries to do everything at the same time.

Like the dna_multiclust function, dna_dendrogram has most of the customization options of the dna_network function, like excludeValues, variable1 etc. I think this is more user-friendly than having everything in the ... argument.

The function uses dna_multiclust internally, which means that it's now possible to automatically select the best cluster solution among several methods and k values using modularity. But it's also still possible to select the method and k manually. The downside of this approach is of course that analysis and visualization are no longer strictly separated. But I think the benefits outweigh the drawbacks.

One of the design principles for this function is that the user can set all sorts of colors separately and autonomously, such as leaves, labels, symbols, and rectangles, as well as symbol shapes. No more color replacement by ggplot2 using aes, which always caused us to lose the original colors coded in the DNA database. It's possible to supply a single color for all labels/leaves/symbols/rectangles or as many as there are clusters or as many as there are labels. It's also possible to select from the attributes and from the cluster memberships for automatic color selection. The drawback here is that the user needs to come up with nice colors. And the legends are gone.
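The rule for how many colors may be supplied can be sketched like this. The example is plain Python, and the function name, argument names, and precedence rule are assumptions made for illustration, not the dna_dendrogram code: a single color is recycled for all labels, a vector with one entry per cluster is mapped via the cluster memberships, and a vector with one entry per label is used as is.

```python
def resolve_colors(colors, labels, membership):
    """Expand a color specification to exactly one color per label.

    colors: a single color string, or a list with one entry per cluster
            or one entry per label.
    labels: the leaf labels in plotting order.
    membership: the cluster id of each label, in the same order as labels.
    """
    if isinstance(colors, str):
        return [colors] * len(labels)        # one color for everything
    if len(colors) == len(labels):
        return list(colors)                  # one color per label
    clusters = sorted(set(membership))
    if len(colors) == len(clusters):
        lookup = dict(zip(clusters, colors)) # one color per cluster
        return [lookup[c] for c in membership]
    raise ValueError("colors must have length 1, k (clusters), or n (labels)")


labels = ["Actor A", "Actor B", "Actor C", "Actor D"]
membership = [1, 1, 2, 2]

print(resolve_colors("black", labels, membership))
print(resolve_colors(["red", "blue"], labels, membership))
print(resolve_colors(["#111111", "#222222", "#333333", "#444444"], labels, membership))
```

If the number of clusters happens to equal the number of labels, this sketch treats the vector as per-label colors; a real implementation would need to document whichever precedence it chooses.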

@JBGruber: The function currently has the following issues. If you have some time, it would be great if you could advise on them:

  1. There are no legends. I don't know how to create them when identity colors are used all the time. And it's also not always sensible, e.g., when the user provides custom colors. It only makes sense if the colors are taken from the attributes or cluster memberships.
  2. coord_flip does not work properly anymore because axis ticks are added to the wrong axis and the x axis is colored rather than the y axis. No idea why. If no colors are used for the labels, I think it should work like before, though.
  3. The argument names are now internally consistent but may deviate from the naming conventions of the other functions. This needs to be checked for the other functions at some point when we revise them.

The help file contains some usage examples.