Open leifeld opened 5 years ago
@JBGruber I have added a new dna_multiclust function. It can serve as a replacement for both the dna_timeWindow and the dna_cluster functions.
It takes a DNA connection and many of the same arguments as dna_network and applies a range of different cluster methods to the network. It returns the cluster memberships, modularity, and optionally the original hclust (or similar) object.
We can now start replacing the cluster functionality in dna_plotNetwork, dna_cluster etc. by using this function internally. It is possible to switch each cluster method on or off separately and to set a specific k or leave k flexible. The return object contains everything we need.
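To make the idea concrete, here is a minimal base-R sketch of trying several cluster methods and k values and keeping the solution with the highest modularity. It is not the dna_multiclust implementation: the helper modularity_score, the methods switch vector, and the toy adjacency matrix are all assumptions made up for illustration.

```r
# Hypothetical helper: Newman modularity of a partition on a weighted,
# undirected adjacency matrix (not part of rDNA).
modularity_score <- function(adj, membership) {
  two_m <- sum(adj)                              # twice the total edge weight
  deg <- rowSums(adj)                            # weighted degree per node
  same <- outer(membership, membership, "==")    # TRUE where nodes share a cluster
  sum((adj - outer(deg, deg) / two_m) * same) / two_m
}

# Made-up toy network: two dense groups joined by one weak tie.
adj <- matrix(0, 6, 6, dimnames = list(LETTERS[1:6], LETTERS[1:6]))
adj[1:3, 1:3] <- 1
adj[4:6, 4:6] <- 1
adj[3, 4] <- adj[4, 3] <- 0.2
diag(adj) <- 0

# Turn similarities into distances for hierarchical clustering.
d <- as.dist(max(adj) - adj)

# Each method can be switched on or off; k is fixed or left flexible (NULL).
methods <- c(single = TRUE, average = TRUE, complete = TRUE, ward.D2 = FALSE)
k <- NULL
k_range <- if (is.null(k)) 2:4 else k

# One row per method/k combination with its modularity.
results <- do.call(rbind, lapply(names(methods)[methods], function(m) {
  tree <- hclust(d, method = m)
  do.call(rbind, lapply(k_range, function(kk) {
    mem <- cutree(tree, k = kk)
    data.frame(method = m, k = kk, modularity = modularity_score(adj, mem))
  }))
}))

# Keep the solution with the highest modularity (first one on ties).
best <- results[which.max(results$modularity), ]
print(results)
print(best)
```

Setting k to a single number restricts the search to that k, and setting an entry of methods to FALSE skips that method.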
The function also replaces the time window functionality. This is done by using the same timewindow and windowsize arguments as in the dna_network function. If these arguments are set, the cluster methods are applied to all time points, and the modularity, cluster memberships, and cluster objects are returned for all those time points (as data frames in long format in most cases). It also returns a data frame that indicates, for each time step, the cluster method that maximizes modularity and the corresponding modularity value.
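The time-window logic can be sketched in base R as follows. This only illustrates the shape of the output (a long data frame plus a per-time-step maximum), not the actual dna_multiclust code or its return object. The per-window networks, the dates, and the helpers modularity_score and make_net are made up for the example.

```r
# Hypothetical helper: Newman modularity for a weighted, undirected network.
modularity_score <- function(adj, membership) {
  two_m <- sum(adj)
  deg <- rowSums(adj)
  same <- outer(membership, membership, "==")
  sum((adj - outer(deg, deg) / two_m) * same) / two_m
}

# Made-up input: one adjacency matrix per time window. The two groups
# become more strongly connected to each other over time.
make_net <- function(bridge) {
  adj <- matrix(0, 6, 6)
  adj[1:3, 1:3] <- 1
  adj[4:6, 4:6] <- 1
  adj[1:3, 4:6] <- bridge
  adj[4:6, 1:3] <- bridge
  diag(adj) <- 0
  adj
}
windows <- list("2010-01-01" = make_net(0.1),
                "2010-02-01" = make_net(0.3),
                "2010-03-01" = make_net(0.5))

methods <- c("single", "average", "complete")

# Long format: one row per time point and cluster method.
long <- do.call(rbind, lapply(names(windows), function(day) {
  adj <- windows[[day]]
  d <- as.dist(max(adj) - adj)
  do.call(rbind, lapply(methods, function(m) {
    mem <- cutree(hclust(d, method = m), k = 2)
    data.frame(date = as.Date(day), method = m,
               modularity = modularity_score(adj, mem))
  }))
}))

# For each time step, keep the method with the highest modularity.
best <- do.call(rbind, lapply(split(long, long$date), function(x) {
  x[which.max(x$modularity), ]
}))
print(best)
```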
If time windows are used, the resulting objects can be plotted as a time series using the new dna_plotModularity function. The plots look very similar to the plots returned by dna_timeWindow before, except it's now possible to separate analysis from plotting with these two functions and customize exactly what we want.
The dna_plotModularity function also has a pretty cool anomaly detection feature, based on the anomalize package, which highlights seasonality- and trend-corrected anomalies in the time series in red and then draws an anomaly-corrected smoothed time series curve.
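For readers unfamiliar with this kind of anomaly detection, the following base-R sketch shows the general idea: decompose the series into seasonal, trend, and remainder components, flag unusually large remainders, and smooth the corrected series. It deliberately uses stats::stl and a simple interquartile-range rule as a stand-in, not the anomalize package API. The simulated modularity series, the period of 12, and the factor of 3 are assumptions.

```r
set.seed(1)

# Made-up modularity series: trend + seasonality + noise + one injected spike.
n <- 60
step <- seq_len(n)
modularity <- 0.3 + 0.002 * step + 0.05 * sin(2 * pi * step / 12) +
  rnorm(n, sd = 0.01)
modularity[30] <- modularity[30] + 0.3

# Seasonal-trend decomposition; robust = TRUE down-weights outliers.
fit <- stl(ts(modularity, frequency = 12), s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Flag remainders far outside the interquartile range (assumed factor: 3).
q <- unname(quantile(remainder, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_anomaly <- remainder < q[1] - 3 * iqr | remainder > q[2] + 3 * iqr

# Replace flagged points by seasonal + trend, then smooth the corrected series.
corrected <- modularity
corrected[is_anomaly] <- (modularity - remainder)[is_anomaly]
smoothed <- lowess(step, corrected, f = 0.3)$y

print(which(is_anomaly))
```

In a plot, the flagged points would be drawn in red and smoothed would be drawn as the anomaly-corrected curve.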
How do we go about updating the dna_cluster and dna_plotNetwork etc. functions?
I should add that the primary advantage of the dna_multiclust function is that it can automatically select the best cluster method. This is similar to the nicely option for layouts, just for clusters. So I guess this should be the default way to cluster discourse networks in the other functions, and the user could change that to one of the other methods.
I have taken a first stab at creating a new dna_dendrogram function in commit 3bf79944bf9c60cb3b5092036ec9ab2509ed3af1. This is a first step towards removing the relatively convoluted dna_cluster architecture, which tries to do everything at the same time.
Like the dna_multiclust function, dna_dendrogram has most of the customization options of the dna_network function, like excludeValues, variable1 etc. I think this is more user-friendly than having everything in the ... argument.
The function uses dna_multiclust internally, which means that it's now possible to automatically select the best cluster solution among several methods and k values using modularity. But it's also still possible to select the method and k manually. The downside of this approach is of course that analysis and visualization are no longer strictly separated. But I think the benefits outweigh the drawbacks.
One of the design principles for this function is that the user can set all sorts of colors separately and independently, such as for leaves, labels, symbols, and rectangles, as well as symbol shapes. No more color replacement by ggplot2 using aes, which always caused us to lose the original colors coded in the DNA database. It's possible to supply a single color for all labels/leaves/symbols/rectangles, or as many colors as there are clusters, or as many as there are labels. It's also possible to select from the attributes and from the cluster memberships for automatic color selection. The drawback here is that the user needs to come up with nice colors. And the legends are gone.
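The three accepted color lengths could be resolved along the following lines. This is only a sketch of the rule described above, not the code in dna_dendrogram; the helper name resolve_colors and the example membership vector are hypothetical.

```r
# Hypothetical helper: expand a user-supplied color vector to one color per
# label, accepting length 1, one per label, or one per cluster.
resolve_colors <- function(colors, membership) {
  n_labels <- length(membership)
  clusters <- sort(unique(membership))
  if (length(colors) == 1) {
    rep(colors, n_labels)                    # same color everywhere
  } else if (length(colors) == n_labels) {
    colors                                   # one color per label, used as is
  } else if (length(colors) == length(clusters)) {
    colors[match(membership, clusters)]      # one color per cluster
  } else {
    stop("'colors' must have length 1, the number of clusters, ",
         "or the number of labels.")
  }
}

membership <- c(1, 1, 2, 2, 2)               # made-up cluster memberships
resolve_colors("black", membership)
resolve_colors(c("red", "blue"), membership)
resolve_colors(c("#111111", "#222222", "#333333", "#444444", "#555555"),
               membership)
```

Because the resolved colors are plain values rather than aesthetics mapped by ggplot2, the original colors stay exactly as supplied.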
@JBGruber: The function currently has the following issues. If you have some time, it would be great if you could advise on them:
coord_flip does not work properly anymore because axis ticks are added to the wrong axis and the x axis is colored rather than the y axis. No idea why. If no colors are used for the labels, I think it should work like before, though.
The help file contains some usage examples.
dna_cluster does many things at once and is not very user-friendly. The different things should probably be accommodated in separate functions. We should think about a clever way to design the overall user interface for these functionalities, also in relation to the other plotting functions.