Open dylanbeaudette opened 5 years ago
I have no idea if this is valid, but it seems reasonable:
# using `d` from above...
cor(data.frame(lapply(d, as.integer)), use = 'complete.obs', method='spearman')
This would treat all factors as ordered factors, which may not be a valid assumption. Then again, correlation is somewhat arbitrary when considering nominal data in the absence of class-wise similarity.
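A quick way to see why the ordering assumption matters: the integer codes come from the factor level order, so relabeling the levels changes the rank correlation even though the data are identical. This is only a sketch with made-up data; `x`, `x2`, and `y` are illustrative names, not objects from the thread.

```r
set.seed(1)

# nominal variable: the levels have no natural order
x <- factor(sample(c('a', 'b', 'c', 'd'), size = 200, replace = TRUE))

# second factor, constructed so that it depends on x
y <- factor(ifelse(x %in% c('a', 'd'), 'A', 'B'))

# default (alphabetical) level order: a=1, b=2, c=3, d=4
r1 <- cor(as.integer(x), as.integer(y), method = 'spearman')

# same observations, different level order: a=1, d=2, b=3, c=4
x2 <- factor(x, levels = c('a', 'd', 'b', 'c'))
r2 <- cor(as.integer(x2), as.integer(y), method = 'spearman')

# the two coefficients differ, although x and x2 hold the same categories
c(alphabetical = r1, reordered = r2)
```

The association between `x` and `y` is the same in both cases; only the arbitrary coding changed.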
And of course, a little research yields a solution in R: the GoodmanKruskal package.
library(GoodmanKruskal)
d <- data.frame(
x=sample(letters[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15)),
y=sample(LETTERS[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25)),
z=sample(LETTERS[21:24], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25))
)
d$x <- factor(d$x)
d$y <- factor(d$y)
d$z <- factor(d$z)
(res <- GKtauDataframe(d))
plot(res)
A fine "correlation matrix", pending some further reading of course.
      x     y     z
x 4.000 0.018 0.018
y 0.020 4.000 0.021
z 0.013 0.019 4.000
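To make the off-diagonal values easier to read, here is a minimal base-R sketch of the Goodman-Kruskal tau statistic, written from its textbook definition as the proportional reduction in Gini variation of `y` once `x` is known. `gk_tau` is a hypothetical helper, not a function from the GoodmanKruskal package. As far as I can tell from the package documentation, the diagonal of the `GKtauDataframe` output reports the number of levels of each variable (hence the 4s) rather than an association value.

```r
# hypothetical helper: tau for predicting y from x
gk_tau <- function(x, y) {
  # factor() drops unused levels, avoiding 0/0 in empty rows
  p <- prop.table(table(factor(x), factor(y)))  # joint proportions p_ij
  px <- rowSums(p)                              # marginal proportions of x
  py <- colSums(p)                              # marginal proportions of y
  v_y <- 1 - sum(py^2)                          # Gini variation of y
  v_y_given_x <- 1 - sum(p^2 / px)              # expected variation of y within levels of x
  (v_y - v_y_given_x) / v_y
}

set.seed(1)
x <- sample(letters[1:4], size = 200, replace = TRUE)
# y is fully determined by x, but x is not determined by y
y <- ifelse(x %in% c('a', 'b'), 'A', 'B')

tau_xy <- gk_tau(x, y)  # how well x predicts y
tau_yx <- gk_tau(y, x)  # how well y predicts x
c(x_to_y = tau_xy, y_to_x = tau_yx)
```

The two directions give different values, which is the asymmetry that shows up as a non-symmetric matrix.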
One final test: categorical association before / after cLHS. Note that there is a dummy continuous variable in there to avoid the error associated with cor.
Before / after are pretty close. This is a contrived example, so the absence of association may not be a good diagnostic.
library(clhs)
library(GoodmanKruskal)
d <- data.frame(
x=sample(letters[1:4], size = 1000, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15)),
y=sample(LETTERS[1:4], size = 1000, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25)),
z=rnorm(n = 1000)
)
d$x <- factor(d$x)
d$y <- factor(d$y)
res <- clhs(d, size=100, simple = TRUE)
# source
GKtauDataframe(d[, c('x', 'y')])
# cLHS
GKtauDataframe(d[res, c('x', 'y')])
A final thought. Pair-wise Cramer's V may be more appropriate given that the correlation matrix is symmetric. The asymmetric nature of the Goodman-Kruskal tau statistic may be harder to interpret alongside a traditional correlation matrix developed from continuous values. The GoodmanKruskal package has some interesting ideas on how to develop a correlation matrix from a mixture of categorical and continuous variables.
I think that clhs should function without continuous variables. Currently an error is encountered when attempting to compute correlation. Adding a condition for no continuous data would help, but correlation would still need to be computed (I think).
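A rough sketch of what such a condition could look like, in isolation. I have not checked how clhs organizes this step internally, so `continuous_cor` is a hypothetical stand-alone helper and not a patch against the package: it returns `NULL` when there are no numeric columns, leaving the caller to decide how to handle the factor-only case.

```r
# hypothetical helper, not part of clhs
continuous_cor <- function(df) {
  # flag the continuous (numeric) columns
  is_num <- vapply(df, is.numeric, logical(1))

  # no continuous data: skip cor() rather than fail
  if (!any(is_num)) {
    return(NULL)
  }

  cor(df[, is_num, drop = FALSE], use = 'complete.obs')
}

set.seed(1)
only_factors <- data.frame(
  x = factor(sample(letters[1:4], size = 50, replace = TRUE)),
  y = factor(sample(LETTERS[1:4], size = 50, replace = TRUE))
)
mixed <- cbind(only_factors, z = rnorm(50), w = rnorm(50))

continuous_cor(only_factors)  # NULL, no error
continuous_cor(mixed)         # 2 x 2 correlation matrix for z and w
```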
vcd::assocstats() could be used to compute correlation from a cross-tabulation of all factors. I don't know how to adapt or interpret Cramer's V in the context of more than 2 factors. This post has some great ideas on efficient calculation of all pair-wise V.
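For reference, a minimal base-R sketch of a pair-wise Cramer's V matrix, using the standard definition V = sqrt(chi-squared / (n * (min(r, c) - 1))) with the uncorrected chi-squared statistic from `chisq.test()`. `cramers_v` and `pairwise_v` are hypothetical helpers (not from vcd or the linked post), and the loop is the simple version, not an efficient one. It assumes every factor has at least two observed levels.

```r
# hypothetical helper: Cramer's V for two categorical vectors
cramers_v <- function(x, y) {
  tab <- table(factor(x), factor(y))  # factor() drops unused levels
  # uncorrected chi-squared; warnings about small expected counts are suppressed
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k <- min(dim(tab)) - 1
  as.numeric(sqrt(chi2 / (sum(tab) * k)))
}

# hypothetical helper: symmetric matrix of V over all pairs of columns
pairwise_v <- function(df) {
  nm <- names(df)
  m <- diag(length(nm))  # V of a variable with itself is 1
  dimnames(m) <- list(nm, nm)
  if (length(nm) > 1) {
    for (ij in combn(length(nm), 2, simplify = FALSE)) {
      v <- cramers_v(df[[ij[1]]], df[[ij[2]]])
      m[ij[1], ij[2]] <- m[ij[2], ij[1]] <- v
    }
  }
  m
}

set.seed(1)
d <- data.frame(
  x = factor(sample(letters[1:4], size = 100, replace = TRUE)),
  y = factor(sample(LETTERS[1:4], size = 100, replace = TRUE)),
  z = factor(sample(LETTERS[21:24], size = 100, replace = TRUE))
)

V <- pairwise_v(d)
round(V, 3)
```

Because V is symmetric and bounded by 0 and 1, the result can sit next to a conventional correlation matrix more comfortably than the tau matrix, though it only describes pairs of factors.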