error when input data are only factors

dylanbeaudette commented 5 years ago

I think that clhs should function without continuous variables. Currently, an error is raised when it attempts to compute the correlation matrix.

library(clhs)

d <- data.frame(
  x=sample(letters[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15)),
  y=sample(LETTERS[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25))
)

d$x <- factor(d$x)
d$y <- factor(d$y)

# error
res <- clhs(d, size=10, simple = FALSE)

Error in cor(data_continuous, use = "complete.obs") : 
  no complete element pairs

Adding a condition for no continuous data would help, but correlation would still need to be computed (I think). vcd::assocstats() could be used to compute an association measure from a cross-tabulation of a pair of factors. I don't know how to adapt or interpret Cramer's V in the context of more than 2 factors.

library(vcd)

d <- data.frame(
  x=sample(letters[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15)),
  y=sample(LETTERS[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25)),
  z=sample(LETTERS[21:24], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25))
)

d$x <- factor(d$x)
d$y <- factor(d$y)
d$z <- factor(d$z)

# single pair-wise `V`
tab <- table(d$x, d$y)
assocstats(tab)

This post has some great ideas on efficient calculation of all pair-wise V.
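As a rough sketch of the "all pair-wise V" idea, the base R code below builds a symmetric matrix of Cramer's V values from the chi-squared statistic of each two-way table. The helper name `cramers_v_matrix` is hypothetical (it is not part of clhs or vcd), and the uncorrected chi-squared statistic is an assumption on my part; treat it as an illustration rather than a vetted implementation.

```r
# Example data: three unrelated factors, as in the snippets above
set.seed(1)
d <- data.frame(
  x = factor(sample(letters[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15))),
  y = factor(sample(LETTERS[1:4], size = 100, replace = TRUE)),
  z = factor(sample(LETTERS[21:24], size = 100, replace = TRUE))
)

# Hypothetical helper: symmetric matrix of pair-wise Cramer's V
cramers_v_matrix <- function(df) {
  stopifnot(all(vapply(df, is.factor, logical(1))))
  p <- ncol(df)
  V <- diag(1, p)
  dimnames(V) <- list(names(df), names(df))
  if (p < 2) return(V)
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      tab <- table(df[[i]], df[[j]])
      # drop unused levels so they do not inflate the table dimensions
      tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
      k <- min(dim(tab)) - 1
      if (k < 1) {
        # V is undefined when one variable has a single observed level
        V[i, j] <- V[j, i] <- NA_real_
        next
      }
      # small expected counts trigger a warning that is not relevant here
      chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
      V[i, j] <- V[j, i] <- sqrt(as.numeric(chi2) / (sum(tab) * k))
    }
  }
  V
}

round(cramers_v_matrix(d), 3)
```

Because V is symmetric and bounded by 0 and 1, the result has the same shape as an ordinary correlation matrix, although it carries no sign.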

dylanbeaudette commented 5 years ago

I have no idea if this is valid, but it seems reasonable:

convert factors to integer representation
compute Spearman rank correlation from integer codes

# using `d` from above...
cor(data.frame(lapply(d, as.integer)), use = 'complete.obs', method='spearman')

This would treat all factors as ordered factors, which may not be a valid assumption. Then again, correlation is somewhat arbitrary when considering nominal data in the absence of class-wise similarity.
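To make the caveat about ordering concrete, here is a small constructed example (the data are made up for illustration). Both versions of `x` hold identical values and are perfectly associated with `y`; only the level order differs, yet the Spearman correlation of the integer codes changes completely.

```r
# Two perfectly associated nominal variables
x <- factor(rep(c("a", "b", "c", "d"), each = 5))
y <- factor(rep(c("A", "B", "C", "D"), each = 5))

# default (alphabetical) level order: the integer codes line up
r1 <- cor(as.integer(x), as.integer(y), method = "spearman")

# same values, different (equally arbitrary) level order
x2 <- factor(x, levels = c("b", "d", "a", "c"))
r2 <- cor(as.integer(x2), as.integer(y), method = "spearman")

c(alphabetical = r1, reordered = r2)
```

The first value is 1 and the second is 0 (up to floating-point error), so the result depends on the level order rather than on the association itself.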

dylanbeaudette commented 5 years ago

And of course, a little research yields a solution in R: the GoodmanKruskal package.


library(GoodmanKruskal)

d <- data.frame(
  x=sample(letters[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15)),
  y=sample(LETTERS[1:4], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25)),
  z=sample(LETTERS[21:24], size = 100, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25))
)

d$x <- factor(d$x)
d$y <- factor(d$y)
d$z <- factor(d$z)

(res <- GKtauDataframe(d))
plot(res)

A fine "correlation matrix", pending some further reading of course. Note that the diagonal holds the number of levels of each variable, not an association value.

      x     y     z
x 4.000 0.018 0.018
y 0.020 4.000 0.021
z 0.013 0.019 4.000

dylanbeaudette commented 5 years ago

One final test: categorical association before / after cLHS. Note that there is a dummy continuous variable in there to avoid the error associated with cor.

Before / after are pretty close. This is a contrived example, so the absence of association may not be a good diagnostic.

library(clhs)
library(GoodmanKruskal)

d <- data.frame(
  x=sample(letters[1:4], size = 1000, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15)),
  y=sample(LETTERS[1:4], size = 1000, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25)),
  z=rnorm(n = 1000)
  )

d$x <- factor(d$x)
d$y <- factor(d$y)

res <- clhs(d, size=100, simple = TRUE)

# source
GKtauDataframe(d[, c('x', 'y')])

# cLHS
GKtauDataframe(d[res, c('x', 'y')])
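As a complement to the association matrices, it may also be worth comparing the level proportions of each factor in the source data and in the sample. The sketch below uses base R only; `compare_props` is a hypothetical helper, and `idx` is a plain random sample standing in for the indices that `clhs(d, size = 100, simple = TRUE)` would return, so that the snippet runs without clhs installed.

```r
set.seed(42)
d <- data.frame(
  x = factor(sample(letters[1:4], size = 1000, replace = TRUE, prob = c(0.25, 0.25, 0.05, 0.15))),
  y = factor(sample(LETTERS[1:4], size = 1000, replace = TRUE))
)

# ASSUMPTION: stand-in for the row indices returned by clhs(..., simple = TRUE)
idx <- sample(nrow(d), 100)

# proportion of observations in each level of a factor (unused levels kept)
level_props <- function(f) {
  tab <- table(f)
  setNames(as.vector(tab) / sum(tab), names(tab))
}

# Hypothetical helper: source vs. sample proportions for one factor
compare_props <- function(full, idx, var) {
  rbind(
    source = level_props(full[[var]]),
    sample = level_props(full[[var]][idx])
  )
}

round(compare_props(d, idx, "x"), 3)
round(compare_props(d, idx, "y"), 3)
```

Replacing `idx` with the actual cLHS result would show how closely the sampled proportions track the source ones for each factor.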

dylanbeaudette commented 5 years ago

A final thought. Pair-wise Cramer's V may be more appropriate given that the correlation matrix is symmetric. The asymmetric nature of the Goodman-Kruskal tau statistic may be harder to interpret alongside a traditional correlation matrix developed from continuous values. The GoodmanKruskal package has some interesting ideas on how to develop a correlation matrix from a mixture of categorical and continuous variables.
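The asymmetry is easy to see with a small constructed example. The function below is my own base R sketch of Goodman-Kruskal tau (proportional reduction in Gini variation of `y` given `x`), not the GoodmanKruskal package code, and the data are made up: `y` is a coarser grouping of `x`, so `x` predicts `y` perfectly but not the other way around.

```r
# Sketch of Goodman-Kruskal tau for predicting y from x (assumed formulation:
# proportional reduction in Gini variation of y once x is known)
gk_tau <- function(x, y) {
  p  <- prop.table(table(x, y))   # joint proportions
  px <- rowSums(p)                # marginal proportions of x
  py <- colSums(p)                # marginal proportions of y
  keep <- px > 0                  # ignore unused levels of x
  v_y   <- 1 - sum(py^2)          # variation in y
  # expected variation in y within levels of x
  # (rows of p are divided by the matching element of px)
  v_y_x <- 1 - sum(p[keep, , drop = FALSE]^2 / px[keep])
  (v_y - v_y_x) / v_y
}

x <- factor(rep(c("a", "b", "c", "d"), each = 5))
y <- factor(ifelse(x %in% c("a", "b"), "A", "B"))

c(x_to_y = gk_tau(x, y), y_to_x = gk_tau(y, x))
```

Here tau from `x` to `y` is 1, while tau from `y` to `x` is 1/3. A symmetric measure such as Cramer's V reports a single value for the pair, which is what makes it easier to slot into a conventional correlation matrix.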

pierreroudier / clhs

error when input data are only factors #13