juliasilge / widyr

Widen, process, and re-tidy a dataset
http://juliasilge.github.io/widyr/
Other
327 stars 29 forks source link

New feature: Multi Dimensional Scaling #18

Open kanishkamisra opened 6 years ago

kanishkamisra commented 6 years ago

Since widyr has two distance based functions, namely pairwise_delta and pairwise_dist, I was wondering if we can take it a step further and add functionality for the data analyst/scientist to scale these pairwise distances to k dimensions. This is done in base R using cmdscale that takes a distance matrix and returns a matrix with each item as a row and each of the k dimensions as columns. The points are calculated such that the distance is maintained.

Example:

euro_mat <- as.matrix(eurodist)[1:5, 1:5]

euro_mat
#>           Athens Barcelona Brussels Calais Cherbourg
#> Athens         0      3313     2963   3175      3339
#> Barcelona   3313         0     1318   1326      1294
#> Brussels    2963      1318        0    204       583
#> Calais      3175      1326      204      0       460
#> Cherbourg   3339      1294      583    460         0

cmdscale(euro_mat)
#>                [,1]       [,2]
#> Athens    2515.1944   14.68566
#> Barcelona -659.3761  961.30504
#> Brussels  -422.4595 -328.35478
#> Calais    -633.6209 -359.57342
#> Cherbourg -799.7379 -288.06249

In the tidy, widy way, I figured the following way to do this when we have a distance based result from a widyr method:

library(tidytext)
library(janeaustenr)
library(widyr)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
#> ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

austen_dist <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word) %>%
  pairwise_dist(book, word, n)

## From utils.R in widyr package (private method), didnt want to use :::
col_name <- function(x, default = stop("Please supply column name", call. = FALSE))
{
  if (is.character(x))
    return(x)
  if (identical(x, quote(expr = )))
    return(default)
  if (is.name(x))
    return(as.character(x))
  if (is.null(x))
    return(x)
  stop("Invalid column specification", call. = FALSE)
}

multi_scale <- function(tbl, item1, item2, value, k = 2, ...) {
  multi_scale_(tbl,
               col_name(substitute(item1)),
               col_name(substitute(item2)),
               col_name(substitute(value)),
               k = 2, ...)
}

multi_scale_ <- function(tbl, item1, item2, value, k = 2, ...) {
  tbl_matrix <- tbl %>%
    spread(item2, col_name(value), fill = 0) %>%
    remove_rownames() %>%
    column_to_rownames("item1") %>%
    as.matrix()

  cmdscale(tbl_matrix, k = k) %>%
    as.data.frame() %>%
    rownames_to_column("item") %>%
    as.tibble()
}

austen_dist %>%
  multi_scale(item1, item2, distance)
#> Warning: Setting row names on a tibble is deprecated.
#> # A tibble: 6 x 3
#>   item                     V1      V2
#>   <chr>                 <dbl>   <dbl>
#> 1 Sense & Sensibility   211.    302. 
#> 2 Pride & Prejudice      36.7   -64.8
#> 3 Mansfield Park      -3942.   1325. 
#> 4 Emma                -3513.  -1525. 
#> 5 Northanger Abbey     3924.   -115. 
#> 6 Persuasion           3284.     77.5

This can be used for something like this:

austen_dist %>%
  multi_scale(item1, item2, distance) %>%
  ggplot(aes(V1, V2, color = item)) +
  geom_point(size = 3, alpha = 0.8) + 
  scale_y_continuous(limits = c(-4000, 4000)) + 
  theme_bw()

multi_scale

Now, this method returns a warning because of column_to_rownames() that I would like to get rid of. I'm searching for one but wanted to know your thoughts before I submit a PR.

kanishkamisra commented 6 years ago

Adding a as.data.frame() before converting column to row names solves the warning problem!

kanishkamisra commented 6 years ago

Another problem is that if the distance tbl is formed using upper = FALSE, this will return wrong results..