jasongraf1 / VADIS

Package for the Variation-Based Distance & Similarity Modeling (VADIS) method
GNU General Public License v3.0

Variable ranking should be run across columns? #1

Closed montesmariana closed 2 years ago

montesmariana commented 2 years ago

Hi Jason! I was looking into some results from this package and I found something weird in the rank table output of the third VADIS line. It seems to be ranking along the wrong dimension (or I totally missed the point of the function). Luckily the rank table is only used for illustration and doesn't affect the results, but it is still useful for interpretation, and the current output can be confusing...

The line of code I'm talking about is this: https://github.com/jasongraf1/VADIS/blob/33ce6edf5a5a327a6a69c56662be6cacdd8277f8/R/vadis_line3.R#L49

Here is a reprex showing the output and how I think the line should be changed:

library(VADIS)
library(ranger)

data("particle_verbs_short")
data_list <- split(particle_verbs_short, particle_verbs_short$Variety, drop = TRUE)

fmla <- Response ~ DirObjWordLength + DirObjDefiniteness + DirObjGivenness + DirObjConcreteness + DirObjThematicity + DirectionalPP + PrimeType + Semantics + Surprisal.P + Surprisal.V + Register

rf_func <- function(x) ranger(fmla, data = x, importance = "permutation")

rf_list <- lapply(data_list, rf_func)
names(rf_list) <- names(data_list)

line3 <- vadis_line3(rf_list, path = FALSE)
line3$rank.table # rank seems to be rowwise instead of columnwise
#>                    CA GB HK IE IN JA NZ PH SG
#> DirObjWordLength    2  3  7  5  9  6  1  4  8
#> DirObjDefiniteness  6  2  8  3  4  9  1  7  5
#> DirObjGivenness     6  4  1  3  9  5  7  2  8
#> DirObjConcreteness  5  2  8  1  4  9  3  6  7
#> DirObjThematicity   1  3  7  9  8  5  2  4  6
#> DirectionalPP       1  3  5  6  8  4  2  9  7
#> PrimeType           7  6  2  1  4  3  5  8  9
#> Semantics           1  3  5  4  9  8  2  6  7
#> Surprisal.P         4  3  5  2  9  6  1  8  7
#> Surprisal.V         4  7  3  2  8  6  1  9  5
#> Register            4  2  8  1  5  9  3  6  7

# Getting raw variable importances
raw_tab <- line3$varimp.table

# Line 49 of vadis_line3.R
# Varieties are ranked within predictors
# - The maximum number is the number of varieties
# - Numbers are repeated in the columns and not in the rows
t(as.data.frame(apply(raw_tab, 1, function(x) rank(-x))))
#>                    CA GB HK IE IN JA NZ PH SG
#> DirObjWordLength    2  3  7  5  9  6  1  4  8
#> DirObjDefiniteness  6  2  8  3  4  9  1  7  5
#> DirObjGivenness     6  4  1  3  9  5  7  2  8
#> DirObjConcreteness  5  2  8  1  4  9  3  6  7
#> DirObjThematicity   1  3  7  9  8  5  2  4  6
#> DirectionalPP       1  3  5  6  8  4  2  9  7
#> PrimeType           7  6  2  1  4  3  5  8  9
#> Semantics           1  3  5  4  9  8  2  6  7
#> Surprisal.P         4  3  5  2  9  6  1  8  7
#> Surprisal.V         4  7  3  2  8  6  1  9  5
#> Register            4  2  8  1  5  9  3  6  7

# Predictors are ranked within varieties
# - The maximum number is the number of predictors
# - Numbers are repeated within rows, not within columns
apply(raw_tab, 2, function(x) rank(-x))
#>                    CA GB HK IE IN JA NZ PH SG
#> DirObjWordLength    3  4  4  5  9  4  4  2  6
#> DirObjDefiniteness  9  9 11  7  6 11  9  9  9
#> DirObjGivenness    10 10  8 10 11 10 11  8 10
#> DirObjConcreteness  8  8  9  6  4  8  8  7  8
#> DirObjThematicity   6  6  7 11  7  5  6  5  5
#> DirectionalPP       7  7  5  8  8  3  7 10  7
#> PrimeType          11 11 10  9 10  9 10 11 11
#> Semantics           2  3  3  4  5  6  2  4  3
#> Surprisal.P         1  1  1  1  1  1  1  1  1
#> Surprisal.V         5  5  2  3  2  2  3  3  2
#> Register            4  2  6  2  3  7  5  6  4

Created on 2022-06-14 by the reprex package (v2.0.1)
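For anyone without the package installed, the same margin mix-up can be reproduced on a small made-up matrix. This is only a sketch: the values in `toy` and the names `by_row` and `by_col` are invented for illustration and are not taken from `particle_verbs_short` or from the VADIS source.

```r
# Made-up importance values: rows are predictors, columns are varieties
toy <- matrix(
  c(0.30, 0.10, 0.20,   # column V1
    0.05, 0.40, 0.15),  # column V2
  nrow = 3,
  dimnames = list(c("A", "B", "C"), c("V1", "V2"))
)

# MARGIN = 1 ranks across each row, so varieties are ranked within a predictor.
# apply() returns the per-row results as columns, hence the transpose.
by_row <- t(apply(toy, 1, function(x) rank(-x)))

# MARGIN = 2 ranks down each column, so predictors are ranked within a variety.
by_col <- apply(toy, 2, function(x) rank(-x))

max(by_row)  # 2, the number of varieties
max(by_col)  # 3, the number of predictors
```

With `MARGIN = 2` no transpose is needed, because `apply()` already returns one column per variety.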

Thank you very much, great work!

jasongraf1 commented 2 years ago

Thanks, Mariana! I've fixed it now and vadis_line3(...)$rank.table should show the correct rankings. Fortunately, this is only a minor error and doesn't affect the distance/correlation matrix calculations!
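A quick way to confirm the fix on your own output is to check that every column of the rank table is a permutation of 1 to the number of predictors. The helper below is a hypothetical sketch, not part of VADIS, and it assumes there are no tied importances (with ties, `rank()` returns averaged ranks and the check would fail).

```r
# Hypothetical helper (not part of VADIS): TRUE if every column of `tab`
# contains each rank 1..nrow(tab) exactly once, i.e. predictors are ranked
# within varieties. Assumes no ties in the underlying importances.
is_ranked_within_columns <- function(tab) {
  tab <- as.matrix(tab)
  expected <- as.numeric(seq_len(nrow(tab)))
  all(apply(tab, 2, function(x) identical(sort(as.numeric(x)), expected)))
}

# Hand-typed rank tables for illustration only
good <- cbind(CA = c(3, 1, 2), GB = c(1, 2, 3))  # each column is 1..3
bad  <- cbind(CA = c(1, 2, 1), GB = c(2, 1, 2))  # ranks only go up to 2

is_ranked_within_columns(good)  # TRUE
is_ranked_within_columns(bad)   # FALSE
```

With the package loaded, the same call would be `is_ranked_within_columns(line3$rank.table)`, using the `line3` object from the reprex above.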