hturner / PlackettLuce

PlackettLuce package for Plackett-Luce models in R
https://hturner.github.io/PlackettLuce/

Dealing with weak networks #50

Open kauedesousa opened 3 years ago

kauedesousa commented 3 years ago

Dear Heather,

Here is an issue that may be related to issue https://github.com/hturner/PlackettLuce/issues/25. I think we now have a better idea of where the problem is: it arises mostly when we perform cross-validation and pltree() is exposed to a set of data with a weak network.

Here is an example

library("PlackettLuce")
source("https://raw.githubusercontent.com/AgrDataSci/ClimMob-analysis/master/R/functions.R")

R <- matrix(c(1, 2, 0, 0, 3,
              4, 1, 0, 0, 2,
              2, 1, 0, 0, 3,
              1, 2, 0, 4, 3,
              2, 1, 0, 3, 4,
              4, 1, 0, 0, 2,
              2, 1, 0, 0, 3,
              1, 2, 0, 1, 3,
              2, 0, 0, 0, 1,
              0, 0, 0, 1, 2), nrow = 10, byrow = TRUE)

colnames(R) <- c("apple", "banana", "orange", "pear", "grape")

R <- as.rankings(R)

# drop rows 9 and 10, supposing that they belong to a different fold in a
# cross-validation
R <- R[-c(9:10), ]

G <- group(R, index = 1:length(R))
p <- data.frame(p = rep(1, length(G)))
dt <- cbind(G, p)

pl <- pltree(G ~ p, data = dt)

# these calls fail, as described in issue #25
predict(pl, newdata = dt)
AIC(pl, newdata = dt)

# but works with vcov = FALSE for predict()
predict(pl, newdata = dt, vcov = FALSE)

# but AIC() still fails
AIC(pl, newdata = dt, vcov = FALSE)

# this is because orange dropped out of the network when we sampled the folds
a <- adjacency(R)

plot(network(a))

# the issue still persists even if we increase npseudo 
pl2 <- pltree(G ~ p, data = dt, npseudo = 0.8)
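
As a quick check, a base-R sketch like the one below lists the items that are never ranked in the training fold. It works on the plain rank matrix rather than on the rankings object, and it assumes that 0 means "not ranked", as in the example above. The names R_mat, train and unranked are only illustrative.

```r
# Same rank matrix as in the example above (0 = item not ranked).
R_mat <- matrix(c(1, 2, 0, 0, 3,
                  4, 1, 0, 0, 2,
                  2, 1, 0, 0, 3,
                  1, 2, 0, 4, 3,
                  2, 1, 0, 3, 4,
                  4, 1, 0, 0, 2,
                  2, 1, 0, 0, 3,
                  1, 2, 0, 1, 3,
                  2, 0, 0, 0, 1,
                  0, 0, 0, 1, 2), nrow = 10, byrow = TRUE)
colnames(R_mat) <- c("apple", "banana", "orange", "pear", "grape")

# Drop rows 9 and 10, as if they belonged to a different fold.
train <- R_mat[-c(9:10), ]

# An item whose column is all zeros never appears in the training fold.
unranked <- colnames(train)[colSums(train != 0) == 0]
print(unranked)
```

For this toy data the check flags orange, which is never ranked in any row.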

The question is, do you think that this problem can be solved with npseudo (eventually) or should we deal with it by passing vcov = FALSE to the predict() method?

Thanks in advance

hturner commented 3 years ago

Thanks for digging down to find the cause of this issue.

The addition of pseudo rankings allows the worth to be estimated, but these pseudo rankings are removed before estimating the variance-covariance matrix. If an item is then completely missing from the rankings, this leads to zero rows and columns in the information matrix, which makes it non-invertible, so the variance can't be estimated. I am not sure what the appropriate fix should be here but will follow this up (it may be a few months before I get to it, as I am prioritising work on PLADMM in May/June).
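
To illustrate the non-invertibility in isolation, here is a minimal base-R sketch with a made-up 3 x 3 matrix standing in for the information matrix. The numbers and the item labels A, B and C are hypothetical and this is not the package's internal code. It only shows that a zero row and column make solve() fail, while the sub-matrix for the remaining items can still be inverted.

```r
# Toy stand-in for an information matrix; item C has a zero row and column,
# as would happen if C never appeared in the rankings (hypothetical values).
info <- matrix(c( 2, -1, 0,
                 -1,  2, 0,
                  0,  0, 0), nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Inverting the full matrix fails because it is singular;
# tryCatch() turns the error into NULL so the script keeps running.
inv_full <- tryCatch(solve(info), error = function(e) NULL)
print(is.null(inv_full))

# Restricting to items with a non-zero row gives an invertible block.
keep <- rowSums(abs(info)) > 0
inv_sub <- solve(info[keep, keep])
print(inv_sub)
```

Dropping the missing item here is only a way to show where the singularity comes from, not a suggested fix for the package.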

AIC.pltree() doesn't need to compute the variance-covariance matrix; the error came from a call to itempar(), which defaults to vcov = TRUE. I have replaced this call and made a PR to the master branch; once that's merged in, AIC(pl, newdata = dt) should work if you install the package from GitHub. However, as newdata is actually the original data used in the fit here, it would be better to simply call AIC(pl), which avoids even more unnecessary computation and should work with the current PlackettLuce release (0.4.0). (This also goes for the call to predict: better not to specify newdata unless you are specifying data that is different from the data used in the fit!)