mhahsler / recommenderlab

recommenderlab - Lab for Developing and Testing Recommender Algorithms - R package

UBCF not handling properly training data with NAs #34

Closed aliko-str closed 4 years ago

aliko-str commented 5 years ago

The issues are with these lines:

sum_s_uk <- colSums(s_uk, na.rm=TRUE)
## calculate the weighted sum
r_a_norms <- sapply(1:nrow(newdata), FUN=function(i) {
  ## neighbors ratings of active user i
  r_neighbors <- as(model$data[neighbors[,i]], "dgCMatrix")
  drop(as(crossprod(r_neighbors, s_uk[,i]), "matrix"))
})
ratings <- t(r_a_norms)/sum_s_uk

If the training data contain NAs, r_a_norms skips them, but sum_s_uk still sums the similarities of all neighbors, so each weighted sum is divided by a larger denominator than it should be. The normalizer therefore has to be computed per item as well, i.e., sum_s_uk should be a matrix, not a vector.
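To see the effect concretely, here is a toy base-R sketch (hypothetical numbers, not from the package) of what goes wrong for a single item:

```r
## Active user has 3 neighbors; one neighbor has not rated this item (NA).
s <- c(0.5, 0.5, 0.5)   # similarities to the 3 neighbors
r <- c(4,   NA,  2)     # the neighbors' ratings of one item

## The weighted sum correctly skips the NA rating:
num <- sum(r * s, na.rm = TRUE)   # 0.5*4 + 0.5*2 = 3

## Buggy normalization divides by the similarity sum over ALL neighbors:
num / sum(s)                      # 3 / 1.5 = 2  (too small)

## Correct normalization divides only by the weights of the neighbors
## that actually rated the item:
num / sum(s[!is.na(r)])           # 3 / 1.0 = 3
```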

This line:

sum_s_uk <- colSums(s_uk, na.rm=TRUE)

could be replaced with:

sum_s_uk.perPerson <- sapply(1:nrow(newdata), FUN = function(i) {
  d_neighbors <- s_uk[, i]
  r_neighbors <- as(model$data[neighbors[, i]], "matrix")
  ## mark available ratings with 1; NAs stay NA
  r_neighbors[!is.na(r_neighbors)] <- 1
  dr_neighbors <- r_neighbors * d_neighbors
  ## per-item sum of the similarities of neighbors that rated the item
  apply(dr_neighbors, 2, sum, na.rm = TRUE)
})

## Quick-and-dirty guard against division by 0 when an item has no ratings
## at all among the neighbors (all NAs): any non-zero, finite replacement
## works, because the corresponding r_a_norms entry is 0 anyway.
sum_s_uk.perPerson[sum_s_uk.perPerson == 0] <- 1

And this line:

ratings <- t(r_a_norms)/sum_s_uk

could be replaced with:

ratings <- t(r_a_norms/sum_s_uk.perPerson)

mhahsler commented 5 years ago

Thank you for the report. Please send me complete example code, the data, and the expected correct output; that way I can debug the code faster. Regards, Michael

aliko-str commented 5 years ago

tstFr.txt trM.txt

rmse <- function(v1, v2) {
  sqrt(sum((v1 - v2)^2) / length(v1))
}

tstFr <- read.table(file.path([DIR], "tstFr.txt"), header = TRUE,
                    sep = "\t", stringsAsFactors = FALSE)
trM <- as.matrix(read.table(file.path([DIR], "trM.txt"), header = TRUE,
                            sep = "\t", stringsAsFactors = FALSE))
trM.rrm <- as(trM, "realRatingMatrix")

p <- list(method = "cosine", nn = 10, sample = FALSE, normalize = "Z-score")
rec <- Recommender(trM.rrm, method = "UBCF", parameter = p)
pred <- predict(rec, trM.rrm, type = "ratingMatrix")

predTestFr <- merge(tstFr, as(pred, "data.frame"),
                    by.x = c("uid", "iid"), by.y = c("user", "item"))
rmse(predTestFr$rt, predTestFr$rating)  # returns 1.693067, should be 1.726648

mhahsler commented 5 years ago

Thank you for the helpful example. I have (hopefully) fixed the bug with the following changes to the code:

      ## similarity of the neighbors
      s_uk <- sapply(1:nrow(sim), FUN=function(x)
        sim[x, neighbors[,x]])

      ## calculate the weighted sum
      ratings<- t(sapply(1:nrow(newdata), FUN=function(i) {
        ## neighbors ratings of active user i
        r_neighbors <- as(model$data[neighbors[,i]], "dgCMatrix")
        ## normalize by the sum of weights only if a rating is available
        has_r_neighbors <- as(r_neighbors, "lgCMatrix")
        drop(as(crossprod(r_neighbors, s_uk[,i]), "matrix")) /
          drop(as(crossprod(has_r_neighbors, s_uk[,i]), "matrix"))
      }))
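To illustrate what the fixed expression computes, here is a minimal dense base-R sketch (hypothetical numbers; recommenderlab's sparse matrices store unrated cells as 0, which is what `R != 0` exploits below). The result is, per item, a weighted average over only the neighbors that supplied a rating:

```r
## Rows are neighbors, columns are items; 0 means "no rating".
R <- rbind(c(4, 0, 2),   # neighbor 1's ratings of items A, B, C
           c(0, 3, 5),   # neighbor 2
           c(2, 1, 0))   # neighbor 3
s <- c(0.5, 0.3, 0.2)    # similarities of the 3 neighbors to the active user

num   <- drop(crossprod(R,      s))  # weighted rating sum per item
denom <- drop(crossprod(R != 0, s))  # sum of weights where a rating exists
num / denom                          # per-item weighted average over raters only
```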

You can install the fixed version from GitHub. Please test and let me know if this resolves the bug. Regards, -Michael

aliko-str commented 5 years ago

The new code produces NAs when the entire neighborhood of the active user has no data (i.e., all ratings are NA). The previous version would produce 0 instead, which after de-normalization was basically the mean rating of the active user. I think that was a better approach than generating NAs in the output.

mhahsler commented 5 years ago

I was thinking about that, and I lean more toward the idea that if the algorithm cannot determine a valid rating, it should return NA. The main reasons are:

Maybe I should add a function that adds the users average rating if we don't know the rating?

aliko-str commented 5 years ago

Knowing how many times an algorithm failed may be of limited utility, though I wouldn't strongly argue against it in principle - some may need it.

Producing NAs may also bias performance estimates, particularly for real-world datasets that have more user-item ratings for popular items, and popular items often have above-average ratings. I'm not sure the bias will be normally distributed with mean=0, and if it isn't, we can't correct for it with bootstrapping and this becomes a problem.

There is also a somewhat philosophical question of why having 1 rating in a (potentially large) neighborhood should differ so drastically from having 0 ratings, but I guess it's a topic for those interested in Bayesian statistics, and we may want to avoid venturing there, since we are discussing the plain UBCF.

I guess having a separate function to replace NAs with user-average ratings would be a good compromise between the two viewpoints.

mhahsler commented 5 years ago

I completely agree with this statement:

There is also a somewhat philosophical question of why having 1 rating in a (potentially large) neighborhood should differ so drastically from having 0 ratings

A practical answer is that we should probably use a rather large neighborhood to make sure there are enough ratings for the averaging to make sense. There is also an issue with calculating similarities: I think it is currently biased towards picking users with only very few ratings. This could be somewhat reduced by requiring users in the database to have a minimum number of ratings...
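A sketch of that minimum-ratings filter in base R (hypothetical numbers; in recommenderlab terms the same condition could be expressed with `rowCounts()` on the `realRatingMatrix` before training):

```r
## Rows are users, columns are items; NA means "no rating".
M <- rbind(u1 = c(4, NA, 2, NA, 5),
           u2 = c(NA, NA, NA, 3, NA),   # only 1 rating: similarity unreliable
           u3 = c(2, 1, NA, 4, NA))

min_ratings <- 2
keep <- rowSums(!is.na(M)) >= min_ratings
M_filtered <- M[keep, , drop = FALSE]   # u2 is dropped from the database
```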

mhahsler commented 5 years ago

Here is a solution for the NAs: use a hybrid recommender, and the second recommender will fill in the NAs. I tried it with a POPULAR recommender, which uses item popularity and centers the ratings for the current user.

hybrid <- HybridRecommender(rec, rec_pop, weights = c(0.999, 0.001))

Note: You need to update to the latest version on GitHub.
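For completeness, a hedged end-to-end sketch of this workaround: `rec_pop` is not constructed in the thread, so its definition below, and the use of the MovieLense sample data shipped with recommenderlab in place of the user's own matrix, are assumptions:

```r
library(recommenderlab)
data(MovieLense)

## UBCF recommender (may produce NAs where a neighborhood has no ratings)
rec <- Recommender(MovieLense, method = "UBCF",
                   parameter = list(method = "cosine", nn = 10))
## POPULAR recommender as the fallback
rec_pop <- Recommender(MovieLense, method = "POPULAR")

## The tiny weight on POPULAR means it only matters where UBCF returns NA.
hybrid <- HybridRecommender(rec, rec_pop, weights = c(0.999, 0.001))
pred <- predict(hybrid, MovieLense[1:3], type = "ratings")
```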

aliko-str commented 5 years ago

Ok, thank you for your help and for publishing recommenderlab - it's been rather helpful.