Closed · aliko-str closed this issue 4 years ago
Thank you for the report. Please send me a complete example code, the data and what the expected correct output should be. This way I can debug the code faster. Regards, Michael
rmse <- function(v1, v2) {
  sqrt(sum((v1 - v2)^2) / length(v1))
}
tstFr <- read.table(file.path([DIR], "tstFr.txt"), header = TRUE, sep = "\t", stringsAsFactors = FALSE)
trM <- as.matrix(read.table(file.path([DIR], "trM.txt"), header = TRUE, sep = "\t", stringsAsFactors = FALSE))
trM.rrm <- as(trM, "realRatingMatrix")
p <- list(method = "cosine", nn = 10, sample = FALSE, normalize = "Z-score")
rec <- Recommender(trM.rrm, method = "UBCF", parameter = p)
pred <- predict(rec, trM.rrm, type = "ratingMatrix")
predTestFr <- merge(tstFr, as(pred, "data.frame"), by.x = c("uid", "iid"), by.y = c("user", "item"))
rmse(predTestFr$rt, predTestFr$rating)  # returns 1.693067, should be 1.726648
Thank you for the helpful example. I have (hopefully) fixed the bug with the following changes to the code:
## similarity of the neighbors
s_uk <- sapply(1:nrow(sim), FUN = function(x)
  sim[x, neighbors[, x]])

## calculate the weighted sum
ratings <- t(sapply(1:nrow(newdata), FUN = function(i) {
  ## neighbors' ratings of active user i
  r_neighbors <- as(model$data[neighbors[, i]], "dgCMatrix")
  ## normalize by the sum of weights only if a rating is available
  has_r_neighbors <- as(r_neighbors, "lgCMatrix")
  drop(as(crossprod(r_neighbors, s_uk[, i]), "matrix")) /
    drop(as(crossprod(has_r_neighbors, s_uk[, i]), "matrix"))
}))
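The idea behind the fix can be illustrated with a small stand-alone sketch (plain R with made-up numbers, not the package internals):

```r
## One item, three neighbors of the active user; neighbor 2 has no rating.
r <- c(4, NA, 5)            # neighbors' ratings for the item
s <- c(0.9, 0.8, 0.7)       # similarities (weights) of the neighbors

## old behavior: divide by the sum over ALL neighbor weights
old_est <- sum(r * s, na.rm = TRUE) / sum(s)          # 7.1 / 2.4

## fixed: divide only by the weights of neighbors that rated the item
has_r <- !is.na(r)
new_est <- sum(r * s, na.rm = TRUE) / sum(s[has_r])   # 7.1 / 1.6
```

The fixed estimate (4.4375) stays within the range of the observed neighbor ratings, while the old one (about 2.96) is dragged down by the missing rating.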
You can install the fixed version from GitHub. Please test and let me know if this resolves the bug. Regards, -Michael
The new code produces NAs for cases where the entire neighborhood of the active user has no data (i.e., all ratings are NA). The previous version would produce 0 instead, which after de-normalization was basically the mean rating of the active user. I think that was a better approach than generating NAs in the output.
I was thinking about that, and I lean more toward the idea that if the algorithm cannot determine a valid rating, then it should return NA. The main reasons are:
Maybe I should add a function that fills in the user's average rating when we don't know the rating?
Knowing how many times an algorithm failed may be of limited utility, though I wouldn't strongly argue against it in principle; some may need it.
Producing NAs may also bias performance estimates, particularly for real-world datasets that have more user-item ratings for popular items, and popular items often have above-average ratings. I'm not sure the bias will be normally distributed with mean=0, and if it isn't, we can't correct for it with bootstrapping and this becomes a problem.
There is also a somewhat philosophical question of why having 1 rating in a (potentially large) neighborhood should differ so drastically from having 0 ratings, but I guess it's a topic for those interested in Bayesian statistics, and we may want to avoid venturing there, since we are discussing the plain UBCF.
I guess having a separate function to replace NAs with user-average ratings would be a good compromise between the two viewpoints.
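Such a helper could look roughly like this (a sketch operating on plain matrices, e.g. obtained via `as(pred, "matrix")`; the function name and signature are made up and not part of recommenderlab):

```r
## Replace NA predictions with the predicting user's mean training rating.
## pred_mat, train_mat: plain numeric matrices (users x items) with the
## same row order.
fill_na_with_user_mean <- function(pred_mat, train_mat) {
  user_means <- rowMeans(train_mat, na.rm = TRUE)
  idx <- which(is.na(pred_mat), arr.ind = TRUE)
  pred_mat[idx] <- user_means[idx[, "row"]]
  pred_mat
}
```

One caveat: a user with no training ratings at all would still get NaN from `rowMeans`, so a global-mean fallback might be needed for that corner case.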
I completely agree with this statement:
There is also a somewhat philosophical question of why having 1 rating in a (potentially large) neighborhood should differ so drastically from having 0 ratings
A practical answer is that we probably should use a rather large neighborhood to make sure there are enough ratings so the averaging makes sense. There is also an issue with calculating similarities. I think it is currently biased towards picking users with only very few ratings. This can be somewhat reduced by requiring the users in the database to have a minimum number of ratings...
Here is a solution for the NAs: use a hybrid recommender, so the second recommender will fill in the NAs. I tried it with a POPULAR recommender, which uses item popularity and centers the ratings for the current user.
hybrid <- HybridRecommender(rec, rec_pop, weights = c(0.999, 0.001))
Note: You need to update to the latest version on GitHub.
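For reference, here is a minimal end-to-end sketch of that setup (the toy matrix and variable names are made up; `Recommender`, `HybridRecommender`, and the UBCF/POPULAR methods are the package's own):

```r
library(recommenderlab)

set.seed(42)
## tiny toy rating matrix standing in for real data
m <- matrix(sample(c(NA, 1:5), 60, replace = TRUE,
                   prob = c(0.3, rep(0.14, 5))),
            nrow = 6,
            dimnames = list(paste0("u", 1:6), paste0("i", 1:10)))
rrm <- as(m, "realRatingMatrix")

rec_ubcf <- Recommender(rrm, method = "UBCF", parameter = list(nn = 3))
rec_pop  <- Recommender(rrm, method = "POPULAR")

## near-zero weight on POPULAR: it barely shifts UBCF's predictions,
## but supplies a value wherever UBCF alone would produce NA
hybrid <- HybridRecommender(rec_ubcf, rec_pop, weights = c(0.999, 0.001))
pred <- predict(hybrid, rrm, type = "ratingMatrix")
```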
Ok, thank you for your help and for publishing recommenderlab - it's been rather helpful.
The issues are with these lines:
If the training data contain NAs, r_a_norms doesn't count them, but sum_s_uk still contains the 'proximity' to all neighbors, which leads to normalization by a larger number than it should be. Therefore sum_s_uk should be a matrix, not a vector.
sum_s_uk <- colSums(s_uk, na.rm = TRUE)  # <-- this line could be replaced with these:
And this line:
ratings <- t(r_a_norms)/sum_s_uk
could be replaced with this:
ratings <- t(r_a_norms/sum_s_uk.perPerson)
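The difference in shapes can be sketched numerically. In this plain-R sketch the names `r_a`, `s_i`, and `sum_s_uk.perItem` are illustrative, not the package's actual objects; for a single active user, the per-user-per-item matrix of weight sums reduces to a per-item vector:

```r
## 3 neighbors x 2 items; some neighbor ratings are missing
r_a <- matrix(c(4, NA, 5,
                3,  4, NA), nrow = 3)   # neighbors x items
s_i <- c(0.9, 0.8, 0.7)                 # similarities of the 3 neighbors

## vector version: a single weight total, shared by every item
sum_s_uk <- sum(s_i)                    # 2.4 for both items

## per-item version: count only the neighbors that rated each item
sum_s_uk.perItem <- colSums(!is.na(r_a) * s_i)   # 1.6 and 1.7

## weighted averages using the per-item totals
est <- colSums(ifelse(is.na(r_a), 0, r_a) * s_i) / sum_s_uk.perItem
```

Dividing by the shared total (2.4) would shrink both estimates toward zero whenever any neighbor's rating is missing, which is exactly the normalization bias described above.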