irecsys / CARSKit

Java-Based Context-aware Recommendation Library
https://carskit.github.io/
GNU General Public License v3.0

Results differ between CARSKit and LibRec? #1

Closed: basvank closed this issue 8 years ago

basvank commented 8 years ago

While going through the literature and online resources on context-aware recommender systems, I came across your CARSKit library, which looks very promising. I am interested in it for my master's thesis, in which I will compare the behavior of different context-aware recommender approaches and their context-unaware counterparts across different datasets, all of which is supported by CARSKit.

However, when I started experimenting with the library I came across some unexpected behavior. I started with the context-unaware MovieLens 100K dataset, but it did not produce the results I expected based on the data on http://www.librec.net/example.html. As far as I understand, the context-unaware algorithms are exact copies of the implementations provided by the LibRec library, so CARSKit should produce at least similar results. Please correct me if I am wrong here.

The rating prediction results are similar, with maximum differences of 0.14% in RMSE and MAE, so not significant. However, I found that the results for top-N recommendation (ranking) differ significantly. These are the results I found:

                Prec@5  Prec@10  Recall@5  Recall@10  AUC    MAP    NDCG   MRR
ItemKNN site    0.318   0.260    0.103     0.164      0.885  0.187  0.536  0.554
ItemKNN librec  0.321   0.259    0.105     0.162      0.907  0.093  0.198  0.550
ItemKNN cars    0.158   0.140    0.069     0.116      0.864  0.053  0.125  0.345
UserKNN site    0.338   0.280    0.116     0.182      0.884  0.208  0.554  0.569
UserKNN librec  0.327   0.278    0.115     0.181      0.915  0.104  0.214  0.556
UserKNN cars    0.089   0.098    0.033     0.078      0.803  0.023  0.071  0.202
SVD++ site      N/A     N/A      N/A       N/A        N/A    N/A    N/A    N/A
SVD++ librec    0.038   0.039    0.009     0.018      0.632  0.005  0.021  0.081
SVD++ cars      0.025   0.028    0.006     0.014      0.607  0.004  0.015  0.056

Where "[[algorithm]] site" is what is reported on http://www.librec.net/example.html, "[[algorithm]] librec" is what is produced by LibRec v1.3 and "[[algorithm]] cars" is what is produced by CARSKit v0.2.0.

So I summarized the results as follows, showing for ItemKNN and UserKNN the relative difference between the results reported on the LibRec site and those produced by each library, and for SVD++, which has no results on the site, the relative difference between LibRec and CARSKit:

                                      Prec@5   Prec@10  Recall@5  Recall@10  AUC     MAP      NDCG     MRR
ItemKNN % difference librec wrt site  0.94%    -0.38%   1.94%     -1.22%     2.49%   -50.27%  -63.06%  -0.72%
ItemKNN % difference cars wrt site    -50.31%  -46.15%  -33.01%   -29.27%    -2.37%  -71.66%  -76.68%  -37.73%
UserKNN % difference librec wrt site  -3.25%   -0.71%   -0.86%    -0.55%     3.51%   -50.00%  -61.37%  -2.28%
UserKNN % difference cars wrt site    -73.67%  -65.00%  -71.55%   -57.14%    -9.16%  -88.94%  -87.18%  -64.50%
SVD++ % difference cars wrt librec    -34.21%  -28.21%  -33.33%   -22.22%    -3.96%  -20.00%  -28.57%  -30.86%

So as you can see, all LibRec results are within 4% of what is reported on their site for all metrics except MAP and NDCG (I still need to figure out why that is, but it seems unrelated), whereas the CARSKit results differ by at least 30% and up to 90% for all metrics except AUC.

Of course I realize that LibRec could have it all wrong, both in the library and therefore also on their site, but as you state that CARSKit is based on LibRec, the difference should at least be explainable. Furthermore, my gut feeling says that LibRec has it right, as it is widely adopted and its results correspond closely to, for instance, the results reported by M. Levy and K. Jack in "Efficient Top-N Recommendation by Linear Regression" (RecSys 2013).

One final thing I can think of is that I used the wrong format to represent a context-unaware data set. Based on your user guide I formatted the file as follows, which, as I understand it, gives all ratings the context NA, so the data set should effectively be context-unaware:

user,item,rating,context:na
196,242,3,1
186,302,3,1
22,377,1,1
244,51,2,1
...
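For reference, this is roughly how I produced that file. The sketch below is not part of CARSKit; the u.data file name and its tab-separated user/item/rating/timestamp layout are assumptions about the raw MovieLens 100K download, and the output matches the layout shown above:

```java
import java.io.*;

// Minimal sketch: convert the tab-separated MovieLens 100K u.data file
// (user \t item \t rating \t timestamp) into the CSV layout shown above,
// with a single constant NA context column. File names are hypothetical.
public class MovieLensToCarsKit {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("u.data"));
             PrintWriter out = new PrintWriter(new FileWriter("movielens100kratings.txt"))) {
            out.println("user,item,rating,context:na"); // header row from the user guide format
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                // keep user, item, rating; drop the timestamp; "1" marks the NA context as active
                out.printf("%s,%s,%s,1%n", f[0], f[1], f[2]);
            }
        }
    }
}
```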

So, summarizing: do you have any explanation for this behavior? Is it a known problem, or is it the desired behavior? If not, do you have any idea what the cause of this difference could be? I am willing and able to dive into the code, but at first glance it looks similar to LibRec, so perhaps you can indicate where this library differs significantly?

Hope this helps you and we can figure it out! Keep up the good work!

For reference, here are the log outputs for the runs I did to arrive at the above results, to show that I used the same settings for LibRec and CARSKit (based on what is reported on the LibRec site):

UserKNN rating librec
[INFO ] 2016-02-01 xx:xx:xx,xxx -- UserKNN,0.736409,0.943499,0.184102,0.699130,0.988028,0.576380,,60, PCC, 25,'xx:xx','xx:xx'
ItemKNN rating librec
[INFO ] 2016-02-01 xx:xx:xx,xxx -- ItemKNN,0.723676,0.923718,0.180919,0.686630,0.970820,0.572490,,40, PCC, 2500,'xx:xx','xx:xx'

UserKNN rating cars
[INFO ] 2016-02-01 xx:xx:xx,xxx -- Final Results by UserKNN, MAE: 0.736947, RMSE: 0.944363, NAME: 0.184237, rMAE: 0.700390, rRMSE: 0.988958, MPE: 0.000000, 60, PCC, 25, Time: 'xx:xx','xx:xx'
ItemKNN rating cars
[INFO ] 2016-02-01 xx:xx:xx,xxx -- Final Results by ItemKNN, MAE: 0.724341, RMSE: 0.924782, NAME: 0.181085, rMAE: 0.687070, rRMSE: 0.972629, MPE: 0.000000, 40, PCC, 2500, Time: 'xx:xx','xx:xx'

UserKNN top N librec
[INFO ] 2016-02-01 xx:xx:xx,xxx -- UserKNN,0.327498,0.277956,0.114772,0.180838,0.914600,0.103961,0.213676,0.555877,,80, COS, 50,'xx:xx','xx:xx'
ItemKNN top N librec
[INFO ] 2016-02-01 xx:xx:xx,xxx -- ItemKNN,0.320504,0.259231,0.104593,0.161601,0.907029,0.092651,0.197906,0.550298,,80, COS, 50,'xx:xx','xx:xx'

UserKNN top N cars
[INFO ] 2016-02-01 xx:xx:xx,xxx -- Final Results by UserKNN, Pre5: 0.089234,Pre10: 0.098344, Rec5: 0.033394, Rec10: 0.077723, AUC: 0.803486, MAP: 0.022585, NDCG: 0.070543, MRR: 0.201989, 80, COS, 50, Time: 'xx:xx','xx:xx'
ItemKNN top N cars
[INFO ] 2016-02-01 xx:xx:xx,xxx -- Final Results by ItemKNN, Pre5: 0.157872,Pre10: 0.139542, Rec5: 0.069404, Rec10: 0.115820, AUC: 0.863906, MAP: 0.052911, NDCG: 0.124604, MRR: 0.344676, 80, COS, 50, Time: 'xx:xx','xx:xx'

SVD++ rating librec
[INFO ] 2016-02-01 xx:xx:xx,xxx -- SVD++,0.718764,0.912503,0.179691,0.681520,0.956593,0.575460,,5, 0.01, -1.0, 0.1, 0.1, 0.1, 100, true,'xx:xx','xx:xx'

SVD++ rating cars
[INFO ] 2016-02-01 xx:xx:xx,xxx -- Final Results by SVD++, MAE: 0.720267, RMSE: 0.913879, NAME: 0.180067, rMAE: 0.682750, rRMSE: 0.958558, MPE: 0.000000, numFactors: 5, numIter: 100, lrate: 0.01, maxlrate: -1.0, regB: 0.1, regU: 0.1, regI: 0.1, regC: 0.1, isBoldDriver: true, Time: 'xx:xx','xx:xx'

SVD++ top N librec
[INFO ] 2016-02-01 xx:xx:xx,xxx -- SVD++,0.038287,0.039476,0.009236,0.018094,0.632358,0.005420,0.020654,0.081348,,5, 0.01, -1.0, 0.1, 0.1, 0.1, 100, true,'xx:xx','xx:xx'

SVD++ top N cars
[INFO ] 2016-02-01 xx:xx:xx,xxx -- Final Results by SVD++, Pre5: 0.025109,Pre10: 0.027889, Rec5: 0.006315, Rec10: 0.014330, AUC: 0.607030, MAP: 0.003629, NDCG: 0.014905, MRR: 0.056455, numFactors: 5, numIter: 100, lrate: 0.01, maxlrate: -1.0, regB: 0.1, regU: 0.1, regI: 0.1, regC: 0.1, isBoldDriver: true, Time: 'xx:xx','xx:xx'
irecsys commented 8 years ago

Thanks for your discussion. I'd like to give you some explanations about this; hopefully they help answer your questions.

1). First of all, there are some incorrect metric calculations in LibRec, such as AUC, so the AUC calculation in CARSKit is incorrect too. But basically you can trust the other metrics.

2). You cannot apply a non-context-aware data set to CARSKit; as you mentioned, the data set applied to CARSKit should be in the following format: user,item,rating,context:na

3). In CARS, the ranking evaluation differs from the traditional one: it is evaluated per user per context, since we recommend a list of items to each (user, context) pair (see the sketch below).
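A toy illustration of point 3), assuming a minimal hypothetical Rating class (this is not the actual CARSKit evalRanking() code): a traditional evaluation averages ranking metrics over users, while the context-aware evaluation averages them over (user, context) pairs.

```java
import java.util.*;

// Toy sketch (not CARSKit code): the evaluation unit changes from "user"
// to "(user, context) pair" in context-aware ranking evaluation.
public class RankingUnits {
    static class Rating {
        final int user;
        final String context;
        Rating(int user, String context) { this.user = user; this.context = context; }
    }

    public static void main(String[] args) {
        List<Rating> testSet = Arrays.asList(
                new Rating(1, "na"), new Rating(1, "na"),
                new Rating(2, "weekday"), new Rating(2, "weekend"));

        Set<Integer> perUser = new HashSet<>();        // context-unaware evaluation units
        Set<String> perUserContext = new HashSet<>();  // context-aware evaluation units
        for (Rating r : testSet) {
            perUser.add(r.user);
            perUserContext.add(r.user + "@" + r.context);
        }

        System.out.println("per-user units: " + perUser.size());                  // 2
        System.out.println("per-(user,context) units: " + perUserContext.size()); // 3
    }
}
```

With a data set in which every rating carries only the NA context, the two groupings coincide, so in that special case one would expect both evaluations to rank over the same units.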

Hopefully those insights can help you understand the difference between librec and CARSKit. Let me know if you have further questions.

irecsys commented 8 years ago

A sample of the data format you should prepare:

user,item,rating,time,location
1,applebees,1,weekday,school
1,burger king,4,weekday,school
1,carls jr,5,weekday,school
1,costco,5,weekday,school
1,el mazateno,1,weekday,school
1,kentucky fried chicken,5,weekday,school
1,mc donals,1,weekday,school
2,applebees,5,weekday,school
2,daruma,5,weekday,school

The system will recognize it and convert it.

basvank commented 8 years ago

Thanks for your quick response. Regarding your explanations:

1). Good to know. I had not seen this mentioned anywhere, but it indeed explains some of the problems in the results.

2). I understand this, but as I showed in my initial message, I converted the context-unaware MovieLens 100K data set to a context-aware one with the context NA for each and every rating, as explained in the user guide. So the data set is indeed context-aware, but with only the context NA.

3). This also makes sense, but as I understand it, this means that when every rating in the data set has context NA, as in the example I showed (the last value of each row is 1 for context:na), the evaluation essentially reduces to the context-unaware variant. The library should (and in fact does) recommend items for (user, context=context:na) for all users, so there is only one context per user; since all ratings in the data are in this same context, the context is irrelevant to the recommendation process. In this case a context-aware recommender should therefore produce the same results as a context-unaware one. This is also what I want, because I need the results of context-unaware recommenders (such as ItemKNN, UserKNN and SVD++) as a baseline against which to compare the context-aware algorithms later on.

basvank commented 8 years ago

This is the data set I use; as you can see, it is already context-aware with only one context:

movielens100kratings.txt

irecsys commented 8 years ago

Okay, if your data format and experiments are correct, two remaining reasons come to mind:

1). Based on your experimental results, I guess there are some differences in the evaluation between librec and CARSKit. Take the ranking evaluation for example: librec does not evaluate all items for ranking; there is a selection process for the item candidates. CARSKit followed this approach and made changes accordingly. I guess this is the main reason. You can double check the evalRanking() function in Recommender.java; I will double check it too.

2). You may also check the "CARSKit.Workspace" folder; there is a file named "ratings_binary" which is the final rating file used for prediction. You can double check whether this file is in the correct format or not.

Note that not every context-aware recommender works better than its non-contextual counterpart. It varies from domain to domain and from data set to data set, especially depending on which context variables you use in the data. I know you have not gone that far yet; just FYI.

basvank commented 8 years ago

1). I am aware of this; that will be the point to investigate in my thesis. However, at the moment I am only looking at establishing context-unaware baselines, so this should not be a problem as of now.

2). Do you mean here that the split into training and test data is done randomly? That is a good point. I think I know a way of eliminating this: if I manually split the data into training and test sets and supply those to both libraries, which they both support via the test-set option in evaluation.setup if I understand correctly, the comparison would be easier to make (see the sketch after this list). I think we can then also compare the output files to see whether they have anything in common at all. Do you agree?

3). I have checked this and it looks exactly like the movielens100kratings.txt file I uploaded above, so that is fine.
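A rough sketch of the manual-split idea from point 2), assuming the CARSKit-formatted file from above and an arbitrary 80/20 ratio and seed; the file names are hypothetical, and in practice each library may still need the split in its own input format:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Rough sketch of the manual-split idea: shuffle the rating lines with a fixed
// seed and write the train/test files once, so both LibRec and CARSKit can be
// pointed at identical folds via their test-set evaluation options.
public class FixedSplit {
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(Paths.get("movielens100kratings.txt")));
        String header = lines.remove(0);              // keep the CARSKit header line aside
        Collections.shuffle(lines, new Random(42L));  // fixed seed -> reproducible split
        int cut = (int) (lines.size() * 0.8);         // assumed 80/20 train/test split

        write("train.txt", header, lines.subList(0, cut));
        write("test.txt", header, lines.subList(cut, lines.size()));
    }

    private static void write(String file, String header, List<String> body) throws IOException {
        List<String> out = new ArrayList<>();
        out.add(header);
        out.addAll(body);
        Files.write(Paths.get(file), out);
    }
}
```

Because the shuffle uses a fixed seed, rerunning the split yields identical train/test files, so both libraries can be evaluated on exactly the same fold.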

irecsys commented 8 years ago

Yes, the evaluation is important; you can simply use training-testing evaluation. In terms of cross validation, there was previously a bug in librec, but I remember it was fixed. You can double check the output files (the rating predictions) to see whether the same folds are used for the different algorithms. I still suspect the evalRanking() function, where the evaluation is a little different from the normal one in librec and CARSKit.

basvank commented 8 years ago

Ok, I will have another look at it tomorrow. Thanks for the feedback.

basvank commented 8 years ago

I believe I found the cause of the differences between LibRec and CARSKit: the CARSKit library includes already rated items in the recommendation list, while LibRec does not. This causes the other items to drop in the list (as already rated items almost always appear higher) or even fall out of it. However, already rated items are not counted as correctly predicted items, so they have a negative effect on all metrics.

It seems that this behavior is due to this code, where it becomes apparent that this is a deliberate choice. I understand this consideration, but after giving it some thought I believe the required behavior might differ depending on the use case:

  1. The case supported by the current behavior, where already rated items are also shown in the top-N recommendation list
  2. A case where already rated items are not shown in the context in which they were rated, but are shown in other contexts for a particular user. For instance, when I have watched a movie at home alone, I do not want that movie recommended the next time I am at home alone. However, when I am with a friend who has not seen that particular movie, I might want it recommended, because I want to show it to him and am willing to watch it again if it was very good. This would mean that the exact (user, item, context) combinations that appear in the training set should be filtered from the recommendations.
  3. A case where users do not want to see items they have rated/bought in any context at all, but the context information is used to improve the recommendations for other users. This is actually my use case. For instance, in a webshop, purchased items do not have to be shown to buyers again, but the fact that they bought an item in a certain context (for instance at the weekend) can help improve the recommendations for other users of the system. This would mean that all (user, item) combinations that appear in the training set, regardless of context, should be filtered from the recommendations.

Do you agree that these are different use cases that could be supported by CARSKit? I think it could be a setting, with the current behavior as the default. Having looked at it quickly, I believe the line I referred to before can be altered to support the second use case; implementing the third use case might be a bit more complicated.
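To make the distinction concrete, here is a conceptual sketch of how the two filtering variants could differ when building the candidate list for a (user, context) pair; the class and method names are hypothetical and this is not the actual Recommender.evalRanking() code:

```java
import java.util.*;

// Hypothetical helper (not CARSKit code) contrasting the two filtering rules
// for candidate items discussed in use cases 2 and 3 above.
public class CandidateFilter {

    /** (user, item, context) triples seen in the training set. */
    private final Set<String> ratedInContext = new HashSet<>();
    /** (user, item) pairs seen in the training set, ignoring context. */
    private final Set<String> ratedAnyContext = new HashSet<>();

    public void observeTrainingRating(int user, int item, String context) {
        ratedInContext.add(user + "," + item + "," + context);
        ratedAnyContext.add(user + "," + item);
    }

    /** Use case 2: hide an item only in the context where it was already rated. */
    public boolean keepForUseCase2(int user, int item, String context) {
        return !ratedInContext.contains(user + "," + item + "," + context);
    }

    /** Use case 3: hide an item in every context once the user has rated it anywhere. */
    public boolean keepForUseCase3(int user, int item, String context) {
        return !ratedAnyContext.contains(user + "," + item);
    }
}
```

Use case 2 only needs the (user, item, context) triples already seen during training, while use case 3 additionally needs the context-independent (user, item) pairs, which is why the third case touches more of the candidate-selection logic.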

irecsys commented 8 years ago

Well, thanks for your finding. Yes, as I mentioned before, I guessed the difference lay in the evalRanking() function. Now I remember the reason for this behavior: in a CARS data set, users may rate an item more than once (in different contexts), so it is not necessary to restrict the ranking to a unique (user, item) pair.

Based on your suggestions and concerns, we could actually add context as another constraint and make sure we do not add an item to the candidate list if the user has already rated it in that specific context. What do you think about that? As you mentioned, the third case is complicated; we may only be able to evaluate the algorithms in a uniform and general case. What do you think?

irecsys commented 8 years ago

I have updated the evalRanking() in Recommender.java. Let me know if you have further questions.

basvank commented 8 years ago

I have left a small comment on your commit.

irecsys commented 8 years ago

Hello, the changes above remove items that have been rated by a given user in a given context from the candidate list used for evaluation.

Also, if you are interested in revising and building the CARSKit library, please let me know. I will add you to the contributor list.