Thanks for your discussion. I'd like to give you some explanations about this; hopefully they help answer your questions.
1). First of all, there are some incorrect metric calculations in LibRec, such as AUC, so the AUC calculation in CARSKit is incorrect too. But you can basically trust the other metrics.
2). You cannot apply a non-context-aware data set to CARSKit. As you mentioned, the data set applied to CARSKit should be in the following format: user,item,rating,context:na
3). In CARS, the evaluation differs from the traditional one in terms of ranking metrics. It is evaluated per (user, context) pair rather than per user, since we recommend a list of items to each (user, context) pair.
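For example, a context-unaware file prepared in that format would look like the following (the ids and ratings below are just made up for illustration; the trailing 1 marks the context:na condition):

user,item,rating,context:na
1,231,3,1
1,1014,5,1
2,231,4,1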
Hopefully those insights help you understand the differences between LibRec and CARSKit. Let me know if you have further questions.
Here is a sample of the data format you should prepare:
user,item,rating,time,location
1,applebees,1,weekday,school
1,burger king,4,weekday,school
1,carls jr,5,weekday,school
1,costco,5,weekday,school
1,el mazateno,1,weekday,school
1,kentucky fried chicken,5,weekday,school
1,mc donals,1,weekday,school
2,applebees,5,weekday,school
2,daruma,5,weekday,school
The system will recognize this format and convert it automatically.
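For illustration, the converted ratings_binary file will look roughly like the following: each context condition becomes a 0/1 column (the exact ids and column order may differ, and further conditions such as time:weekend would appear as additional columns if they occur in your data):

user,item,rating,time:weekday,location:school
1,applebees,1,1,1
1,burger king,4,1,1
2,applebees,5,1,1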
Thanks for your quick response. Regarding your explanations:
1). Good to know; I have not seen this mentioned anywhere, but it does explain some of the problems in the results.
2). I understand this, but as I showed in my initial message, I converted the context-unaware MovieLens 100K data set to a context-aware one with the context NA for every rating, as explained in the user guide. So the data set is indeed context-aware, but with only the context NA.
3). This also makes sense, but as I understand it, when every rating in a context-aware data set has context NA (so the last value of each row is 1 for context:na, as in my example), this basically "reduces" to a context-unaware variant. The library should (and in fact does) recommend items for (x, context=context:na) for all users, so there is only one context per user. Since all ratings share this same context:na, the context is irrelevant to the recommendation process, and a context-unaware recommender should therefore produce the same results as a context-aware one. This is also what I want: the results of context-unaware recommenders (such as ItemKNN, UserKNN and SVD++) serve as a baseline against which I can compare the context-aware algorithms later on.
This is the data set I use; as you can see, it is already context-aware with only one context:
Okay, if your data format and experiments are all correct, two remaining possible reasons come to mind:
1). Based on your experimental results, I suspect there are some differences in the evaluation between LibRec and CARSKit. Take the ranking evaluation for example: LibRec does not evaluate all items for ranking; there is a selection process for the item candidates. CARSKit followed this approach and made changes accordingly. I guess this is the main reason; you can double-check the evalRanking() function in Recommender.java. I will double-check it too.
2). You may also double-check the "CARSKit.Workspace" folder; there is a file named "ratings_binary", which is the final rating file used for prediction. You can verify whether this file is in the correct format.
Note that not all context-aware recommendation algorithms work better than non-contextual ones. It varies from domain to domain and data to data, especially depending on which context variables are used in the data. I know you have not gone further on this step yet; just FYI.
1). I am aware of this; that will be the point to investigate in my thesis. However, at the moment I am only looking at setting context-unaware baselines, so this should not be a problem for now.
2). Do you mean here that the split into training and test data is done randomly? That is a good point. I think I know a way of eliminating this: if I manually split the data into training and test sets and supply those to both libraries, which they support via the test-set option in evaluation.setup if I understand correctly (see the config sketch after this list), the comparison would be easier to make. I think we could then also compare the output files to see whether they have anything in common at all. Do you agree?
3). I have checked this and it looks exactly like the movielens100kratings.txt file I uploaded above, so that is fine.
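For reference, I would then configure the fixed split along these lines; the exact flag syntax is my assumption, so please correct me if the test-set option is spelled differently:

evaluation.setup=test-set -f ./data/movielens100k_test.txt

with the training file pointed to by the usual ratings path setting. That way both LibRec and CARSKit would evaluate on exactly the same held-out ratings.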
Yes, the evaluation is important; you can simply use a training/testing evaluation. As for cross-validation, there was previously a bug in LibRec, but I remember it was fixed. You can double-check the output files (the rating predictions) to see whether the folds are the same for different algorithms. I still suspect the evalRanking() function, where the CARSKit evaluation differs a little from the normal one in LibRec.
Ok, I will have another look at it tomorrow. Thanks for the feedback.
I believe I found the cause of the differences between LibRec and CARSKit: the CARSKit library includes already rated items in the recommendation list, while LibRec does not. This causes the other items to drop in the list (since already rated items almost always appear higher) or even fall out of it. However, already rated items are not counted as correctly predicted items, so they have a negative effect on all metrics.
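To make the effect concrete with a made-up example: if the top-10 list for some (user, context) pair contains 6 items that user already rated in training, at most 4 positions are left for test items, so Precision@10 for that pair can never exceed 0.4 regardless of how good the model is, and recall, MAP and NDCG are depressed in the same way.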
This behavior seems to be due to this code, where it becomes apparent that it is a deliberate choice. I understand the consideration, but after giving it some thought I believe the required behavior might differ depending on the use case:
Do you agree that these are different use cases that could be supported by CARSKit? I think it could be a setting, with the current behavior as the default. Having looked at it quickly, I believe the line I referred to above can be altered to support the second use case; implementing the third use case might be a bit more complicated.
Well, thanks for your finding. Yes, as I mentioned before, I guessed the difference lay in the evalRanking() function. Now I remember the reason for this behavior: in CARS data sets, users may rate the same item more than once (in different contexts), so it is not necessary to restrict the ranking to unique (user, item) pairs.
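For example (a hypothetical pair of rows in the same format as the restaurant sample above; the weekend/home values are made up):

1,applebees,1,weekday,school
1,applebees,4,weekend,home

The same (user, item) pair appears twice, once per context, which is why uniqueness has to be defined over (user, item, context) rather than (user, item).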
Based on your suggestions and concerns, we can actually add context as another constraint and make sure we do not add an item to the candidate list if the user has already rated that item in the specific context. What do you think about that? As you mentioned, the third case is complicated; we may only evaluate the algorithms in a uniform and general case. What do you think?
I have updated the evalRanking() in Recommender.java. Let me know if you have further questions.
I have a small comment on your commit.
Hello, the changes above remove items that have been rated by a given user in a given context from the candidate list used for evaluation.
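Roughly, the idea behind the change is the following check; this is a simplified, self-contained sketch of the filtering logic, not the exact code in the commit:

import java.util.*;

/** Simplified sketch of the candidate filtering idea (not the actual CARSKit code). */
public class CandidateFilterSketch {

    /**
     * Builds the ranking candidate list for one (user, context) pair,
     * skipping items the user has already rated in that specific context.
     *
     * @param allItems       all item ids in the data set
     * @param ratedInContext observed ratings, encoded as "user,item,context" keys
     */
    static List<Integer> candidateItems(Collection<Integer> allItems,
                                        Set<String> ratedInContext,
                                        int user, int context) {
        List<Integer> candidates = new ArrayList<>();
        for (int item : allItems) {
            String key = user + "," + item + "," + context;
            if (!ratedInContext.contains(key)) // context is now part of the constraint
                candidates.add(item);
        }
        return candidates;
    }

    public static void main(String[] args) {
        // user 1 already rated items 10 and 20 in context 0, so only item 30 remains
        Set<String> rated = new HashSet<>(Arrays.asList("1,10,0", "1,20,0"));
        System.out.println(candidateItems(Arrays.asList(10, 20, 30), rated, 1, 0)); // prints [30]
    }
}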
Also, if you are interested in revising and building the CARSKit library, please let me know. I will add you to the contributor list.
While going through the literature and the internet on context-aware recommender systems, I came across your CARSKit library, which looks very promising. I am interested in it for my master's thesis, in which I will compare the behavior of different context-aware recommender approaches and their context-unaware counterparts across different data sets, all of which is supported by CARSKit.
However, when I started experimenting with the library I came across some unexpected behavior. I started with the context-unaware MovieLens 100K data set, but it did not produce the results I expected based on the data on http://www.librec.net/example.html. As far as I understand, the context-unaware algorithms are exact copies of the implementations provided by the LibRec library, so CARSKit should produce at least similar results. Please correct me if I am wrong here.
The rating prediction results are similar, with maximum differences of 0.14% in RMSE and MAE, so not significant. However, I found that the results for top-N recommendation (ranking) differ significantly. For these, I obtained the following results:
Where "[[algorithm]] site" is what is reported on http://www.librec.net/example.html, "[[algorithm]] librec" is what is produced by LibRec v1.3 and "[[algorithm]] cars" is what is produced by CARSKit v0.2.0.
I summarized the results as follows, showing the relative difference between the results reported on the LibRec site and those produced by the two libraries for the ItemKNN and UserKNN algorithms, and the relative difference between LibRec and CARSKit for the SVD++ algorithm, which has no results on the site:
As you can see, all LibRec results are within 4% of what is reported on their site for all metrics except MAP and NDCG (I still need to figure out why that is, but it seems unrelated), whereas the CARSKit results differ by at least 30% and up to 90% for all metrics except AUC.
Of course I realize that LibRec could have it all wrong, both in the library and therefore also on their site, but as you state that CARSKit is based on LibRec, the difference should at least be explainable. Furthermore, my gut feeling says that LibRec has it right, as it is widely adopted and its results correspond closely to, for instance, the results reported by M. Levy and K. Jack in "Efficient Top-N Recommendation by Linear Regression" (RecSys 2013).
One final thing I can think of is that I used the wrong format to represent a context-unaware data set. Based on your user guide, I formatted the file as follows, which as I understand gives all ratings the context NA and should thus make the data set context-unaware:
So, summarizing, I was wondering whether you have any explanation for this behavior, whether this is a known problem, or whether this is the desired behavior. If not, do you have any idea what the cause of the difference could be? I am willing and able to dive into the code, but at first glance it seems similar to LibRec. Maybe you can indicate where this library differs significantly?
Hope this helps you and we can figure it out! Keep up the good work!
For reference, here are the log outputs for the runs I did to arrive at the above results, to show that I used the same settings for LibRec and CARSKit (based on what is reported on the LibRec site):