Extending timeSVD++ with a visual component

trimkaleci commented 6 years ago

Hello to everyone,

I want to extend the timeSVD++ model with an additional component, which is a visual component presented here, in TVBPR model respectivelly. But, while thinking about it, I am not sure about how to train the model, as the models mentioned above use different learning approaches. One(timeSVD++) is trained using explicit information (i.e. ratings), whereas the other one (TVBPR) is based on BPR (Bayesien Personalized Ranking) learning approach (i.e. making use of positive items and negative items).

Could anyone of you please suggest me with any idea if this would be possible, or any help?

Note: the extended version of timeSVD++ will be used in the context of fashion data, and thus will be tested using the already provided data by Amazon (link to the data, here).

Thank you very much in advance!

trimkaleci commented 6 years ago

Does anyone have any idea with regard to the question above?

Thanks a lot in advance!

SunYatong commented 6 years ago

Hi, @leo1023 . In my opinion, BPR is only suitable for ranking task and TimeSVD is designed for rating prediction task. so if you want to combine the two models, you can take the idea of how TVBPR modeling the visual information and use it with TimeSVD.

I haven't read TVBPR's paper yet. Maybe I will give you a more concrete answer after reading it.

trimkaleci commented 6 years ago

Hi @SunYatong . Thanks a lot for writing!

Yes, that was confusing for me, since we deal with algorithms which are designed for different tasks. I have already done an implementation of the combination of both models, and I use similar idea for modeling the visual information. While using the error (error = predicted_rating - real_rating) for learning the model, I was getting high values (which were going to infinity), because of the visual information being involved, thus it doesn't make sense using the error in this context. After that, I did not use the explicit feedback(i.e. ratings) for learning the model, but only the information about which items were rated, and thus the model parameters were learned similarly as in TVBPR (where positive items and negative items are considered), but instead I considered only the positive items. For example, in TVBPR, the parameters are learned with respect to deri = 1 / (1 + exp(x_u,i - x_u,j)) where x_u,i is the predicted value for the positive item, whereas the x_u,j is the predicted value for the negative item. Whereas, with regard to timeSVD++, as only positive items are considered, the parameters are learned with respect to deri = 1 / (1 + exp(-x_u,i)).

Then, for evaluating the performance of the three models, I am using the AUC as defined in the paper where TVBPR is presented. The data for testing and validation set are split by using 2 -leave-out cross validation, such that for each user there is one item in the test set and one in the validation set. Then, the AUC for single user is calculated (if we evaluate the model on the test set) by calculating the preference value of the testing item, and then calculating the preference values for all negative items with respect to the user being considered. A correct prediction is considered if the preference value of the testing item is greater than the preference value of the negative item (this is based on the theory of BPR where is stated that positive items should be more preferred compared to negative items). Counting the number of total correct predictions against the total number of negative items, we would have the probability that the model would predict correctly for that user. At the end, having AUC calculated for all users, we calculate the total AUC showing the performance of the model, by summing up the AUC of each user and dividing it by the number of users.

For the moment, I get the following results: - timeSVD++: AUC = 0.5508 - TVBPR: AUC = 0.7010 - timeSVD++ plus the visual component from TVBPR: AUC = 0.43811

But, I am not sure why do I get these results. For example, what might be the reason I get smaller value of AUC on timeSVD++? Or, why do I get greater value of AUC when using the TVBPR model?

Do you have any idea what might be the reasons of having these results? Or, what results would you expect and why?

I would appreciate any idea or advice from your side!

Thank you very much in advance!!

SunYatong commented 6 years ago

Hi, the pairwise loss is used to maximize the difference between positive item and negative item, but you said "with regard to timeSVD++, as only positive items are considered, the parameters are learned with respect to deri = 1 / (1 + exp(-x_u,i))". And I don't understand how do you build your loss function.

Besides, you'd better print the loss after each iteration to see if it is reduced.

On Sat, Dec 16, 2017 at 6:14 PM, Leo10 notifications@github.com wrote:

Hi @SunYatong https://github.com/sunyatong . Thanks a lot for writing!

Yes, that was confusing for me, since we deal with algorithms which are designed for different tasks. I have already done an implementation of the combination of both models, and I use similar idea for modeling the visual information. While using the error (error = predicted_rating - real_rating) for learning the model, I was getting high values (which were going to infinity), because of the visual information being involved, thus it doesn't make sense using the error in this context. After that, I did not use the explicit feedback(i.e. ratings) for learning the model, but only the information about which items were rated, and thus the model parameters were learned similarly as in TVBPR (where positive items and negative items are considered), but instead I considered only the positive items. For example, in TVBPR, the parameters are learned with respect to deri = 1 / (1

exp(x_u,i - x_u,j)) where x_u,i is the predicted value for the positive item, whereas the x_u,j is the predicted value for the negative item. Whereas, with regard to timeSVD++, as only positive items are considered, the parameters are learned with respect to deri = 1 / (1 + exp(-x_u,i)).

Then, for evaluating the performance of the three models, I am using the AUC as defined in the paper where TVBPR is presented. The data for testing and validation set are split by using 2 -leave-out cross validation, such that for each user there is one item in the test set and one in the validation set. Then, the AUC for single user is calculated (if we evaluate the model on the test set) by calculating the preference value of the testing item, and then calculating the preference values for all negative items with respect to the user being considered. A correct prediction is considered if the preference value of the testing item is greater than the preference value of the negative item (this is based on the theory of BPR where is stated that positive items should be more preferred compared to negative items). Counting the number of total correct predictions against the total number of negative items, we would have the probability that the model would predict correctly for that user. At the end, having AUC calculated for all users, we calculate the total AUC showing the performance of the model, by summing up the AUC of each user and dividing it by the number of users.

For the moment, I get the following results: - timeSVD++: AUC = 0.5508 - TVBPR: AUC = 0.7010 - timeSVD++ plus the visual component from TVBPR: AUC = 0.43811

But, I am not sure why do I get these results. For example, what might be the reason I get smaller value of AUC on timeSVD++? Or, why do I get greater value of AUC when using the TVBPR model?

Do you have any idea what might be the reasons of having these results? Or, what results would you expect and why?

I would appreciate any idea or advice from your side!

Thank you very much in advance!!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/guoguibing/librec/issues/227#issuecomment-352174124, or mute the thread https://github.com/notifications/unsubscribe-auth/AQWU7sjdbz_6esXSf5IxA6NS3iFlXYPxks5tA5gQgaJpZM4Q5IYS .

trimkaleci commented 6 years ago

Hi @SunYatong! Actually, "with respect to timeSVD++", I meant timeSVD++ with the visual component included. I am a bit confused here about how to build the loss function. What would you suggest? Because for learning the visual dimensions, there is an embedding matrix E introduced. Which, as described on the paper, it is defined as follows:

"Let fi denote the Deep CNN features of item i and F represent its number of dimensions (F = 4096). We further introduce a K X F embedding matrix E to linearly embed the high-dimensional feature vector fi into a much lower-dimensional (i.e., K, can be set to 20) visual style space. Namely, we take: theta_item = E fi "

Then we learn the values of the embedding matrix, and thus we would have the visual space. As I want to increase the values of the embedding matrix E, I am using deri = 1 / (1 + exp(-x_u,i)) as using the error gives me high values (this way, by using "deri", I am trying to have smaller value).

I am not really sure about how I can build the loss function. Can you please give me an idea?

Thanks a lot!!!

SunYatong commented 6 years ago

Hi, As you are solving a ranking problem with implicit feedback, if you want to use pointwise loss, you should use sigmoid to bound your predicted rating into (0,1) first and use log loss: -ylogy' - (1-y)log(1-y'), which means minimizing the difference between your predicted distribution and the real distribution.

If you are using the pairwise loss, your goal is to maximize the difference between the positive predictions and negative predictions, your loss function should be minimizing: -log(sigmoid(y_positive-y_negative))

trimkaleci commented 6 years ago

Hi @SunYatong , thanks a lot for your clarification! I just wanted to make sure:

Firstly, I will have to bound the predicted rating into (0, 1) range using sigmoid function, which should be as in the following: y' = 1/(1 + exp(-predicted_rating')), (should I also bound to (0,1) the real rating?). Then, having bound the predicted rating into (0,1) range (and maybe bounding the real rating to 0-1), we calculate the error by using the following calculation: error = -ylogy' - (1-y)log(1-y'). Once, having this, we can start the learning process (e.g. for the item bias, we would do it item_bias += lrate * error - regTerm).

Is that right?

Thank you!!

SunYatong commented 6 years ago

Hi, @leo1023

Yes, the real ratings should also be binarized. In LibRec we use the configuration "data.convert.binarize.threshold=?" to convert the explicit feedback into implicit feedback.
Your loss function is right.

trimkaleci commented 6 years ago

Hi @SunYatong! I tried what you propose, but as the predicted value is always a big value (e.g. when sigmoid function is applied in high values because of the image features , let's say of 178, I get 1.0 as output), then when log is applied to 1.0, I get 0. I am not really sure how to continue with this.

I am sharing my code with you, please if you have, I would really appreciate it! Here is the link to the code of timeSVD++ extended with the visual component: link to the code.

Please also find here the link to the file of the image features.

Thank you very much in advance!!

trimkaleci commented 6 years ago

Hi @SunYatong! I have one more question:

What would be the reason of timeSVD++ (the normal timeSVD++ without any extension) performing worse than TVBPR when using the evaluation methodology mentioned above (AUC)?

Thank you!!

SunYatong commented 6 years ago

Hi, @leo1023

For the loss question

Maybe you should sample some negative feedback, i.e., sample some zeros into your train set.
Or you can try pairwise loss like BPR.

For the performance question:

AUC is a ranking measurement. But TimeSVD is designed for rating prediction task, which may not have good performance on ranking. And TVBPR is designed directly for ranking.

trimkaleci commented 6 years ago

Hi @SunYatong !

I have done an implementation of timeSVD++ with the visual component included by using the pairwise loss like BPR. Now, when running it on fashion data, I get better performance from timeSVD++ extended with the visual component compared to TVBPR. In order to find the reason of why I get a better performance of the extended version of timeSVD++ compared to TVBPR, I experimented with them on different settings, as shown below:

epochs set to 5: AUC for timeSVDpp_visual_compnent is AUC = 0.7039, AUC for TVBPR is AUC = 0.8181
epochs set to 10: timeSVDpp_visual_compnent AUC = 0.8051, TVBPR AUC = 0.6830
epochs set to 20: timeSVDpp_visual_compnent AUC = 0.7872, TVBPR AUC = 0.6815

Now, I am experimenting by setting the number of non-visual factors to different values (i.e. from 10 to 50), and see how it influences the performance of the models. But, I am still not sure why the performance of the extended version of timeSVD++ is better than TVBPR. Could it be that we add more parameters for the factors of the users and make the non-visual factors of users time-dependent and also make use of implicit feedback (as it done on timeSVD++, Ru set), whereas in TVBPR the non-visual factors stays static (see paper here ? If this is the case, can you please give me any idea or suggestion of how can I prove it?

P.S. Please here find my implementation of TVBPR, which on dataset of 4315 actions (i.e. ratings), it takes 2 to 3 days for training the parameters of the model, and I cannot figure out why it is taking so much time, or it is normal?

Thank you very much for any help in advance!

SunYatong commented 6 years ago

Hi, @leo1023 . First of all congratulate you that you have implemented your model! And here are some suggestions for your experiments.

To compare the performance of different algorithms, you should compare their best performance throughout the training process, not the performance at each epoch.
Your dataset is too small which may not support the training of a neural network, try bigger datasets around 100K should be enough.
To accelerate your training, you can implement your model with some deep learning packages (e.g. TensorFlow, Keras, Theano...) and run them with GPUs.

trimkaleci commented 6 years ago

@SunYatong Thank you very much for your support!

Regarding the performance of the algorithms, with "epoch" I mean splitting the dataset into certain number of partitions - one partition corresponds to certain period of time, and thus learning the time-dependent parameters. I set the number of epochs to different values, and see whether splitting the dataset into different number of periods affects the performance of the models. Does it make sense?
I am not training a neural network, but instead I use the image features already extracted through Deep CNN - where one image is represented by 4096 values (values ranging from 0 to 1). Then the author of the paper where TVBPR is presented, he introduces a K X F embedding matrix E to linearly embed the high-dimensional image feature vector into a much lower-dimensional (i.e., K, can be set to 20, and F being the number of image features, 4096) visual style space. Afterwards, through training, we learn the values of the embedding matrix E. Do you think that still I would need a bigger dataset, or I can still have comparable results with the dataset I am using (i.e. 4315 ratings)?

Thank you very much in advance!

SunYatong commented 6 years ago

Hi, @leo1023

Sorry, I misunderstood the meaning of "epoch". Your splitting is reasonable. It seems that your model outperforms TVBPR significantly. And if you can use the same dataset as TVBPR used, the results would be more reliable.
Secondly, I still recommend you to use TensorFlow, which has some convenient API to implement and train the Matrix Factorization algorithm with pairwise loss. Furthermore GPU can accelerate your training process a lot.

guoguibing / librec

Extending timeSVD++ with a visual component #227

epochs set to 5: AUC for timeSVDpp_visual_compnent is AUC = 0.7039, AUC for TVBPR is AUC = 0.8181

epochs set to 10: timeSVDpp_visual_compnent AUC = 0.8051, TVBPR AUC = 0.6830

epochs set to 20: timeSVDpp_visual_compnent AUC = 0.7872, TVBPR AUC = 0.6815