tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Sequential Recommendation #119

Closed: juliobguedes closed this issue 3 years ago

juliobguedes commented 3 years ago

There are a few independent implementations for sequential recommenders (sequence-aware, session-based, and so on) such as slientGe's repository. How do we perform such recommendations using this library?

Every example in the guides focuses on solving the matrix-completion problem, while this is not the objective in a sequential recommendation.

maciejkula commented 3 years ago

To use information about a user's sequence of interactions in TFRS, you need to write the appropriate user model.

For example, suppose your training dataset contains tuples of ([past_watches_titles, new_watch_title]), where past_watches_titles is a sequence of past watches for any user, and new_watch_title is what we are trying to predict.

If that's the case, you could set up a model like this and train and evaluate it exactly like any of the tutorial models:

class Model(tfrs.Model):

  def __init__(self):
    super().__init__()

    self._movie_model = tf.keras.layers.Embedding(...)

    self._user_model = tf.keras.Sequential([
      # Look up embeddings of past watches.
      tf.keras.layers.Embedding(...),
      # Summarize them using, for example, a recurrent layer.
      tf.keras.layers.GRU(...),
      # Perhaps more layers here.
    ])

    self._task = tfrs.tasks.Retrieval(...)

  def compute_loss(self, inputs, training=False):
    past_watches, new_watch = inputs

    new_watch_embedding = self._movie_model(new_watch)
    user_embedding = self._user_model(past_watches)

    return self._task(user_embedding, new_watch_embedding)

juliobguedes commented 3 years ago

Thanks for your reply, it is very enlightening. A few more questions regarding this topic:

Last but not least, once all these questions are clear, can I open a PR adding this to the documentation?

maciejkula commented 3 years ago

You may have to define for me what you mean by "dense layer" in your question.

I'll try to clarify in the meantime:

  1. If you're using the Retrieval task, you're building a two-tower retrieval model with an in-batch softmax loss, where user-item scores are given by the dot product of their embeddings (see the sketch after this list). If your embeddings are one-dimensional, the model will be of extremely poor quality. In the example I gave, the outputs of self._movie_model and self._user_model should be k-dimensional embeddings of size [batch_size, embedding_dim] (the GRU layer by default reduces along the sequence dimension). Common choices for k are between 64 and 256.
  2. If you're building a model that does binary classification (say, click prediction), you'd likely use the Ranking task, as per this tutorial.
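
For intuition, here is a minimal, self-contained sketch of an in-batch softmax over dot-product scores; it illustrates the idea only and is not the library's internal implementation:

import tensorflow as tf

# user_embeddings, item_embeddings: [batch_size, embedding_dim]
user_embeddings = tf.random.normal([4, 64])
item_embeddings = tf.random.normal([4, 64])

# Score every query in the batch against every candidate in the batch.
scores = tf.matmul(user_embeddings, item_embeddings, transpose_b=True)

# The true item for row i is the item in column i, so the labels are the identity matrix.
labels = tf.eye(4)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(labels, scores)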

Let's figure this out first - once we do, we can think about documenting this.

juliobguedes commented 3 years ago

By dense layer with one unit, I mean a Dense(1) as the output layer.

I am finally getting hands-on with deep learning, but almost everything I know is research-related, where many implementation details are left implicit. It may be more appropriate for me to explain my goal:

maciejkula commented 3 years ago

Got it. To the best of my knowledge, the example I wrote above gives you a more-or-less full implementation of GRU4Rec, assuming you transform your data into pairs of (tracks_up_to_track_at_t, track_at_t), so that you predict one future item at a time. You can think of it as always doing one-ahead prediction.
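
For concreteness, a minimal sketch of that transformation could look like the following; the function name and windowing scheme are illustrative assumptions, not something prescribed by the library:

def one_ahead_pairs(session, min_history=1):
    # Yield (history, next_item) pairs for one-ahead prediction.
    for t in range(min_history, len(session)):
        yield session[:t], session[t]

session = ["track_0001", "track_4978", "track_2312", "track_2220"]
pairs = list(one_ahead_pairs(session))
# pairs[0] == (["track_0001"], "track_4978"), and so on.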

If you're predicting the next item, you're (to a first approximation) fitting a retrieval model, for which you'd use the Retrieval task. It takes as inputs the user/query embedding as well as the embedding of the item you are trying to predict. Internally, it will compute (1) a loss that maximizes the likelihood of predicting the next item relative to all other items, and (2) some top-K accuracy metrics. Because it already computes a loss, you do not need any additional losses.

For your model, then, you need to create a Retrieval task and feed it the user embedding and the item embedding; both of these will be [batch_size, embedding_dim]. Your output layer is not, and cannot be, a Dense(1) layer; it has to be a reasonably wide embedding.

Looking at my example, your user_model and item_model both have to output embeddings of the same size, which are then fed to the task object. As long as that's true, you can have any layers you like in your submodels: in this case, we have recurrent layers in the user model (to consume an interaction history) and standard embeddings in the item model.

In terms of metrics, TFRS's default FactorizedTopK metrics will give you the likelihood that the true next item is in the top K items that you predict. This should be enough to get you started, and is quite efficient to compute.

anisayari commented 3 years ago

Thank you @maciejkula for all those clarifications. @juliobguedes I am currently working on the same use case, but for retail purchases. I will keep you updated if I put together a good notebook example, and will open a PR adding it to the examples if it can be useful for other people. :-)

juliobguedes commented 3 years ago

I spent some time debugging my code (I had a few bugs in my own implementation) but I understood your comments and I am trying to implement them. I am having one more problem: when I am training the model, I always get this error and the model never gets past the first epoch.

Epoch 1/10
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.

After these prints, I have no updates for quite some time, leading me to stop the execution.

Let me add a few things that may help to figure out the problem. As previously stated, my input is a sequence of track ids such as track_0001 track_4978 track_2312 ... track_2220. I am padding all sequences to the same length using pad_sequences, but pad_sequences only takes sequences of ints, so I have to apply the vocab earlier. Something like this:

# Assumed imports (StringLookup moved to tf.keras.layers in newer TF versions):
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab = StringLookup(vocabulary=vocabulary)
sequences = [vocab(seq) for seq in input_seq]
padded = pad_sequences(sequences, maxlen=seq_len)

After this process:

  1. My input is [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 695454 984538 1003339 402440 1172937]
  2. My output still is track_2312

I have coded the model similarly to your example:

class GRU4Rec(TfrsModel):
    """
    Defines a GRU4Rec Tensorflow model. This model was initially
    described by Hidasi et al in ...
    """
    def __init__(
        self, ds, embedding_size=100,
        gru_units=100, gru_activation='tanh',
        dense_units=100, dense_activation='softmax',
        loss='cross-entropy'):

        super().__init__()
        self.embedding_size = embedding_size
        self.gru_units = gru_units
        self.gru_activation = gru_activation
        self.dense_activation = dense_activation
        self.vocab = ds.vocab
        self.seq_length = ds.sequence_length
        self.num_users = ds.num_users
        self.num_items = ds.num_items
        self._build_model(ds.candidates())

    def _build_model(self, candidates):
        self.user_model = Sequential([
            Embedding(self.num_items, self.embedding_size, input_length=self.seq_length),
            GRU(self.gru_units, activation=self.gru_activation), # HIDDEN LAYER
        ], name='User_Model')
        self.item_model = Sequential([
            self.vocab,
            Embedding(self.num_items, self.embedding_size)
        ], name='Item_Model')
        self.task = Retrieval(metrics=FactorizedTopK(candidates.map(self.item_model)))

    def compute_loss(self, inputs, training=False):
        watching_history, next_item = inputs

        history_embedding = self.user_model(watching_history)
        next_item_embedding = self.item_model(next_item)

        return self.task(history_embedding, next_item_embedding)

So, why isn't my model fitting? Please let me know if I should add any other details.

maciejkula commented 3 years ago

Thanks for the detailed description: would you mind putting it into a colab with some dummy data so that I could run it?

maciejkula commented 3 years ago

Off the top of my head, this is what I think might be happening:

  1. The dataset (ds) you use for your candidates is all of your data instead of a dataset that contains each of your tracks once: this would make the evaluation very slow.
  2. If you have a large dataset, it makes sense not to compute evaluation metrics (which are expensive) during training, but only during evaluation. To do so, call return self.task(history_embedding, next_item_embedding, compute_metrics=not training) on the last line of compute_loss.
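
Applied to the compute_loss method of the GRU4Rec class posted above, that change is a one-liner (sketch only, reusing the names from that class):

    def compute_loss(self, inputs, training=False):
        watching_history, next_item = inputs

        history_embedding = self.user_model(watching_history)
        next_item_embedding = self.item_model(next_item)

        # Skip the expensive FactorizedTopK metrics during training;
        # model.evaluate(...) will still compute them.
        return self.task(history_embedding, next_item_embedding,
                         compute_metrics=not training)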

Please give these a go and let me know!

juliobguedes commented 3 years ago

Understood. I am doing this right now and will get back to you soon.

juliobguedes commented 3 years ago

Ok, sorry for taking so long. Here is the link for the colab: https://colab.research.google.com/drive/1pmhsZxG_rVij7hT3qCt-_8xARb-Aq4aF?usp=sharing

I was able to make a similar dataset (although way smaller) using the Last.fm 1K users dataset. Using Colab, it got past the first epoch, but the results are still not correct.

maciejkula commented 3 years ago

Thanks for the notebook!

When you say that the results are not correct, what do you mean?

juliobguedes commented 3 years ago

Apparently only the loss changes; the metrics are not being computed (I guess?).

Ignore my last statement. The metrics are being computed; they just weren't visible due to the param you pointed out earlier.

A new question: To predict the next item, how would I use the user_model and item_model? Do I have to add dense layers as in here?

maciejkula commented 3 years ago

Right, that's because we turned off the metric computation during training - if you run model.evaluate(...), the metrics will be computed.

There are a couple of things wrong here:

  1. When creating the candidates dataset, your elements are vectors of length one instead of scalars. Replace np_candidates = np.reshape(list(self.vocab_set), (-1, 1)) with np_candidates = list(self.vocab_set).
  2. When passing candidates to FactorizedTopK, you should batch it. Right now, it is processing one element at a time; with the fix above, you can batch it at 1024 or larger. This should really help with speeding up evaluation (so candidates=candidates.batch(1024).map(self.item_model)).
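
Putting the two fixes together, the relevant part of _build_model might look like the sketch below; the construction of the candidates dataset from vocab_set with tf.data.Dataset.from_tensor_slices is an assumption, since that code is not shown in the thread:

    np_candidates = list(self.vocab_set)  # scalars, not length-1 vectors
    candidates = tf.data.Dataset.from_tensor_slices(np_candidates)
    self.task = Retrieval(
        metrics=FactorizedTopK(
            # Batch before mapping so the item model embeds 1024 ids at a time.
            candidates.batch(1024).map(self.item_model)))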

With these two changes I get reasonable-looking evaluation metrics, with the whole evaluation loop taking about 30 seconds.

juliobguedes commented 3 years ago

Nice, that's awesome. I am now able to run it; it still takes some time since I am not running it on a GPU.

Considering a real research scenario: I have the ground-truth for the next K predictions in my datasets and would like to compute metrics such as Recall@K or NDCG@K. How would I do that?

Thank you so much for your help so far, I'm really learning a lot here :)

maciejkula commented 3 years ago

If you're content with doing one-ahead predictions, you can pass custom metrics to FactorizedTopK: have a look at https://github.com/tensorflow/recommenders/issues/118.
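
A sketch of what that can look like, assuming TF Ranking is installed for an MRR metric; the metrics argument follows the usage in issue #118 (and the code later in this thread), and exact signatures may differ across tfrs versions:

import tensorflow_ranking as tfr

metrics = FactorizedTopK(
    candidates.batch(1024).map(self.item_model),
    metrics=[
        tf.keras.metrics.TopKCategoricalAccuracy(k=10),
        tfr.keras.metrics.MRRMetric(),
    ])
self.task = Retrieval(metrics=metrics)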

To predict the next item, I'd suggest using the BruteForce layer: docs.
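
A minimal serving sketch with BruteForce; the method names follow the current tfrs documentation and may differ in older releases, and candidates is assumed to be a dataset of unique track ids:

index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    candidates.batch(1024).map(lambda ids: (ids, model.item_model(ids))))

# Query with a batch of padded listening histories; returns (scores, track ids).
scores, titles = index(padded_histories, k=10)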

juliobguedes commented 3 years ago

So, I was able to use the BruteForce layer based on the tutorial examples. I am using the very same model I posted earlier in the Colab. One weird thing keeps happening: while my training loss decreases, my validation loss increases, and I can only think that this is related to these messages:

Epoch 1/10
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:Gradients do not exist for variables ['counter:0'] when minimizing the loss.

What causes them? They are even present in the quickstart's last cell output, and the regularization loss is always regularization_loss: 0.0000e+00

(plot: loss)

maciejkula commented 3 years ago

How are you computing your validation loss? Are the test metrics on the validation set also deteriorating over time?

I suspect that regularization_loss is a red herring. It'll only be non-zero if you add any regularization to your model.

juliobguedes commented 3 years ago

As we discussed earlier, I am using the last N interactions to predict the (N+1)th interaction. To create a validation set, I applied the same idea, using the last N-1 interactions to predict the Nth during training, and the last N to predict the (N+1)th during validation.

To ensure that the parameter compute_metrics was not interfering, I turned it off and increased the number of epochs before plotting the loss and the metrics (I am using MRR). You can check the outputs below.

(plots: mrr_metric, loss)

maciejkula commented 3 years ago

It looks like your validation doesn't run at all - certainly the metrics aren't updated (the symptom is the metric values being all zero). The metrics are computed on the training set, however.

Did you set compute_metrics = not training?

What do you get if your run it explicitly via model.evaluate(validation_set)?

juliobguedes commented 3 years ago

I didn't set compute_metrics = not training. When I ran model.evaluate(test_set), which is roughly the same as evaluating the validation set, I got this:

57/57 [==============================] - 32s 570ms/step - mrr_metric: 0.5678 - loss: 43.4940 - regularization_loss: 0.0000e+00 - total_loss: 43.4940 - val_mrr_metric: 0.0119 - val_loss: 414.3062 - val_regularization_loss: 0.0000e+00 - val_total_loss: 414.3062
Epoch 20/20
57/57 [==============================] - 33s 571ms/step - mrr_metric: 0.5732 - loss: 34.5740 - regularization_loss: 0.0000e+00 - total_loss: 34.5740 - val_mrr_metric: 0.0120 - val_loss: 412.2416 - val_regularization_loss: 0.0000e+00 - val_total_loss: 412.2416
 Starting Evaluation
57/57 [==============================] - 10s 177ms/step - mrr_metric: 0.0103 - loss: 142.7616 - regularization_loss: 0.0000e+00 - total_loss: 142.7616

I'm trying to run again using top-K instead of MRR.

maciejkula commented 3 years ago

The numbers you have here suggest severe overfitting - the MRR metric is much higher on the training set than the evaluation set.

I'm trying to disentangle two things here:

  1. Is there something wrong with the computation of validation metrics? If you have compute_metrics=True and pass your validation set to model.fit via the validation_data argument, I would expect both the training and validation curves to be non-flat.
  2. Is there something wrong with the model, where it overfits on the training data?

If (1) doesn't happen, it might be that there is a bug in the library. (2) is a tuning/modelling issue.
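
For (1), the wiring is the standard Keras validation_data argument; a sketch with placeholder dataset names:

model.fit(
    train_ds.batch(128),
    validation_data=val_ds.batch(128),
    epochs=20)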

juliobguedes commented 3 years ago

I thought it might be overfitting, but I just can't understand how or why: as you can see in the plot, I am using only 20 epochs, and by the 5th the validation loss is already increasing.

My model is almost the same as the one you posted in your very first comment:

In the plots from my last comment, compute_metrics was set to default (True).


I ran again using the same data and topk, and these are the results:

(plots: top_k_categorical_accuracy, loss)

You can check the complete logs here. I have my code in a private GitHub repo, but I can make it public so you can see it, if necessary.

maciejkula commented 3 years ago

Cool, thanks for the details.

I'm afraid this does look like overfitting!

A couple of things to try:

  1. Get a bigger dataset! This one looks very small.
  2. Change the embedding size to something much smaller. Start with 32?
  3. Add regularization.
  4. Perhaps there's something wrong with your data: with sequential data it's easy to accidentally allow the model to cheat in training.

juliobguedes commented 3 years ago

I was running with a sample of 1% of the dataset, but I had around 5700 examples for training and validation, and 2k examples for testing. I have now changed to 10% of the dataset, with 72k examples for training and validation, and 18k examples for testing. I also followed your suggestion and reduced the embedding dim to 32. It will take some time before I can get back to you.

About the 3rd point, how do I add regularization to a single GRU layer? Dropout isn't appropriate for this case, if I remember correctly. Perhaps dropout between the GRU units would be appropriate, but I would have to create the layer manually, right? (I figured it out already.)
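
For reference, one way to add regularization and dropout directly to the sequential user model; a sketch only, with placeholder hyperparameter values:

user_model = Sequential([
    Embedding(num_items, 32,
              embeddings_regularizer=tf.keras.regularizers.l2(1e-6)),
    GRU(32,
        dropout=0.2,             # dropout on the input connections
        recurrent_dropout=0.2),  # dropout on the recurrent connections
], name='User_Model')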

About the 4th point, I believe the data is correct; I have checked it multiple times. But if I still can't solve the problem, I'll get back to that.

juliobguedes commented 3 years ago

So, I ran again following your comments, and you can see the complete logs here. There is a memory error at the end, but it's not related to the model itself; I have already tried to fix it and am waiting for the output to see whether the fix worked, so we can ignore it.

As you stated, I have now come to agree that the model is overfitting easily, which would seem odd if I were using the entire dataset rather than a sample of it. Here are the results:

(plots: top_k_categorical_accuracy, loss)

If the validation loss makes sense to you as simply overfitting, I am OK with closing this issue, but I wonder whether overfitting alone would explain the validation top-K not increasing considerably (epoch 1 was 0.0002, epoch 10 was 0.0031).

anisayari commented 3 years ago

Hi @juliobguedes, I am currently facing the same kind of issue with a similar architecture. Do you have any updates about the way you worked around this problem? Thank you.

juliobguedes commented 3 years ago

Hi @anisayari, I have worked with this architecture using TF Recommenders for almost 2 months, but I was not able to achieve any results. I tried changing the architecture (increasing the number of GRU layers, adding regularization, and so on) and increasing the dataset, but none of the changes seemed to result in metrics better than 10^-4.

If you find any results better than this, I'd like to understand your thoughts and implementation.

maciejkula commented 3 years ago

This is somewhat surprising to me. Julio's code here seems solid, and the evaluation metrics on the training set make sense.

Julio, how do you do your train/test split? Is it possible that the candidates you are trying to predict in the test set differ significantly from your training set? It could be that the majority of the test set targets aren't in the training set. For example, if you perform your train/test split by time, the test candidates may be mostly new items not represented in the training set. You could have a look at the 20 most popular targets in the train and test sets; if they do not overlap, you have a lot of distribution shift.

One other thing to look at is your vocabulary construction. What proportion of the test set maps to the OOV bucket?
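
Both checks take only a few lines; a sketch, assuming train_targets and test_targets are lists of target track ids and vocab is the StringLookup layer used above:

from collections import Counter

# Overlap of the 20 most popular targets in train vs. test.
top_train = {t for t, _ in Counter(train_targets).most_common(20)}
top_test = {t for t, _ in Counter(test_targets).most_common(20)}
print("popular-target overlap:", len(top_train & top_test))

# Fraction of test targets that map to the OOV bucket.
oov_index = vocab("definitely-not-a-real-track-id").numpy()
oov_count = sum(int(vocab(t).numpy() == oov_index) for t in test_targets)
print("OOV fraction:", oov_count / len(test_targets))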

juliobguedes commented 3 years ago

Let me explain my entire process.

  1. I started using the Spotify MPD, but it was taking a long time to run, so I switched to my second choice, Last.fm 1k, which has:
    • 992 unique users
    • 1,083,472 unique items
    • 19,150,868 unique events
  2. I applied a layer of preprocessing, removing users with a small number of events, and took a sample of 10% of the remaining dataset, since I was not able to run it for a long time. In the end, I had 629 unique users, 240,195 unique items, and 880,992 unique events.
  3. Since I was targeting a Session-based Sequential Recommendation, I structured the dataset as I mentioned earlier in this issue: a string of space-separated ids, such as track_0001 track_4978 track_2312 ... track_2220, and 18k sessions were generated from that 10% of the data.
  4. Considering that in Session-based Recommendation the models are user-agnostic (i.e. they only consider the sequence of items, regardless of user identification), I tried 3 different strategies for the train/validation/test split, considering a sequence S of length k:
    • Last N Interactions: N is an input parameter that defines the length of the item sequence used as input, so the sequence [s_{k-N-3}, ..., s_{k-3}] is used to predict s_{k-2} during training, [s_{k-N-2}, ..., s_{k-2}] to predict s_{k-1} during validation, and [s_{k-N-1}, ..., s_{k-1}] to predict s_{k} during test.
    • Previous: Last N with N = 1.
    • Sliding Window: used a sliding window to increase the number of samples, keeping the (w-1)th window for validation and the wth for testing.
  5. After some bad results, I realized that I was also facing the cold-start problem, and added another preprocessing step to remove any test-set examples whose target item does not exist in the training/validation set.
  6. The results remained as bad as before, but I had no more time to keep trying different settings and changes.

I have not checked the distribution of items between training/validation and testing. I still have to modify a few things in my code before open-sourcing it, but that was it.

One very important thing in Sequential Recommendation is to use the proper losses, but they aren't yet implemented in TF Recommenders. I tried the losses implemented in TF Ranking, but the results were the same.

Please let me know if I can help with something else.

maciejkula commented 3 years ago

Thank you for the detailed write-up!

In the end, how many training sequences did you have? Based on the dataset statistics you shared, I would say that your data is simply too sparse to train an effective model. You get great training accuracy because GRUs + embeddings are excellent at memorization - but you will get terrible test accuracy, because you will be asking the model to extrapolate far beyond what you have seen.

The Movielens 20M dataset has 20M interactions with 27,000 movies; your dataset has 20M interactions with 1M items. This makes your dataset 40 times sparser.

How many sequences did you have in your largest dataset? It's hard to say what would be a sufficient number of observations here, but I would look to have at least 100M sequences.

@anisayari I think Julio's approach here is good, and you could follow his steps. I think the main problem is simply not enough data.

@juliobguedes what do you mean when you say "proper losses"? The in-batch softmax loss implemented by the Retrieval task is a very good choice here, and I would be surprised to see any improvement from TF Ranking losses.

juliobguedes commented 3 years ago

I am explaining a lot of things as I understood them from the papers and code I read. You may know this better than me and not need the explanation; if so, I'd appreciate any corrections.

One of the ideas of Sequential Recommendation, when avoiding matrix-based approaches (matrix factorization and so on), is that it is also able to solve the recommendation problem on sparse datasets. We can see this in the GRU4Rec paper, which runs its experiments on 2 datasets, one not that sparse (31M interactions with 37k items) and another sparse (13M interactions with 330k items), yet the performance is better on the sparser one. Link to the paper.

My comment about a proper loss is that I don't see how a matrix factorization loss/learning would help the sequential recommendation problem, since I don't understand how FactorizedTopK works and still used it as a black-box layer. I only have limited knowledge about this; the GRU4Rec paper also tried softmax as its loss but found TOP1 and BPR to achieve better results. Sorry that I cannot contribute more on this point.

Let me know if there is anything else that I can help with.

ydennisy commented 3 years ago

@juliobguedes @anisayari it has been some time since this issue was opened, but I was wondering if you solved your problems, and if yes - how?

From reading this, and watching the GRU talk, a thought comes to mind - could the issue be with the user model? The idea is that sessions are anonymous, so learning an embedding per user (though in reality this is a session) could lead to severe overfitting?

@maciejkula what are your thoughts on this? If the above could be the issue - how would you maybe suggest to "smooth" the sessions?

Thanks in advance!

YannisPap commented 3 years ago

@juliobguedes I had a look at your code because I'm trying a similar implementation. I may have found a mistake in your code.

def _build_model(self, candidates):
    self.user_model = Sequential([
        Embedding(self.num_items, self.embedding_size, input_length=self.seq_length),
        GRU(self.gru_units, activation=self.gru_activation), # HIDDEN LAYER
    ], name='User_Model')
    self.item_model = Sequential([
        self.vocab,
        Embedding(self.num_items, self.embedding_size)
    ], name='Item_Model')
    metrics = [TopK(), tfr.keras.metrics.MRRMetric()]
    topk = FactorizedTopK(candidates.batch(1024).map(self.item_model), metrics=metrics)
    self.task = Retrieval(metrics=topk)

Since the input to your user_model is a sequence of past items, you should apply the same vocabulary there. Otherwise, the item indices in the user and item models are unrelated. In other words, I believe you should add self.vocab before the Embedding layer in your user_model.
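
A sketch of the suggested change; whether it is actually needed depends on whether the StringLookup was already applied during preprocessing (before pad_sequences), as discussed earlier in the thread:

self.user_model = Sequential([
    self.vocab,  # same StringLookup used in item_model
    Embedding(self.num_items, self.embedding_size),
    GRU(self.gru_units, activation=self.gru_activation),
], name='User_Model')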

@maciejkula if you would be so kind, please let me know if I'm right or I have got it all wrong.