tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Excluding previously seen items from test recommendations #113

Open nialloh23 opened 4 years ago

nialloh23 commented 4 years ago

In other libraries (e.g. LightFM) it's common to have a facility for excluding previously seen items from test/eval recommendations by passing a set of train_interactions into the evaluation method (see the LightFM approach). This mimics what we would do in production, and so gives us a more accurate read on evaluation results during offline training. Without this ability it is hard to know whether we are overfitting on the training data, since the eval results become much less meaningful when real performance is being masked.

I noticed that the tutorials explicitly state that this approach has not been adopted for TFRS and that we should "appropriately specify models to learn this behaviour automatically".

  1. What does an appropriately specified model look like in this instance? (e.g. do we need to capture sequence behaviour via RNNs, as in this paper? Would including timestamp data as context information in our queries capture this behaviour?) Is there an intention to provide more details on this?

  2. Is it envisaged that a provision will be made to exclude previously-interacted-with items, or is this a decision dictated by the modelling approach?

maciejkula commented 4 years ago

There are a couple of reasons why we do not offer this functionality by default:

  1. In many systems it is perfectly appropriate to re-recommend items, including videos (re-watches are common) and e-commerce items (re-purchases).
  2. In large systems, keeping a record of past items in memory is problematic: there could be hundreds of millions of users and items. This is easier for packages like LightFM, where the interaction matrix is always kept in memory.

In terms of having reliable offline results, I'm not sure this should give you overly optimistic results because of overfitting. If anything, it will make your results pessimistic if your test items are always new interactions.

Having said that, I understand that this is a common method of evaluation. Sadly, we haven't so far found a good way of implementing this in a general case. One option would be for the training dataset to contain, for each interaction, a record of the user's past interactions at that time. We could then exclude those from evaluation like so:


for (user_id, item_id, past_interaction_item_ids) in data:
    user_embeddings, item_embeddings = ...

    self.task.compute_loss(
        user_embeddings,
        item_embeddings,
        exclude_candidates=past_interaction_item_ids,
    )
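Building a dataset in that shape can be done offline, before training. A minimal sketch of one way to do it (a hypothetical helper, not part of TFRS): accumulate each user's past item ids while walking the interactions in timestamp order, snapshotting the history before each event.

```python
from collections import defaultdict

def add_interaction_history(interactions):
    """For each (user_id, item_id) pair, in time order, attach the tuple
    of items that user had already interacted with at that point."""
    seen = defaultdict(list)
    examples = []
    for user_id, item_id in interactions:
        # Snapshot the user's history *before* this interaction.
        examples.append((user_id, item_id, tuple(seen[user_id])))
        seen[user_id].append(item_id)
    return examples

data = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u1", "a")]
print(add_interaction_history(data))
# [('u1', 'a', ()), ('u1', 'b', ('a',)), ('u2', 'a', ()), ('u1', 'a', ('a', 'b'))]
```

The resulting tuples could then be fed into a tf.data pipeline, with the variable-length histories held as ragged tensors.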

Would that fit your use case?

nialloh23 commented 4 years ago

That all makes sense.

Yes, the solution you proposed would work for us. We could store each user's historical item interactions in an array alongside each (user, item) interaction pair.

I assume these would then be excluded from loss & metric calculations?

maciejkula commented 4 years ago

Yes, that's the idea.

I think this is quite tricky and labour-intensive to implement, so I'm going to add this to our backlog for prioritization (we're adding a lot of exciting features in our next release).

If you need this now, you could perform the evaluation yourself, roughly as follows (in loose pseudocode):

index = tfrs.ann.BruteForce(candidates)
metric = tf.keras.metrics.Mean()

for batch in test_data.batch(...):
    query, positive, seen = batch

    query_embedding = model.query(query)
    positive_embedding = model.candidate(positive)

    # Get the top-K unfiltered candidates.
    scores, ids = index(query)

    # Filter them using the `seen` ids.
    ...

    metric.update_state(positive in filtered_top_ids)
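The filtering step elided above can be sketched with plain TensorFlow ops. This is a hypothetical, self-contained illustration, not TFRS API: the candidate ids and scores are faked so the example can focus on the masking logic. Seen candidates get their scores pushed to negative infinity before the top-k cut.

```python
import tensorflow as tf

def filtered_top_k(scores, candidate_ids, seen_ids, k):
    """Drop previously seen candidates by setting their scores to -inf,
    then take the top-k of what remains."""
    # mask[i] is True where candidate_ids[i] appears in seen_ids.
    mask = tf.reduce_any(
        tf.equal(candidate_ids[:, None], seen_ids[None, :]), axis=1)
    masked_scores = tf.where(
        mask, tf.fill(tf.shape(scores), float("-inf")), scores)
    _, top_idx = tf.math.top_k(masked_scores, k=k)
    return tf.gather(candidate_ids, top_idx)

candidate_ids = tf.constant([10, 11, 12, 13, 14])
scores = tf.constant([0.9, 0.8, 0.7, 0.6, 0.5])
seen = tf.constant([10, 12])  # items this user has already interacted with
top = filtered_top_k(scores, candidate_ids, seen, k=2)
print(top.numpy().tolist())  # [11, 13]
```

One could then feed `positive in top` (or a vectorized equivalent) into the Mean metric to get a filtered top-k accuracy.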

Would this work for you?

nialloh23 commented 4 years ago

Thanks Maciej, I've tried to implement this approach in the short term. I managed to create the (query, positive, seen) batches, but ran into difficulties with the filtering process. The main complications arise from (1) ragged tensors, and (2) filtering after batching, which seems to introduce tensor shape mismatch issues that I cannot resolve.

I ended up resorting to a more manual implementation of the above, which processes one batch at a time (batch size 1) and converts to numpy arrays. It is slow, but it works.
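For what it's worth, one way to sidestep the ragged-tensor shape mismatches is to pad each user's ragged `seen` history with an id that can never match a real candidate, and then do the comparison densely across the whole batch. A hypothetical sketch (the literal ids are made up for illustration):

```python
import tensorflow as tf

# Top-k recommended ids per user, shape [batch, k].
top_ids = tf.constant([[10, 11, 12], [20, 21, 22]])
# Per-user "seen" histories of different lengths.
seen = tf.ragged.constant([[10], [21, 22]])

# Pad histories with an id (-1) that cannot match any real candidate,
# so the comparison can be broadcast densely over the whole batch.
dense_seen = seen.to_tensor(default_value=-1)  # shape [batch, max_len]
mask = tf.reduce_any(
    tf.equal(top_ids[:, :, None], dense_seen[:, None, :]), axis=-1)

# mask is True wherever a recommendation was already seen by that user.
print(mask.numpy().tolist())  # [[True, False, False], [False, True, True]]
```

The mask can then be used to drop or re-score the already-seen recommendations per row, without ever mixing ragged and dense shapes in the same op.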

Side note which may be of interest:
Without this functionality it's very hard to determine the optimal number of look-back days for your training dataset.
In my tests, as I increase the look-back period (e.g. from 30 days -> 60 days -> 90 days), my results get worse because historic interactions crowd actual new predictions out of the top-k accuracy results. Without this custom hack it is very hard to optimize the training process for any recommender system that doesn't want to recommend things which have already been watched, worn, bought, read, etc. in the past.

maciejkula commented 4 years ago

Thanks for looking into this, and thank you for motivating this use case well: I definitely see what you mean.

As you have discovered, the filtering is quite tricky to get right, but I think I have a way forward. I'll ping you here once I have some code to share.

nialloh23 commented 4 years ago

Thanks Maciej! I appreciate you taking the time to look into this

AlekseiKrukowski commented 3 years ago

Hello, is there a method to exclude watched movies / purchased items from test recommendations? I can't find one in the documentation, and this question is still open here.

yrianderreumaux commented 2 years ago

@maciejkula I am also interested in this solution.

yrianderreumaux commented 2 years ago

@AlekseiKrukowski Did you find any info on this?

nate-walter commented 2 years ago

Hey there @nialloh23, I'm working on a retrieval model using TFRS and am looking to exclude any products that a customer already owns (there are only ~45 products) during the validation/testing phase. It sounded like @maciejkula's solution worked for you. I was wondering if you'd be willing to share any pitfalls you encountered and overcame along the way in the meantime.