bkj opened this issue 6 years ago
Hi @bkj
I'm not sure I understand precisely what you're asking, but it wouldn't make sense in matrix factorization models to include items in the test metrics that don't appear in training, since those items would have no latent factors with which to make test predictions.
Instead, the test dataset should contain (user, item) pairs that do not appear in the training data, and evaluation computes ranking metrics per user on this subset of the data (filtering out all rows whose user or item did not appear in the training data). The metrics are averaged over all test users, and there's an option to use a smaller number of test users, since this can be costly when there are many users.
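For concreteness, a minimal sketch of that per-user protocol (not qmf's actual code; `score_fn`, `train_pairs`, `test_pairs`, and `all_items` are hypothetical names):

```python
import random
from collections import defaultdict

def precision_at_k(score_fn, train_pairs, test_pairs, all_items, k=10,
                   max_users=None, seed=0):
    """Average P@k over test users, keeping only test pairs whose user
    and item both appear in the training data."""
    train_users = {u for u, _ in train_pairs}
    train_items = {i for _, i in train_pairs}

    # Group held-out positives per user, dropping cold-start users/items.
    test_by_user = defaultdict(set)
    for u, i in test_pairs:
        if u in train_users and i in train_items:
            test_by_user[u].add(i)

    users = sorted(test_by_user)
    if max_users is not None:  # optional subsampling to keep evaluation cheap
        random.Random(seed).shuffle(users)
        users = users[:max_users]

    precisions = []
    for u in users:
        ranked = sorted(all_items, key=lambda i: score_fn(u, i), reverse=True)
        hits = sum(1 for i in ranked[:k] if i in test_by_user[u])
        precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```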
Hope this helps.
Alberto
To compute p@k, you take the top K predictions and look at the overlap between those predictions and the observed items in the test set. However, the top K predictions usually contain items that were observed in the train set, and so by definition are not in the test set. Usually I take the top K predictions AFTER filtering out user-item pairs that appear in the training set; otherwise p@k is artificially reduced. Does that make sense?
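Roughly what I mean, as a sketch (hypothetical names again; `train_by_user` maps each user to the items they interacted with in training): remove a user's training items from the candidate pool before taking the top K, so known items can't crowd out the held-out positives.

```python
def precision_at_k_filtered(score_fn, train_by_user, test_by_user, all_items, k=10):
    """Average P@k over test users, ranking only items the user has NOT
    already interacted with in the training data."""
    precisions = []
    for u, positives in test_by_user.items():
        seen = train_by_user.get(u, set())
        candidates = [i for i in all_items if i not in seen]  # drop training items
        top_k = sorted(candidates, key=lambda i: score_fn(u, i), reverse=True)[:k]
        hits = sum(1 for i in top_k if i in positives)
        precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0
```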
Gotcha, I wasn't aware of this optimization. Do you have pointers to papers/implementations discussing this?
-Alberto
No papers off the top of my head, but I know they do it in dsstne (and probably other places). A script to do the filtering is here.
On the example I'm running (movielens-1m), doing this filtering increases p@10 from ~0.1 to ~0.25 -- so it's a nontrivial improvement, and I think it's the right way to do evaluation.
~ Ben
Hmm, this might be worth including, but at the same time I'm not convinced it's the right way to do evaluation either. For example, it might artificially boost p@k for different users in different ways, depending on how many positive items each user has in training (because you would only filter positive items, not negatives that the user may have seen).
I'd be curious to know if there is a way to estimate P@k on held-out data that is theoretically justified.
-Alberto
Hi all --
In some other recommender systems, there's a flag to filter the items in the training set from the test metrics -- is there something like that in qmf?
That is, it doesn't make sense to compute p@k on the test set if we allow the top-k predictions to contain items that we observed in the train set, and therefore know won't appear in the test set.
Thanks