tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.

ScaNN evaluation gives better metrics than BruteForce evaluation #264

Open MaiziXiao opened 3 years ago

MaiziXiao commented 3 years ago

Following the documentation (https://www.tensorflow.org/recommenders/examples/efficient_serving#evaluating_the_approximation), I am trying to compare the performance of ScaNN evaluation against BruteForce evaluation.

Code for BruteForce evaluation:

item_embeddings = items_data.batch(batch_size).map(model.candidate_model)
# Override the existing streaming candidate source and turn on the compute_metrics flag.
model.task.factorized_metrics = tfrs.metrics.FactorizedTopK(
    candidates=item_embeddings
)
model.compute_metrics = True
# Need to recompile the model for the changes to take effect.
model.compile()
%time results_eval = model.evaluate(cached_test, return_dict=True)
results_eval

which gives me the result:

{'factorized_top_k/top_1_categorical_accuracy': 0.0016201249090954661,
 'factorized_top_k/top_5_categorical_accuracy': 0.006863438058644533,
 'factorized_top_k/top_10_categorical_accuracy': 0.01172381266951561,
 'factorized_top_k/top_50_categorical_accuracy': 0.03605514392256737,
 'factorized_top_k/top_100_categorical_accuracy': 0.05449511110782623,
 'loss': 9786.341796875,
 'regularization_loss': 0,
 'total_loss': 9786.341796875}

Code for ScaNN evaluation:

scann = tfrs.layers.factorized_top_k.ScaNN(num_reordering_candidates=1000)
scann.index(item_embeddings)
model.task.factorized_metrics = tfrs.metrics.FactorizedTopK(
    candidates=scann
)
model.compute_metrics = True
model.compile()
%time scann_result = model.evaluate(cached_test, return_dict=True)
scann_result

And I get:

{'factorized_top_k/top_1_categorical_accuracy': 0.004065040498971939,
 'factorized_top_k/top_5_categorical_accuracy': 0.012725344859063625,
 'factorized_top_k/top_10_categorical_accuracy': 0.019765524193644524,
 'factorized_top_k/top_50_categorical_accuracy': 0.05107811838388443,
 'factorized_top_k/top_100_categorical_accuracy': 0.07484976947307587,
 'loss': 9786.341796875,
 'regularization_loss': 0,
 'total_loss': 9786.341796875}

Is there any good reason why the approximate (ANN) approach would perform better than the exact BruteForce approach?
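As a sanity check, one can query both layers directly on the same batch and compare the returned scores; exact search should never return lower top-k scores than an approximate one. The query tower name (model.query_model) and the feature key ("user_id") below are placeholders for whatever the actual model uses:

# Hedged debugging sketch: score the same queries with both layers.
brute = tfrs.layers.factorized_top_k.BruteForce()
brute.index(item_embeddings)
scann_dbg = tfrs.layers.factorized_top_k.ScaNN(num_reordering_candidates=1000)
scann_dbg.index(item_embeddings)

features = next(iter(cached_test))                 # one test batch
queries = model.query_model(features["user_id"])   # placeholder names
bf_scores, _ = brute(queries, k=100)
ann_scores, _ = scann_dbg(queries, k=100)

# Scores come back sorted descending; approximate scores can match but should
# never exceed the exact ones, so any positive value here is a red flag.
print(tf.reduce_max(ann_scores - bf_scores))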

maciejkula commented 3 years ago

This is a little unexpected. Could you try using the BruteForce layer for your evaluation, for comparison?

MaiziXiao commented 3 years ago

@maciejkula After using the BruteForce layer I get the same result as with my first approach shown above. TBH, I don't really understand the difference between setting the candidates to an explicit BruteForce layer versus a raw dataset of item embeddings (see the sketch after the code below). Code I use:

brute_force = tfrs.layers.factorized_top_k.BruteForce()
brute_force.index(item_embeddings)
model.task.factorized_metrics = tfrs.metrics.FactorizedTopK(
    candidates=brute_force
)
model.compute_metrics = True
model.compile()
%time brute_force_result = model.evaluate(cached_test, return_dict=True)
brute_force_result
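For what it's worth, the two setups are meant to compute the same thing: when FactorizedTopK is handed a raw tf.data.Dataset of embeddings, it wraps it in a Streaming top-k layer that scores candidates batch by batch at evaluation time, while an indexed BruteForce layer materializes all candidate embeddings up front and scores each query with a single matmul. A rough sketch of the equivalence (the explicit Streaming usage is an assumption based on recent TFRS source, not something from this thread):

# Passing the dataset directly: candidates are streamed at metric time.
metric_from_dataset = tfrs.metrics.FactorizedTopK(candidates=item_embeddings)

# Roughly what happens under the hood, made explicit (assumed API usage):
streaming = tfrs.layers.factorized_top_k.Streaming()
streaming.index_from_dataset(item_embeddings)
metric_from_layer = tfrs.metrics.FactorizedTopK(candidates=streaming)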
MaiziXiao commented 3 years ago

@maciejkula I have a suspicion here that I hope you can confirm. While playing around with the model, I have a few times seen a top-100 accuracy of 1.0 on the validation set, i.e. a seemingly perfect model. After looking into the model, I realized that the candidate model produces the same embedding no matter what input it is given. That means that, whatever the validation data is, the true candidate always ends up counted at the top-1 position (since every item has the same embedding and therefore the same score).

I have been going through the source code (mostly the metric calculation: https://github.com/tensorflow/recommenders/blob/main/tensorflow_recommenders/metrics/factorized_top_k.py#L67 and the logic of the TopK layer: https://github.com/tensorflow/recommenders/blob/main/tensorflow_recommenders/layers/factorized_top_k.py#L242), but it is still not clear to me whether this is the issue. Am I right that if the outputs are all identical, the metrics evaluate to 1? That may also explain another issue I opened (https://github.com/tensorflow/recommenders/issues/263), since the top-k metric could then be misleading and not aligned with the validation loss.

Also, do you have any experience with why the outputs would all be the same regardless of the input? It looks to me like a tiny hyperparameter change can prevent this from happening (switching from Adam to Adagrad in my case).
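That suspicion is easy to test in isolation. As of the linked source, FactorizedTopK is built on Keras' TopKCategoricalAccuracy, which in turn uses tf.math.in_top_k, and in_top_k counts classes whose scores tie across the top-k boundary as hits. A collapsed model that gives every candidate the same score is therefore reported as perfect. A minimal, model-independent sketch:

import tensorflow as tf

# One query, 100 candidates, all with identical scores (a "collapsed" model).
scores = tf.ones((1, 100))
true_candidate = tf.constant([42])

# Ties that straddle the top-k boundary count as hits, so the true candidate
# is reported as being in the top 1 even though the scores carry no signal.
print(tf.math.in_top_k(true_candidate, scores, k=1))  # [True]

# A quick collapse check on the actual candidates from this thread:
# stacked = tf.concat(list(item_embeddings), axis=0)
# print(tf.math.reduce_std(stacked, axis=0))  # ~0 everywhere => collapsed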