MaiziXiao opened this issue 3 years ago
This is a little unexpected. Could you try using the BruteForce layer for your evaluation, for comparison?
@maciejkula After using the BruteForce layer I get the same result as my first approach, shown above. TBH, I don't really understand the difference between setting candidates to an explicit BruteForce layer versus a set of item embeddings (see the comparison sketch after the snippet below).
Code I use:
import tensorflow_recommenders as tfrs

# Build an exact (brute-force) retrieval index over the candidate embeddings.
brute_force = tfrs.layers.factorized_top_k.BruteForce()
brute_force.index(item_embeddings)

# Point the retrieval task's metric at the indexed layer and recompile.
model.task.factorized_metrics = tfrs.metrics.FactorizedTopK(
    candidates=brute_force
)
model.compute_metrics = True
model.compile()

%time brute_force_result = model.evaluate(cached_test, return_dict=True)
brute_force_result
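For reference, here is roughly what the two candidate configurations mentioned above look like side by side. As far as I can tell from the FactorizedTopK source, a plain tf.data.Dataset of candidate embeddings gets wrapped in a streaming (exhaustive) scorer internally, so both paths should compute exact scores and produce matching metrics. The name candidate_embedding_dataset below is a placeholder; item_embeddings is taken from the snippet above.

import tensorflow_recommenders as tfrs

# Option 1: pass a tf.data.Dataset of candidate embeddings directly; the metric
# wraps it in a streaming top-k layer internally and scores candidates exhaustively.
metric_from_dataset = tfrs.metrics.FactorizedTopK(
    candidates=candidate_embedding_dataset  # placeholder, e.g. items.batch(128).map(model.item_model)
)

# Option 2: pass an explicitly indexed BruteForce layer, as in the snippet above;
# this also scores every candidate exactly, so the two should give identical results.
brute_force = tfrs.layers.factorized_top_k.BruteForce()
brute_force.index(item_embeddings)
metric_from_layer = tfrs.metrics.FactorizedTopK(candidates=brute_force)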
@maciejkula I have a suspicion here; I hope you can confirm my thinking. While playing around with the model, I have seen a few times that my top-100 accuracy on the validation set is 1, which would mean a perfect model. After looking into the model, I realized that no matter what input I give it, the candidate embedding is always the same. That means that whatever the validation example is, its candidate is always at the top-1 position (since every item has the same embedding, all scores are tied).

I tried to go through the source code (mostly the metric calculation: https://github.com/tensorflow/recommenders/blob/main/tensorflow_recommenders/metrics/factorized_top_k.py#L67 and the logic of the TopK layer: https://github.com/tensorflow/recommenders/blob/main/tensorflow_recommenders/layers/factorized_top_k.py#L242), but I have to say it is still not very clear to me whether this is the issue. I am wondering whether it is true that, if the outputs are all the same, the metrics will be 1. That may also explain another issue I opened (https://github.com/tensorflow/recommenders/issues/263), since the TopK metric could then be misleading and not aligned with the validation loss.

Also, do you have any experience with why the output would be the same regardless of the input? It looks to me like changing the hyperparameters slightly can prevent this from happening (switching from Adam to Adagrad fixed it in my case).
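If I am reading the linked metric code correctly, FactorizedTopK concatenates the positive score with the retrieved top-k scores and ultimately relies on top-k comparisons (via tf.keras.metrics.TopKCategoricalAccuracy in the version I looked at) where a tied positive score still counts as retrieved, because the underlying in_top_k op treats ties that straddle the top-k boundary as hits. The toy example below (plain Keras, not TFRS code) is a minimal sketch of that tie behaviour: with every score identical, the metric reports a perfect 1.0.

import tensorflow as tf

num_candidates = 10
# Every candidate gets the same score, as would happen if all item embeddings
# collapsed to a single vector; the "true" item sits at an arbitrary index.
scores = tf.ones((1, num_candidates))
y_true = tf.one_hot([3], depth=num_candidates)

metric = tf.keras.metrics.TopKCategoricalAccuracy(k=5)
metric.update_state(y_true, scores)
print(metric.result().numpy())  # 1.0: tied scores at the boundary count as in the top k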
Following the documentation (https://www.tensorflow.org/recommenders/examples/efficient_serving#evaluating_the_approximation), I am trying to compare the performance of ScaNN evaluation against BruteForce evaluation.
Code for BruteForce evaluation:
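(The snippet from the original issue is not preserved in this thread. A minimal sketch of an exhaustive evaluation along the lines of the linked tutorial is shown below; candidate_embedding_dataset, model and cached_test are assumed names, not the original code.)

import tensorflow_recommenders as tfrs

# Evaluate against the full candidate set, scored exhaustively.
model.task.factorized_metrics = tfrs.metrics.FactorizedTopK(
    candidates=candidate_embedding_dataset  # tf.data.Dataset of batched candidate embeddings
)
model.compile()

%time exact_result = model.evaluate(cached_test, return_dict=True)
exact_result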
which gives me the following result:
{'factorized_top_k/top_1_categorical_accuracy': 0.0016201249090954661,
 'factorized_top_k/top_5_categorical_accuracy': 0.006863438058644533,
 'factorized_top_k/top_10_categorical_accuracy': 0.01172381266951561,
 'factorized_top_k/top_50_categorical_accuracy': 0.03605514392256737,
 'factorized_top_k/top_100_categorical_accuracy': 0.05449511110782623,
 'loss': 9786.341796875,
 'regularization_loss': 0,
 'total_loss': 9786.341796875}
Code for ScaNN evaluation:
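(Again, the original snippet is not preserved here. A minimal sketch of the ScaNN-based evaluation, with the same assumed names and default ScaNN parameters, might look like this; the actual index configuration used in the issue is unknown.)

import tensorflow_recommenders as tfrs

# Build an approximate (ScaNN) index over the same candidate embeddings
# and point the metric at it instead of the exact scorer.
scann = tfrs.layers.factorized_top_k.ScaNN()
scann.index(item_embeddings)

model.task.factorized_metrics = tfrs.metrics.FactorizedTopK(
    candidates=scann
)
model.compile()

%time scann_result = model.evaluate(cached_test, return_dict=True)
scann_result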
And I get:
{'factorized_top_k/top_1_categorical_accuracy': 0.004065040498971939,
 'factorized_top_k/top_5_categorical_accuracy': 0.012725344859063625,
 'factorized_top_k/top_10_categorical_accuracy': 0.019765524193644524,
 'factorized_top_k/top_50_categorical_accuracy': 0.05107811838388443,
 'factorized_top_k/top_100_categorical_accuracy': 0.07484976947307587,
 'loss': 9786.341796875,
 'regularization_loss': 0,
 'total_loss': 9786.341796875}
Is there any good reason why the approximate (ScaNN) approach would perform better than the BruteForce approach?