hyp1231 / AmazonReviews2023

Scripts for processing the Amazon Reviews 2023 dataset; implementations and checkpoints of BLaIR: "Bridging Language and Items for Retrieval and Recommendation".

Amazon-C4 Model Evaluation #13

Open nkn002 opened 1 month ago

nkn002 commented 1 month ago

Hi,

Thanks for the repo! One thing that is unclear to me is how you evaluated your models on the Amazon-C4 dataset. In the repo, you mention that "sampled_item_metadata_1M.jsonl contains ~1M items sampled from the Amazon Reviews 2023 dataset. For each <query, item> pair, we randomly sample 50 items from the domain of the ground-truth item. This sampled item pool is used for evaluation in the BLaIR paper. Each line is a JSON object." However, it remains unclear to me how the evaluation is done. If I understand correctly, each query has exactly one matching item, so how exactly are you calculating nDCG@100? Could you provide the code for that if possible? I couldn't find it in the repo.

Thanks!

hyp1231 commented 1 month ago

Hi, for each query, the candidate item pool consists of ~1M items. We calculate nDCG@100 to evaluate how effectively the model retrieves the one ground truth item from this ~1M item pool.
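Concretely, here is a minimal sketch of this protocol (illustrative only, not the exact evaluation script from the paper; it assumes precomputed dense embeddings and dot-product similarity, and the function names are placeholders):

```python
import numpy as np

def ndcg_at_k(rank, k=100):
    # With a single relevant item per query, IDCG = 1/log2(2) = 1, so
    # nDCG@k = 1/log2(rank + 1) if the item ranks within the top k, else 0.
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

def evaluate(query_embs, item_embs, gt_indices, k=100):
    # query_embs: (num_queries, dim); item_embs: (num_items, dim) over the full pool
    scores = query_embs @ item_embs.T  # similarity of each query to every candidate
    gt_scores = scores[np.arange(len(gt_indices)), gt_indices]
    # 1-based rank of the ground-truth item = 1 + number of items scored higher
    ranks = (scores > gt_scores[:, None]).sum(axis=1) + 1
    return float(np.mean([ndcg_at_k(r, k) for r in ranks]))
```

For a ~1M item pool you would batch the score computation rather than materialize the full matrix, but the metric itself is as above.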

By "sampling 50 items from the domain of the ground-truth item," we mean that the distribution of domains (or categories) is consistent between (1) the ground-truth items and (2) the candidate items.
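In other words, the candidate pool can be reconstructed roughly as follows (a sketch only; the JSONL field names 'item_id' and 'domain' are placeholders for whatever keys the file actually uses):

```python
import json
import random

def build_candidate_pool(metadata_path, gt_items, n_per_pair=50, seed=0):
    # gt_items: list of (item_id, domain) pairs, one per <query, item> pair
    random.seed(seed)
    items_by_domain = {}
    with open(metadata_path) as f:
        for line in f:
            item = json.loads(line)  # one JSON object per line
            items_by_domain.setdefault(item['domain'], []).append(item['item_id'])
    pool = set()
    for item_id, domain in gt_items:
        pool.add(item_id)  # the ground-truth item itself is always in the pool
        # draw candidates from the same domain, so the domain distribution
        # of the pool mirrors that of the ground-truth items
        pool.update(random.sample(items_by_domain[domain], n_per_pair))
    return pool
```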

Thank you for bringing this to our attention! We will update the setting description to make it clearer.

XiaoxinHe commented 3 days ago

Hi,

Thank you for the great work and for making this dataset available.

I was wondering if you could share the code used for the Amazon-C4 model evaluation. I have been trying to replicate the results from Table 7 but noticed a performance gap, which might be due to differences in experimental settings. For instance, I noticed that the maximum sequence length for training BLaIR is set to 64. Could you clarify the maximum sequence length used during evaluation?
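For context, this is roughly how I am encoding texts in my replication attempt, assuming the hyp1231/blair-roberta-base checkpoint and CLS-token pooling (please correct me if the evaluation setup differs):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('hyp1231/blair-roberta-base')
model = AutoModel.from_pretrained('hyp1231/blair-roberta-base')

texts = ['example query or item text']
# max_length=64 mirrors the training setting; unclear whether evaluation should match it
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=64, return_tensors='pt')
with torch.no_grad():
    emb = model(**inputs).last_hidden_state[:, 0]  # CLS-token embedding
    emb = emb / emb.norm(dim=1, keepdim=True)      # L2-normalize before dot product
```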

Additionally, would it be possible to provide the ESCI dataset linked with the Amazon Reviews 2023 dataset, along with the corresponding candidate pool? Having access to these would greatly facilitate reproducing the results.

Thank you in advance for your consideration and support.