hyp1231 / AmazonReviews2023

Scripts for processing the Amazon Reviews 2023 dataset; implementations and checkpoints of BLaIR: "Bridging Language and Items for Retrieval and Recommendation".
MIT License

categories field is empty. #12

Closed DataLama closed 2 weeks ago

DataLama commented 2 weeks ago

First of all, thank you for sharing such great data.

I wanted to see the distribution of product categories in the item metadata, so I checked the categories field, but it is empty for every record.

from datasets import load_dataset
ds = load_dataset('McAuley-Lab/Amazon-Reviews-2023', 'raw_meta_All_Beauty', split="full", trust_remote_code=True)

According to another issue (https://github.com/hyp1231/AmazonReviews2023/issues/7), it looks like you did collect category data, so I would like to know why it is missing here.

hyp1231 commented 2 weeks ago

Hi, thanks for your interest in our dataset!

I manually checked several items in the All_Beauty domain, and it seems that these items originally had no categories on the website (example 1, example 2, example 3). I guess that, for some reason, none of the items under All_Beauty were assigned categories.

Please let me know if you find any items with main_category == All_Beauty that do have categories on the corresponding webpage, thanks!

If you are wondering which part of the Amazon webpage is collected as the main_category or categories field, please refer to #7.
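To check this across a whole split rather than by hand, one option is to count how many records have an empty versus a non-empty categories list for each main_category. The following is only a minimal sketch: the helper name summarize_categories and the sample records are hypothetical. It assumes each record is a dict with a main_category string and a categories list, which is how the fields are described in this thread.

```python
from collections import Counter


def summarize_categories(records):
    """Count empty vs. non-empty `categories` lists per `main_category`.

    `records` is any iterable of dict-like items with the (assumed) keys
    `main_category` and `categories`.
    """
    summary = {}
    for rec in records:
        main = rec.get("main_category")
        cats = rec.get("categories") or []  # treat a missing/None field as empty
        counts = summary.setdefault(main, Counter())
        counts["non_empty" if len(cats) > 0 else "empty"] += 1
    return summary


# Hypothetical records for illustration only; not taken from the dataset.
sample = [
    {"main_category": "All_Beauty", "categories": []},
    {"main_category": "All_Beauty", "categories": []},
    {"main_category": "Hypothetical_Domain", "categories": ["A", "B"]},
]

result = summarize_categories(sample)
for main, counts in result.items():
    print(main, dict(counts))
```

With the ds object loaded in the first comment, the same helper could presumably be called as summarize_categories(ds), since iterating over the dataset yields one dict per item. Any main_category with a non-zero non_empty count would be worth reporting back here.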

DataLama commented 2 weeks ago

Thank you for your explanation. Unfortunately, there is nothing we can do if there is no data.