hyp1231 / AmazonReviews2023

Scripts for processing the Amazon Reviews 2023 dataset; implementations and checkpoints of BLaIR: "Bridging Language and Items for Retrieval and Recommendation".
MIT License

Doubts related to this repo (REQUESTING URGENT HELP) #3

Open ishanki19-pixel opened 3 months ago

ishanki19-pixel commented 3 months ago

I have to build an urgent project on GNN-based recommendation, so please help clear up the doubts below:

  1. Could you please expand the README so that it explains everything in detail? For example, why are the validation timestamp and test timestamp defined as constants in one of the scripts? Please elaborate on small design choices like this too, so that beginners like me can follow them.

  2. Could you also tell me how to build a features file from the provided metadata? I am working on a GNN project for recommendation, and I do not understand how to convert your CSV files into txt files containing nodes, edges, and edge types.

hyp1231 commented 3 months ago

Thanks for your interest in our 2023 version dataset!

A1: Please refer to our paper (Section 3.1 Data Processing) and our website (Absolute-Timestamp Splitting) for why we use absolute timestamps and how we chose these specific ones.

To be specific:

Recommender systems in the real world only have access to interactions that occurred before a specific timestamp, and aim to predict future interactions. To better align with such scenarios, we split the reviews into training, validation, and test sets by absolute timestamps, rather than holding out the latest few interacted items of each user. Concretely, we find two timestamps that split all the reviews in a ratio of 8 : 1 : 1. These two timestamps are used to split data for both pretraining and all downstream evaluation tasks.

This strategy aligns with real-world scenarios but is not widely used in research. Researchers are encouraged to experiment with this splitting strategy. Specifically, given a chronological user interaction sequence of length N and the two split timestamps t_1 < t_2:

  • Training part: item interactions with timestamp range (-∞, t_1);
  • Validation part: item interactions with timestamp range [t_1, t_2);
  • Testing part: item interactions with timestamp range [t_2, +∞).
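The three ranges above can be sketched in a few lines of Python. This is only an illustration, not the repository's actual script: the function names `pick_split_timestamps` and `split_by_timestamp` are hypothetical, and the `timestamp` field name is an assumption about how each review record is stored.

```python
from typing import Dict, List, Sequence, Tuple

Review = Dict[str, object]  # assumed record shape: at least a "timestamp" key


def pick_split_timestamps(timestamps: Sequence[int]) -> Tuple[int, int]:
    """Pick (t_1, t_2) so that reviews fall roughly 8 : 1 : 1 into the parts.

    Hypothetical helper: sorts all timestamps and reads off the values at the
    80% and 90% positions. Requires a non-empty sequence.
    """
    ordered = sorted(timestamps)
    n = len(ordered)
    t_1 = ordered[n * 8 // 10]
    t_2 = ordered[n * 9 // 10]
    return t_1, t_2


def split_by_timestamp(
    reviews: List[Review], t_1: int, t_2: int
) -> Tuple[List[Review], List[Review], List[Review]]:
    """Split reviews by absolute timestamps instead of per-user leave-last-out."""
    train, valid, test = [], [], []
    for review in reviews:
        ts = review["timestamp"]
        if ts < t_1:        # (-inf, t_1)
            train.append(review)
        elif ts < t_2:      # [t_1, t_2)
            valid.append(review)
        else:               # [t_2, +inf)
            test.append(review)
    return train, valid, test
```

Because t_1 and t_2 are computed once over all reviews and then reused everywhere, they behave like constants in any script that consumes the split, which is why they appear hard-coded.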

Thanks also for your suggestion; we will update our README to make this clearer.

A2: The file format of our 2023 version is basically the same as that of our previous versions. We do not provide such scripts because they would be too specific to a particular model, so we suggest referring to the GitHub repositories of GNN-based recommendation methods that also use the Amazon Reviews datasets (even if they use the 2014/2018 versions, they should be easy to adapt to the 2023 version).
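As a rough starting point, a user-item bipartite graph can be built from an interaction CSV as sketched below. Everything here is an assumption for illustration: the column names `user_id` and `parent_asin`, the single edge type label `reviewed`, and the tab-separated output layout are placeholders, and the functions `build_bipartite_graph` and `format_edge_lines` are hypothetical. Each GNN codebase expects its own node/edge file format, so adapt the output to the one you use.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int, str]  # (user index, item index, edge type)


def build_bipartite_graph(
    rows: Iterable[Dict[str, str]],
) -> Tuple[Dict[str, int], Dict[str, int], List[Edge]]:
    """Map raw user/item IDs to contiguous integers and collect edges."""
    user_ids: Dict[str, int] = {}
    item_ids: Dict[str, int] = {}
    edges: List[Edge] = []
    for row in rows:
        # setdefault assigns the next free index the first time an ID is seen.
        u = user_ids.setdefault(row["user_id"], len(user_ids))
        i = item_ids.setdefault(row["parent_asin"], len(item_ids))
        edges.append((u, i, "reviewed"))  # placeholder edge type
    return user_ids, item_ids, edges


def format_edge_lines(edges: List[Edge], num_users: int) -> List[str]:
    """Render edges as 'src<TAB>dst<TAB>type' in one shared node-ID space.

    Item nodes are offset by num_users so user and item IDs do not collide.
    """
    return [f"{u}\t{num_users + i}\t{etype}" for u, i, etype in edges]


# Tiny in-memory example; replace io.StringIO(...) with open("your_file.csv").
sample = io.StringIO(
    "user_id,parent_asin,rating,timestamp\n"
    "U1,A1,5.0,100\n"
    "U2,A1,4.0,200\n"
    "U1,A2,3.0,300\n"
)
users, items, edges = build_bipartite_graph(csv.DictReader(sample))
edge_lines = format_edge_lines(edges, len(users))
```

Writing `"\n".join(edge_lines)` to a txt file gives one edge per line. Node features would come from a separate pass over the item metadata, keyed by the same `item_ids` mapping.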