Hi, I have a question about data preprocessing. Why do you use different procedures for different datasets? For example, for Diginetica you didn't remove immediate repeats, and for Gowalla and Lastfm short sessions were not removed. Thanks in advance.
There are several reasons. First, Diginetica already comes with sessions while the other two datasets don't, so two different pipelines were needed. Second, the preprocessing in SRGNN was referenced for Diginetica and that in RepeatNet for the other two datasets, which is why one pipeline doesn't remove immediate repeats while the other does. Third, Diginetica is a relatively small dataset, so fewer filtering steps are applied; otherwise Diginetica would end up with far fewer items. Short sessions were in fact removed in Gowalla and Lastfm, by the filter_until_all_long_and_freq function.
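For readers unfamiliar with that function, here is a minimal sketch of what a filter_until_all_long_and_freq-style fixed-point filter might look like. The thresholds (min_len, min_freq) and the representation of sessions as lists of item IDs are illustrative assumptions, not the repository's exact code.

```python
from collections import Counter

def filter_until_all_long_and_freq(sessions, min_len=2, min_freq=5):
    """Alternately drop infrequent items and short sessions until
    both constraints hold at the same time (a fixed point).
    Thresholds here are illustrative, not the repo's settings."""
    while True:
        # Count item occurrences over the surviving sessions.
        freq = Counter(item for s in sessions for item in s)
        # Remove infrequent items from every session.
        sessions = [[i for i in s if freq[i] >= min_freq] for s in sessions]
        # Remove sessions that became too short.
        sessions = [s for s in sessions if len(s) >= min_len]
        # Stop once no infrequent item and no short session remains.
        freq = Counter(item for s in sessions for item in s)
        if all(c >= min_freq for c in freq.values()) and \
           all(len(s) >= min_len for s in sessions):
            return sessions
```

Because each pass can only shrink the data, the loop terminates; a single pass, by contrast, can leave items whose counts dropped below the threshold after short sessions were removed.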
Thanks again. So basically, the preprocessing procedures followed the previous studies SRGNN and RepeatNet. Regarding the third reason you mentioned: if you filter the items and sessions only once, there will still be infrequent items and short sessions in the dataset. In my view, the dataset still needs to be cleaned. I am curious how different preprocessing would affect the performance of the model.
We can always filter short sessions as the last step, so no short sessions are left. But there are indeed some infrequent items. Intuitively, that should harm performance, because the item set is larger (making it harder for the prediction to hit at top K) and there are more cold-start items.
Agreed. It feels like models could perform better if we applied the filtering recursively. Then my next question is: why don't you filter out the infrequent items if they harm performance?
I just found that if we recursively filter Diginetica, the numbers of sessions and items are not much smaller. Here are the stats of the recursive approach compared with the current approach:

| Statistic | Recursive approach | Current approach |
|---|---|---|
| Training: No. of Clicks | 901370 | 905471 |
| Training: No. of Sessions | 187923 | 188636 |
| Training: No. of Items | 41607 | 42596 |
| Test: No. of Clicks | 75846 | 76149 |
| Test: No. of Sessions | 15902 | 15955 |
| Test: No. of Items | 20724 | 20936 |
So I remembered it wrong. I guess I merely wanted to follow the previous work at that time.
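A quick way to reproduce counts like these: the three statistics are simple aggregates over the preprocessed sessions. A minimal sketch, assuming sessions are represented as lists of item IDs:

```python
def dataset_stats(sessions):
    """Return total clicks, number of sessions, and number of
    distinct items for a list of sessions (lists of item IDs)."""
    n_clicks = sum(len(s) for s in sessions)
    n_sessions = len(sessions)
    n_items = len({item for s in sessions for item in s})
    return n_clicks, n_sessions, n_items
```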
Hi, I ran the LESSR model on the dataset from the recursive approach, and the performance is MRR@20: 18.298% and HR@20: 52.743%. But I didn't try the other baseline models. Meanwhile, if I want to do follow-up research, it feels like I should use the same data preprocessing. Am I right?
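For reference, HR@20 counts a prediction as a hit when the target item appears in the top 20, and MRR@20 averages the reciprocal rank of the target, treating ranks beyond 20 as zero. A minimal sketch, assuming each test session yields the 1-based rank of the true next item:

```python
def hr_mrr_at_k(ranks, k=20):
    """Compute (HR@K, MRR@K) from 1-based target ranks, where each
    rank is the position of the true next item in the sorted
    prediction list for one test session."""
    hr = sum(1 for r in ranks if r <= k) / len(ranks)
    mrr = sum(1.0 / r for r in ranks if r <= k) / len(ranks)
    return hr, mrr

# Example: targets ranked 1st, 30th, and 4th in three test sessions.
print(hr_mrr_at_k([1, 30, 4]))  # (0.666..., 0.4166...)
```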
It is not necessary. If you use the data preprocessing procedures of previous works, then you don't need to run the models again; you can simply report the results from those papers. But you can also use your own preprocessing procedure; you just need to run the models on your own datasets.
That makes sense. Thanks for your help. I will close the issue.