Request for a run_dien toy example on a real dataset, plus DIEN implementation questions

Describe the question

Because run_dien.py only uses randomly generated data, it is hard to understand the exact input data format. Could you provide a version of run_dien that runs on a small sample of real data, for example a small portion of the Amazon dataset?

Additional questions:

Q1. For the user behavior sequence, I only need to build one final sequence per user, right? In other words, I should not build several rows for one user where the sequence grows by one item at a time until it reaches the last item, as shown below.

For a 4-item sequence:
u1: [5]
u1: [5, 10]
u1: [5, 10, 52] 
u1: [5, 10, 52, 7]
u2: ...

Instead, I should build just one sequence per user, as below. Is that right?

u1: [5, 10, 52, 7]
u2: [1, 100, 26, 79]
...
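To make the two layouts in Q1 concrete, here is a minimal sketch of how I would generate each of them. The helper names `prefix_sequences` and `single_sequence` are made up for illustration and are not taken from run_dien.py or DeepCTR-Torch.

```python
# Hypothetical helpers illustrating the two layouts asked about in Q1.
# Nothing here comes from run_dien.py.

def prefix_sequences(history):
    """Option A: one row per prefix, growing by one item each time."""
    return [history[:i] for i in range(1, len(history) + 1)]


def single_sequence(history):
    """Option B: a single row holding the user's full sequence."""
    return [list(history)]


user_history = {"u1": [5, 10, 52, 7], "u2": [1, 100, 26, 79]}

# Option A produces four rows per user, Option B produces one row per user.
rows_a = {u: prefix_sequences(h) for u, h in user_history.items()}
rows_b = {u: single_sequence(h) for u, h in user_history.items()}
```

My question is whether Option B alone is what DIEN expects.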

Q2. Is the Y value here the binary CTR label (0 or 1), indicating whether the target ad was clicked or not?

Q3. If so, I would expect a target ad or item that comes after the user's behavior sequence. But the run_dien example only has user data, gender, item_id, cate_id and so on. Where is the target ad (or item) feature for which the CTR (click or no click) prediction is made?

Q4. My system does not predict clicks on a target ad; it predicts the target item that follows the user's behavior sequence. In that case, should I take the last item out of the behavior sequence, turn it into a new input feature such as 'target item', and feed that into the model? I do not see any 'target item' (or 'target ad') feature in run_dien.py.
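This is a small sketch of the split I have in mind for Q4. The function name `split_target` is hypothetical, and I am assuming the last item of each sequence becomes the target while the rest stays as the behavior sequence.

```python
# Hypothetical helper: split a user's history into (behavior sequence, target item).
# This is only my assumption about how a 'target item' feature could be built.

def split_target(history):
    """Return the history without its last item, and the last item as the target."""
    if len(history) < 2:
        raise ValueError("need at least two items to split off a target")
    return history[:-1], history[-1]


hist_item_id, target_item = split_target([5, 10, 52, 7])
# hist_item_id is [5, 10, 52] and target_item is 7
```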

Q5. If the target item feature is taken from the behavior sequence, I think y has to be 1 for every row, because having a target item means the user already clicked it. With DeepFM I solved this by negative sampling. In DIEN, however, negative sampling is applied to the user behavior sequence in order to supervise the GRU, so I do not know how to obtain rows with y = 0 (not clicked). Should I do a second round of negative sampling, as with DeepFM, at the input-row level rather than inside the behavior sequence, by generating a random target item and setting y to 0? If so, how should I build the positive and negative behavior sequences for the rows with a negatively sampled target item?

Below is the data I have. Only 1:1 negative sampling on the behavior sequence has been done. [Screenshot, 2022-03-15 17-00-07]

If I do 1:1 negative sampling by generating a random target item, my data will double from 18314 to 36628 rows, and y will then take both values, 0 and 1. But, as mentioned above, how should I handle hist_item_id and neg_hist_item_id for the negatively sampled (y = 0) rows? Please advise.
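Here is a sketch of the row-level 1:1 negative sampling I am describing, which would take my data from 18314 to 36628 rows. The function `add_negative_targets` and the `target_item` key are hypothetical. Note the assumption I am unsure about: the y = 0 row simply reuses the same hist_item_id and neg_hist_item_id as the positive row.

```python
import random

# Hypothetical sketch of row-level 1:1 negative sampling on the target item.
# ASSUMPTION: the negative row keeps the same hist_item_id / neg_hist_item_id
# as the positive row. Whether that is correct for DIEN is my question.

def add_negative_targets(rows, all_item_ids, seed=0):
    rng = random.Random(seed)
    out = []
    for row in rows:
        # Positive row: the real target item, labeled y = 1.
        out.append({**row, "y": 1})
        # Negative row: a random item the user never interacted with, y = 0.
        seen = set(row["hist_item_id"]) | {row["target_item"]}
        candidates = [i for i in all_item_ids if i not in seen]
        if not candidates:
            raise ValueError("no unseen item available for negative sampling")
        out.append({**row, "target_item": rng.choice(candidates), "y": 0})
    return out


rows = [
    {"user": "u1", "hist_item_id": [5, 10, 52],
     "neg_hist_item_id": [3, 8, 41], "target_item": 7},
    {"user": "u2", "hist_item_id": [1, 100, 26],
     "neg_hist_item_id": [9, 14, 60], "target_item": 79},
]
expanded = add_negative_targets(rows, list(range(1, 101)))
# len(expanded) is twice len(rows): one y = 1 and one y = 0 row per input row
```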

Q6. What is 'pay_score' in the run_dien example?

Q7. Should I build the negative user behavior sequence manually, as you did in the run_dien example?

Q8. Should the negative behavior sequence for each user have the same length as the positive behavior sequence?
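For Q7 and Q8, this is roughly how I build the negative behavior sequence by hand at the moment. The helper `sample_negative_history` is my own, not from the library, and it assumes one negative item per position (so both sequences have the same length) drawn from items the user never clicked.

```python
import random

# Hypothetical sketch of my current 1:1 negative sampling on the behavior sequence.
# ASSUMPTIONS: one negative per positive position, sampled from items that do
# not appear anywhere in the user's positive history.

def sample_negative_history(hist, all_item_ids, seed=0):
    rng = random.Random(seed)
    clicked = set(hist)
    candidates = [i for i in all_item_ids if i not in clicked]
    if not candidates:
        raise ValueError("no unclicked item available for negative sampling")
    # Same length as the positive sequence by construction.
    return [rng.choice(candidates) for _ in hist]


hist_item_id = [5, 10, 52, 7]
neg_hist_item_id = sample_negative_history(hist_item_id, range(1, 101))
```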

shenweichen / DeepCTR-Torch

Providing real dataset run_dien toy example. DIEN implementation questions. #238
