Duplicate/inconsistent records with same user id and timestamp

riiid / ednet

EdNet is the dataset of all student-system interactions collected over 2 years by Santa, a multi-platform AI tutoring service with more than 780K users in Korea, available through Android, iOS, and web.

Duplicate/inconsistent records with same user id and timestamp #6

Open xiaoqtcd opened 2 years ago

xiaoqtcd commented 2 years ago

Hi. I found many duplicate records with the same user id and timestamp in KT1 and KT3. For example, for user u1 in KT1, two records share the timestamp 1567140388553:

    1567140388553,219,q10649,a,57500
    1567140388553,219,q10648,b,57500

Another example, from KT3, for user u1 with timestamp 1567115277665:

    1567115277665,respond,q4790,sprint,b,mobile
    1567115277665,respond,q4790,sprint,b,mobile

The first example is especially confusing because it shows different responses to different questions at the same timestamp. Moreover, it does not seem possible to reconstruct the KT1 records from the KT3 data, because the recorded timestamps are inconsistent. I am wondering whether these point to real issues in the dataset. Is there any way to get a cleaner version? Many thanks!
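
For anyone who wants to check their own files for this, here is a minimal sketch that groups rows by timestamp and separates exact duplicates from distinct rows that merely share a timestamp. It assumes only that the timestamp is the first comma-separated field, as in the records quoted above. The function names `find_timestamp_collisions` and `classify` are made up for illustration and are not part of any EdNet tooling.

```python
import csv
from collections import defaultdict


def find_timestamp_collisions(lines):
    """Group CSV rows by their first field (assumed to be the timestamp)
    and return only the timestamps that occur more than once."""
    by_timestamp = defaultdict(list)
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        by_timestamp[fields[0]].append(tuple(fields))
    return {ts: rows for ts, rows in by_timestamp.items() if len(rows) > 1}


def classify(rows):
    """Label a collision: identical rows, or different rows with one timestamp."""
    return "exact_duplicate" if len(set(rows)) == 1 else "distinct_rows"


# The two KT1 records quoted in this issue.
sample = [
    "1567140388553,219,q10649,a,57500",
    "1567140388553,219,q10648,b,57500",
]
collisions = find_timestamp_collisions(sample)
report = {ts: classify(rows) for ts, rows in collisions.items()}
print(report)
```

To scan a real file, an open file object can be passed in place of `sample`. A header line would simply form its own single-row group and never be reported. Whether "distinct_rows" collisions are errors or legitimate multi-question submissions is not something this check can decide.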

kwonmha commented 2 years ago

Hello @xiaoqtcd, I'm trying to reproduce the SAINT model on the KT1 dataset and got a worse AUC than those reported in papers such as LPKT, SAINT+, and SAINT. Since my code worked fine on the Kaggle Riiid dataset, I suspect my results are caused by the unclean state of the dataset. How are your AUC or ACC on the KT1 dataset? Are they good enough?

xiaoqtcd commented 2 years ago

Hi @kwonmha, I am mostly doing unsupervised learning at the moment, so I don't have AUC or ACC results. But I did some data analysis and found many problems in the dataset. What is LPKT? Are you using the SAINT and SAINT+ code provided by Riiid? I found that it's actually not that straightforward even to reconstruct the Kaggle Riiid dataset's format from the raw EdNet dataset. Could you explain a bit how you did that?

kwonmha commented 2 years ago

Hi @xiaoqtcd, LPKT is the model proposed in "Learning Process-consistent Knowledge Tracing" (KDD '21).

I used SAINT implementations written by participants of the Kaggle Riiid competition and modified the code to handle the KT1 dataset instead of the competition dataset. I didn't reconstruct KT1 into the Riiid format. I think the two are similar enough that only a few code modifications are needed to feed KT1 data into a SAINT implementation written for the Kaggle dataset (selecting columns, or comparing answers to determine whether each one is correct).
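
As a rough illustration of the "compare answers" step, the sketch below derives a 0/1 correctness label by looking up each question in an answer key. It is not the code actually used in this thread. The helper name `label_correctness` is hypothetical, and the answer key values are invented for the example. In practice the key would have to come from the dataset's question metadata.

```python
def label_correctness(records, answer_key):
    """Turn (timestamp, question_id, user_answer) records into
    (timestamp, question_id, correct) triples, where correct is 1 or 0.

    Records whose question is missing from the answer key are skipped
    rather than guessed at.
    """
    labeled = []
    for timestamp, question_id, user_answer in records:
        correct_answer = answer_key.get(question_id)
        if correct_answer is None:
            continue  # no ground truth available for this question
        labeled.append((timestamp, question_id, int(user_answer == correct_answer)))
    return labeled


# Selected columns from the two KT1 records quoted earlier in this issue.
records = [
    (1567140388553, "q10649", "a"),
    (1567140388553, "q10648", "b"),
]

# Made-up answer key, for illustration only (not real EdNet values).
answer_key = {"q10649": "a", "q10648": "c"}

labels = label_correctness(records, answer_key)
print(labels)
```

Skipping questions without a known answer keeps the labels trustworthy, at the cost of silently shortening sequences. Counting how many records are dropped is worth doing before training.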

xiaoqtcd commented 2 years ago

Hi @kwonmha, thanks a lot for pointing out the LPKT paper. It's an interesting one. But note that in the Kaggle challenge dataset, prior_question_had_explanation and prior_question_elapsed_time are given, while in EdNet, I think, they need to be reconstructed from KT1 and KT3 together. task_container_id needs to be reconstructed as well. I am not sure whether cleaning EdNet plays an important part in accuracy, but I believe reconstructing the data format correctly is very important.
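
To make the reconstruction idea concrete, here is one possible sketch, under assumptions that are not confirmed anywhere in this thread: that the second KT1 column identifies a group of questions answered together (called `solving_id` here), that consecutive rows with the same value form one container, and that the prior elapsed time can be approximated by the mean `elapsed_time` of the previous container. The function name `add_container_features` is hypothetical. prior_question_had_explanation is left out because it would need KT3.

```python
from itertools import groupby


def add_container_features(rows):
    """Add task_container_id and prior_question_elapsed_time to each row.

    rows: chronologically ordered dicts with 'solving_id' and 'elapsed_time'.
    Consecutive rows sharing a solving_id are treated as one container
    (an assumption). The prior elapsed time is the mean elapsed_time of the
    previous container, and None for the first container.
    """
    out = []
    prior_elapsed = None
    groups = groupby(rows, key=lambda r: r["solving_id"])
    for container_id, (_, group) in enumerate(groups):
        members = list(group)
        for row in members:
            out.append({
                **row,
                "task_container_id": container_id,
                "prior_question_elapsed_time": prior_elapsed,
            })
        prior_elapsed = sum(r["elapsed_time"] for r in members) / len(members)
    return out


rows = [
    # First row is invented so the quoted records have a predecessor.
    {"solving_id": 218, "question_id": "q_example", "elapsed_time": 20000},
    # The two KT1 records quoted earlier in this issue.
    {"solving_id": 219, "question_id": "q10649", "elapsed_time": 57500},
    {"solving_id": 219, "question_id": "q10648", "elapsed_time": 57500},
]

features = add_container_features(rows)
for row in features:
    print(row["question_id"], row["task_container_id"], row["prior_question_elapsed_time"])
```

If elapsed_time in KT1 turns out to be recorded per group rather than per question, the averaging step would need to change. That should be checked against the dataset documentation before relying on these values.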

kwonmha commented 2 years ago

@xiaoqtcd As the SAINT model doesn't take prior_question_had_explanation or prior_question_elapsed_time as input, I didn't try to reconstruct KT1 into the Kaggle format. I didn't use them when testing the Kaggle dataset with SAINT, and I don't want to use them when testing KT1 data.

xiaoqtcd commented 2 years ago

I see. I am not familiar with SAINT, but for SAINT+ it seems to me that the temporal information and some other fields need to be reorganized to prepare the embeddings used in the model, so I thought you were doing that. There are many different implementations of SAINT+ on Kaggle, and it seems the authors from Riiid are on Kaggle as well. We can connect on Kaggle and discuss further if you'd like.