riiid / ednet

EdNet is the dataset of all student-system interactions collected over 2 years by Santa, a multi-platform AI tutoring service with more than 780K users in Korea available through Android, iOS and web.

EdNet

Paper : https://arxiv.org/abs/1912.03072

Leaderboard : Link

Properties of EdNet

The EdNet dataset contains various features of student actions, such as which learning material a student has consumed, the student's response, and how much time the student has spent solving a given question or reading through an expert's commentary. EdNet has the following properties.

1. Large scale

EdNet is composed of a total of 131,441,538 interactions collected from 784,309 students of Santa since 2017. On average, each student has generated 441.20 interactions while using Santa. Based on those interactions, EdNet gives researchers access to large-scale, real-world intelligent tutoring system (ITS) data. Moreover, Santa provides a total of 13,169 problems and 1,021 lectures tagged with 293 types of skills, which have been consumed 95,294,926 times and 601,805 times, respectively. To the best of our knowledge, this is the largest publicly available dataset in education in terms of the total number of students, interactions, and interaction types.

2. Diversity

EdNet offers the most diverse set of interactions among all existing ITS data. The set of behaviors directly related to learning is also richer than in other datasets, as EdNet includes learning activities, such as reading explanations and watching lectures, that others do not provide. Such diversity enables researchers to analyze students from various perspectives. For example, purchasing logs may help to analyze students' engagement in learning. A content information table is also provided separately.

3. Hierarchy

EdNet has a hierarchical structure of different data points. To provide various kinds of actions in a consistent and organized manner, EdNet offers the datasets in four different levels, named KT1, KT2, KT3 and KT4. As the level of the dataset increases, the number and types of actions involved also increase. The details of each dataset are described below.

4. Multi-platform

In an age where students have access to various devices, from personal computers to smartphones and AI speakers, ITSs inevitably need to offer access from multiple platforms. Accordingly, Santa is a multi-platform system available on iOS, Android and the web, and EdNet contains data points gathered from both mobile and desktop. This allows the study of AI in Education (AIEd) models suited for future multi-platform ITSs, using data collected from different platforms in a consistent manner.

Dataset

As noted above, there are four datasets, named KT1, KT2, KT3, and KT4, each covering a different extent of student actions.

KT1

Download a .zip file from bit.ly/ednet_kt1

Specification
Size of the compressed file 1.2GB
Size of the uncompressed file 5.6GB
The number of files 784,309

Structure

KT1 consists of students' question-solving logs, the most basic and fundamental information that can be used by various deep-learning knowledge tracing models such as Deep Knowledge Tracing and Self-Attentive Knowledge Tracing. EdNet-KT1 is the record collected by Santa since Apr. 18, 2017, following this question-response sequence format.

A major property of EdNet is that the questions come in bundles, that is, collections of questions sharing a common passage, picture or listening material. For example, the questions with IDs q2319, q2320 and q2321 may share the same reading passage. In this case, the questions are said to form a bundle and are given to the student together with the shared material. When a bundle is given, a student has access to all of its problems and has to respond to all of them in order to complete the bundle.

Description

| timestamp | question_id | bundle_id | user_answer | elapsed_time |
| --- | --- | --- | --- | --- |
| 1548996377530 | 48 | q2844 | d | 47000 |
| 1548996378149 | 48 | q2845 | d | 47000 |
| 1548996378665 | 48 | q2846 | d | 47000 |
| 1548996671661 | 49 | q4353 | c | 67000 |
| 1548996787866 | 50 | q3944 | a | 54000 |
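
As a minimal sketch of working with these rows, the snippet below groups questions that were answered together and converts a timestamp to a calendar date. It rests on several assumptions that the text above does not confirm: that the per-student files are comma-separated, that timestamps are milliseconds since the Unix epoch, and that the bare integer in the second column identifies one bundle attempt. Because the sample rows carry an integer in the second column and a q-prefixed ID in the third, the sketch reads columns by position instead of relying on the header names. The function names are illustrative only.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

# Sample rows copied from the table above, rewritten as comma-separated
# text (assumed file format; real per-student files may differ).
SAMPLE = """\
1548996377530,48,q2844,d,47000
1548996378149,48,q2845,d,47000
1548996378665,48,q2846,d,47000
1548996671661,49,q4353,c,67000
1548996787866,50,q3944,a,54000
"""

def group_questions(text):
    """Group q-prefixed question IDs by the integer in the second column."""
    groups = defaultdict(list)
    for _ts, group_key, question, _answer, _elapsed in csv.reader(io.StringIO(text)):
        groups[int(group_key)].append(question)
    return dict(groups)

def to_utc(ms):
    """Convert a millisecond Unix timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)

groups = group_questions(SAMPLE)
print(groups)  # {48: ['q2844', 'q2845', 'q2846'], 49: ['q4353'], 50: ['q3944']}
print(to_utc(1548996377530).date())  # 2019-02-01
```

The three questions sharing the value 48 also share the same elapsed time, which is consistent with a bundle being answered and submitted as one unit.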

KT2

Download a .zip file from bit.ly/ednet-kt2

Specification
Size of the compressed file 0.6GB(555.8MB)
Size of the uncompressed file 3.1GB
The number of files 297,444

Structure

A major drawback of the question-response sequence format is that it cannot account for the inherent heterogeneity of students' actions. For example, a student may alternate between two answer choices before submitting a final answer, which possibly signals that the student is unsure of either option. Because of the restrictions of the question-response format, a dataset that follows it, like EdNet-KT1, cannot effectively represent such a situation. To overcome this limitation, Santa has collected the full behavior of students since Aug. 27, 2018. As a result, the datasets EdNet-KT2, EdNet-KT3 and EdNet-KT4, which consist of the action sequences of each user, have been compiled. Each action represents a single unit of behavior made by a student in the Santa UI, such as watching a video lecture, choosing a response option, or reading a passage. By recording a student's behavior as-is, the datasets represent each student's behavior more accurately and allow AIEd models to incorporate finer details of learning history.

EdNet-KT2, the simplest action-based dataset of EdNet, consists of the actions related to question-solving activities. Note that the features of KT1 can be fully recovered from the columns of KT2, and KT2 contains further information such as the study mode of the student or the intermediate responses provided by the student.

Description

Example

| timestamp | action_type | item_id | source | user_answer | platform |
| --- | --- | --- | --- | --- | --- |
| 1358114668713 | enter | b4957 | diagnosis | | mobile |
| 1358114691713 | respond | q6425 | diagnosis | c | mobile |
| 1358114701104 | respond | q6425 | diagnosis | d | mobile |
| 1358114712364 | submit | b4957 | diagnosis | | mobile |
| 1358114729868 | enter | b5180 | sprint | | mobile |
| 1358114745592 | respond | q6815 | sprint | c | mobile |
| 1358114748023 | respond | q6816 | sprint | a | mobile |
| 1358114748781 | respond | q6814 | sprint | a | mobile |
| 1358114751032 | submit | b5180 | sprint | | mobile |
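
One way to recover KT1-style final answers from an action sequence like the one above is to keep only the last respond action per question between an enter and the matching submit. The sketch below illustrates that idea on the sample rows. It assumes that the last response before a submit is the answer that counts, which the text above implies but does not state outright, and `final_answers` is an illustrative name, not part of any released tooling.

```python
# Sample rows from the table above as
# (timestamp, action_type, item_id, source, user_answer, platform).
ACTIONS = [
    (1358114668713, "enter", "b4957", "diagnosis", "", "mobile"),
    (1358114691713, "respond", "q6425", "diagnosis", "c", "mobile"),
    (1358114701104, "respond", "q6425", "diagnosis", "d", "mobile"),
    (1358114712364, "submit", "b4957", "diagnosis", "", "mobile"),
    (1358114729868, "enter", "b5180", "sprint", "", "mobile"),
    (1358114745592, "respond", "q6815", "sprint", "c", "mobile"),
    (1358114748023, "respond", "q6816", "sprint", "a", "mobile"),
    (1358114748781, "respond", "q6814", "sprint", "a", "mobile"),
    (1358114751032, "submit", "b5180", "sprint", "", "mobile"),
]

def final_answers(actions):
    """Return (submit_timestamp, question_id, final_answer) tuples."""
    results = []
    pending = {}  # question_id -> latest answer since the last enter
    for ts, action, item, _source, answer, _platform in sorted(actions):
        if action == "enter":
            pending = {}
        elif action == "respond":
            pending[item] = answer  # later responses overwrite earlier ones
        elif action == "submit":
            results.extend((ts, q, a) for q, a in pending.items())
            pending = {}
    return results

answers = final_answers(ACTIONS)
for row in answers:
    print(row)
```

For q6425 the intermediate choice c is discarded and only the final choice d is kept, while the intermediate response remains available in the raw actions for models that want it.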

KT3

Download a .zip file from bit.ly/ednet-kt3

Specification
Size of the compressed file 0.8GB(762.8MB)
Size of the uncompressed file 4.3GB
The number of files 297,915

Structure

In Santa, a student may participate in various learning activities aside from solving questions, such as reading through experts' commentary on a question or watching lectures provided by the system. EdNet-KT3 incorporates such learning activities by adding the corresponding actions to the EdNet-KT2 dataset. Such actions can be used to infer the impact of learning activities on each student's knowledge state. For example, one may compute the time each student has spent studying a given material by subtracting the timestamp of the enter action from that of the quit action, and use this to study the effect of students' different learning behaviors.

Description

As noted above, actions on explanations and lectures are added.

Example

| timestamp | action_type | item_id | source | user_answer | platform |
| --- | --- | --- | --- | --- | --- |
| 1573364188664 | enter | b790 | sprint | | mobile |
| 1573364206572 | respond | q790 | sprint | b | mobile |
| 1573364209673 | respond | q790 | sprint | d | mobile |
| 1573364209710 | submit | b790 | sprint | | mobile |
| 1573364209745 | enter | e790 | sprint | | mobile |
| 1573364218306 | quit | e790 | sprint | | mobile |
| 1573364391205 | enter | l540 | adaptive_offer | | mobile |
| 1573364686796 | quit | l540 | adaptive_offer | | mobile |
| 1573364693793 | enter | b6191 | adaptive_offer | | mobile |
| 1573364702213 | respond | q8840 | adaptive_offer | c | mobile |
| 1573364705838 | submit | b6191 | adaptive_offer | | mobile |
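
The enter/quit subtraction described in the Structure paragraph can be sketched as follows, using the sample rows above. The sketch assumes timestamps are in milliseconds and that each quit closes the most recent enter on the same item; `dwell_times` is an illustrative helper name.

```python
# (timestamp, action_type, item_id) triples taken from the sample rows above.
ACTIONS = [
    (1573364188664, "enter", "b790"),
    (1573364206572, "respond", "q790"),
    (1573364209673, "respond", "q790"),
    (1573364209710, "submit", "b790"),
    (1573364209745, "enter", "e790"),
    (1573364218306, "quit", "e790"),
    (1573364391205, "enter", "l540"),
    (1573364686796, "quit", "l540"),
    (1573364693793, "enter", "b6191"),
    (1573364702213, "respond", "q8840"),
    (1573364705838, "submit", "b6191"),
]

def dwell_times(actions):
    """Pair each quit with the latest enter on the same item.

    Returns (item_id, milliseconds_spent) tuples in order of quitting.
    Items that are entered but never quit (bundles end with submit) are skipped.
    """
    entered = {}
    spans = []
    for ts, action, item in sorted(actions):
        if action == "enter":
            entered[item] = ts
        elif action == "quit" and item in entered:
            spans.append((item, ts - entered.pop(item)))
    return spans

spans = dwell_times(ACTIONS)
print(spans)  # [('e790', 8561), ('l540', 295591)]
```

In this sample the student spent about 8.6 seconds on explanation e790 and about 296 seconds on lecture l540.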

KT4

Download a .zip file from bit.ly/ednet-kt4

Specification
Size of the compressed file 1.2GB
Size of the uncompressed file 6.4GB
The number of files 297,915

Structure

In EdNet-KT4, a complete list of actions collected by Santa is provided. In particular, the following types of actions are added to EdNet-KT3: erase_choice, undo_erase_choice, play_audio, pause_audio, play_video, pause_video, pay, refund, and enroll_coupon.

Description

Example

| timestamp | action_type | item_id | cursor_time | source | user_answer | platform |
| --- | --- | --- | --- | --- | --- | --- |
| 1358114668713 | pay | p25 | | | | mobile |
| 1358114691713 | enter | b878 | | sprint | | mobile |
| 1358114701104 | text_enter | q878 | | sprint | | mobile |
| 1358114712364 | play_audio | q878 | 0 | sprint | | mobile |
| 1358114729868 | pause_audio | q878 | 10000 | sprint | | mobile |
| 1358114745592 | eliminate_choice | q878 | | sprint | a | mobile |
| 1358114748023 | respond | q878 | | sprint | c | mobile |
| 1358114748781 | submit | b878 | | sprint | | mobile |
| 1358114751032 | enter | e878 | | sprint | | mobile |
| 1358114779211 | play_audio | e878 | 0 | sprint | | mobile |
| 1358114792300 | pause_audio | e878 | 8000 | sprint | | mobile |
| 1358114842195 | quit | e878 | | sprint | | mobile |

Contents

Download a .zip file from bit.ly/ednet-content

Specification
Size of the compressed file 0.6MB
Size of the uncompressed file 0.1MB
The number of files 4

There are five types of content that Santa serves to students: questions, lectures, payment items, coupons, and scores.

Scores are not yet released; they will be released later.

Questions

The question information table contains 7 columns: question_id, bundle_id, explanation_id, correct_answer, part, tags, and deployed_at.

Example
| question_id | bundle_id | explanation_id | correct_answer | part | tags | deployed_at |
| --- | --- | --- | --- | --- | --- | --- |
| q2319 | b1707 | e1707 | a | 3 | 179;53;183;184 | 1571279008033 |
| q2320 | b1707 | e1707 | d | 3 | 52;183;184 | 1571279009205 |
| q2321 | b1707 | e1707 | d | 3 | 52;183;184 | 1571279010285 |
| q2322 | b1708 | e1708 | b | 3 | 52;183;184 | 1571279012823 |
| q2323 | b1708 | e1708 | c | 3 | 179;52;182;184 | 1571279013890 |
| q2324 | b1708 | e1708 | d | 3 | 52;183;184 | 1571279014989 |
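
The question table is what turns a logged user_answer into a correct/incorrect label. The sketch below loads a few of the rows above, splits the semicolon-separated tags into integers, and grades two responses. It assumes the released file is comma-separated with the listed column names as its header; the responses being graded are hypothetical and not taken from the dataset, and the function names are illustrative.

```python
import csv
import io

# Rows copied from the example table above, rewritten as comma-separated
# text (assumed file format).
QUESTIONS_CSV = """\
question_id,bundle_id,explanation_id,correct_answer,part,tags,deployed_at
q2319,b1707,e1707,a,3,179;53;183;184,1571279008033
q2320,b1707,e1707,d,3,52;183;184,1571279009205
q2321,b1707,e1707,d,3,52;183;184,1571279010285
"""

def load_questions(text):
    """Index question rows by question_id and split tags into integers."""
    questions = {}
    for row in csv.DictReader(io.StringIO(text)):
        row["tags"] = [int(tag) for tag in row["tags"].split(";")]
        questions[row["question_id"]] = row
    return questions

def is_correct(questions, question_id, user_answer):
    """Compare a student's answer with the recorded correct_answer."""
    return questions[question_id]["correct_answer"] == user_answer

questions = load_questions(QUESTIONS_CSV)
# Hypothetical responses, used only to illustrate grading.
responses = [("q2319", "a"), ("q2320", "b")]
graded = [(q, is_correct(questions, q, a)) for q, a in responses]
print(questions["q2319"]["tags"])  # [179, 53, 183, 184]
print(graded)                      # [('q2319', True), ('q2320', False)]
```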

Lectures

The lecture information table contains 5 columns: lecture_id, part, tags, video_length, and deployed_at.

Note that tags, video_length, and deployed_at are not available for some lectures; in those cases the value is recorded as -1.

Example
| lecture_id | part | tags | video_length | deployed_at |
| --- | --- | --- | --- | --- |
| l805 | 5 | 99 | 230000 | 1570426256512 |
| l855 | 0 | 203 | 253000 | 1570425118345 |
| l1259 | 1 | 222 | 359000 | 1570424729123 |
| l1260 | 1 | 220 | 487000 | 1570424738105 |
| l1261 | 1 | 221 | 441000 | 1570424743162 |
| l1262 | 1 | 223 | 587000 | 1570424748780 |
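
Because -1 stands for a missing value, it is worth converting it to an explicit null before computing statistics, so that it does not skew averages. The sketch below shows one way to do so. It assumes a comma-separated file with the listed column names as its header; the row `l0` is a hypothetical example of a lecture with missing fields, added only for illustration.

```python
import csv
import io

# Two rows from the table above plus one hypothetical row (l0) whose
# optional fields are all recorded as -1.
LECTURES_CSV = """\
lecture_id,part,tags,video_length,deployed_at
l805,5,99,230000,1570426256512
l855,0,203,253000,1570425118345
l0,1,-1,-1,-1
"""

NULLABLE = ("tags", "video_length", "deployed_at")

def load_lectures(text):
    """Read lecture rows, replacing the -1 sentinel with None."""
    lectures = []
    for row in csv.DictReader(io.StringIO(text)):
        for key in NULLABLE:
            if row[key] == "-1":
                row[key] = None
        lectures.append(row)
    return lectures

lectures = load_lectures(LECTURES_CSV)
known = [int(r["video_length"]) for r in lectures if r["video_length"] is not None]
mean_length = sum(known) / len(known)
print(mean_length)  # 241500.0
```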

Payment items

The payment item information table contains 4 columns: payment_item_id, payment_type, duration, and number_of_questions.

| payment_item_id | payment_type | duration | number_of_bundles |
| --- | --- | --- | --- |
| p6 | pass | 15552000000 | -1 |
| p7 | paygo | -1 | 4000 |
| p8 | pass | 10368000000 | -1 |
| p9 | pass | 31536000000 | -1 |
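
The duration values are easier to read once converted to days. The sketch below assumes that durations are expressed in milliseconds, like the timestamps elsewhere in the dataset, and that -1 means the item has no time limit; neither point is stated in the text above. `duration_days` is an illustrative helper name.

```python
from datetime import timedelta

# (payment_item_id, payment_type, duration) taken from the rows above.
PAYMENT_ITEMS = [
    ("p6", "pass", 15552000000),
    ("p7", "paygo", -1),
    ("p8", "pass", 10368000000),
    ("p9", "pass", 31536000000),
]

def duration_days(duration_ms):
    """Convert a millisecond duration to whole days; -1 is treated as missing."""
    if duration_ms == -1:
        return None
    return timedelta(milliseconds=duration_ms).days

days = {item: duration_days(ms) for item, _kind, ms in PAYMENT_ITEMS}
print(days)  # {'p6': 180, 'p7': None, 'p8': 120, 'p9': 365}
```

Under the millisecond assumption, the three pass items correspond to 180, 120 and 365 days, which suggests the unit is plausible.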

Coupons

The coupon information table consists of 3 columns: coupon_id, coupon_type, and duration.

| coupon_id | coupon_type | duration |
| --- | --- | --- |
| c16 | type-1 | 15552000000 |
| c17 | type-2 | 432000000 |
| c18 | type-3 | 1000000 |
| c19 | type-3 | 36000000 |

Scores

To be released.

Contact

If you have any questions or find any issues, please contact research@riiid.co.

License

The dataset is publicly released under the Creative Commons Attribution-NonCommercial 4.0 International license for research purposes.