pykt-team / pykt-toolkit

pyKT: A Python Library to Benchmark Deep Learning based Knowledge Tracing Models
https://pykt.org
MIT License
194 stars, 53 forks

Problems with the assistment2017 dataset #124

Closed linchuanghong closed 3 months ago

linchuanghong commented 11 months ago

The assistment2017 dataset (anonymized_full_release_competition_dataset.csv) does not include concepts. Why does the dataset generated by "data_preprocess.py" have concepts?

sonyawong commented 11 months ago

Hi, the original data file of assistment2017 (anonymized_full_release_competition_dataset.csv) includes a "skill" column, which can be treated as the corresponding concepts.

linchuanghong commented 11 months ago

Hi, but the skill in the raw data table is not an id: it is the name of the skill, not an integer. Why is the concept an integer id after preprocessing?

sonyawong commented 11 months ago

During data preprocessing, we map the questions, KC contents, etc. to new ids. See pykt-toolkit/pykt/preprocess/split_datasets.py, line 471.
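As a rough illustration of this kind of re-indexing, here is a minimal sketch that assigns consecutive integer ids to skill names in order of first appearance. It is not the code from split_datasets.py; the helper name `build_id_map` and the sample skill names are hypothetical.

```python
def build_id_map(values):
    """Assign consecutive integer ids to values in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# Hypothetical skill names as they might appear in a raw "skill" column.
skills = ["area", "fractions", "area", "probability"]

skill2id = build_id_map(skills)
concept_ids = [skill2id[s] for s in skills]

print(skill2id)
print(concept_ids)
```

The same idea applies to question identifiers: each distinct raw value gets one integer id, so the preprocessed files contain ids even though the raw table contains names.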

linchuanghong commented 11 months ago

Thanks for the reply!

linchuanghong commented 11 months ago

Are there duplicate sequences in the "train_valid_sequences" and "train_valid_quelevel" datasets?

linchuanghong commented 11 months ago

Can you explain the difference between the preprocessed data files?

sonyawong commented 11 months ago

Thank you for your interest in our work. After data preprocessing, we get train_valid.csv and train_valid_sequences.csv, which contain the samples before and after truncation, respectively. Files with "quelevel" in their names are the data files for question-level KT models such as iekt, qikt and lpkt. For the test set, there are additional window files, which use the nearest N historical interactions to predict the student's performance on the next question.
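To make the two ideas concrete, here is a minimal sketch under stated assumptions: "truncation" is taken to mean cutting a long interaction sequence into fixed-length pieces padded with -1, and a "window" is taken to mean the most recent N interactions before each target. The function names, the padding value, and the toy data are hypothetical and are not taken from the pyKT code.

```python
def split_into_subsequences(seq, maxlen, pad=-1):
    """Cut a sequence into consecutive chunks of length maxlen, padding the last one."""
    chunks = []
    for start in range(0, len(seq), maxlen):
        chunk = seq[start:start + maxlen]
        chunk = chunk + [pad] * (maxlen - len(chunk))
        chunks.append(chunk)
    return chunks


def sliding_windows(seq, n):
    """For each target position, pair the nearest n previous items with the target."""
    pairs = []
    for i in range(1, len(seq)):
        history = seq[max(0, i - n):i]
        pairs.append((history, seq[i]))
    return pairs


# Toy response sequence for one student (1 = correct, 0 = incorrect).
responses = [1, 0, 1, 1, 0]

subsequences = split_into_subsequences(responses, maxlen=2)
windows = sliding_windows(responses, n=2)

print(subsequences)
print(windows)
```

With the chunked form, a prediction early in a chunk sees little history; with the window form, every target after the first few sees up to N previous interactions.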

linchuanghong commented 11 months ago

How do I change the embedding size of the model? Why does it still not take effect when I modify the entry for the corresponding model in the kt_config.json file?

sonyawong commented 11 months ago

Sorry for the late reply. Could you share the code you modified that relates to seqlen?

linchuanghong commented 11 months ago

I have already resolved this issue; thank you for your reply. Now I would like to know what max_concepts=7 means for the algebra2005 dataset in the data_config.py file.

sonyawong commented 11 months ago

In educational datasets, some questions are associated with multiple knowledge concepts (KCs). We therefore compute, for each dataset, the largest number of KCs associated with any single question and denote it as "max_concepts". For algebra2005, max_concepts=7 means that no question is tagged with more than 7 KCs.
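A minimal sketch of that calculation, assuming the question-to-KC associations are available as a dictionary (the variable `question_to_kcs` and its contents are hypothetical, not the pyKT data format):

```python
# Hypothetical mapping from question id to the KCs attached to that question.
question_to_kcs = {
    "q1": ["k1"],
    "q2": ["k1", "k2", "k3"],
    "q3": ["k2", "k4"],
}

# Largest number of distinct KCs attached to any single question.
max_concepts = max(len(set(kcs)) for kcs in question_to_kcs.values())

print(max_concepts)
```

Here the result is 3, because "q2" carries three KCs and no other question carries more.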