suderoy / PREREQ-IAAI-19

Inferring Concept Prerequisite Relations from Online Educational Resources (IAAI-19)
https://arxiv.org/abs/1811.12640
GNU General Public License v3.0

Custom dataset #4

Open mhoangvslev opened 3 years ago

mhoangvslev commented 3 years ago

I am struggling to build a dataset that is compatible with your requirements.

suderoy commented 3 years ago

Hi,

The paper assumes that you have a predefined concept space, which is the set of concept phrases. vocab.txt is the list of these phrases, so it is predefined.

You might create vocab.txt by building n-grams from your texts. This may give a good candidate list, but not all of the candidates will be concepts of interest. You can think about a better way of coming up with one. :)
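The n-gram suggestion above can be sketched in a few lines of Python. This is only an illustration, not the method used in the paper: the function name `candidate_ngrams`, the tokenizer regex, the `min_count` threshold, and the sample texts are all assumptions. It also assumes vocab.txt holds one phrase per line.

```python
import re
from collections import Counter

def candidate_ngrams(texts, n_min=1, n_max=3, min_count=2):
    """Return word n-grams that occur at least `min_count` times across `texts`."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphanumeric tokens only (a deliberately crude tokenizer).
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return sorted(gram for gram, count in counts.items() if count >= min_count)

# Hypothetical course texts, for illustration only.
texts = [
    "A binary search tree supports fast lookup.",
    "Balancing a binary search tree keeps lookup fast.",
]
candidates = candidate_ngrams(texts, n_min=2, n_max=3, min_count=2)

# To produce a vocab.txt candidate list (assumed format: one phrase per line):
# with open("vocab.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(candidates) + "\n")
```

The output still contains phrases such as "a binary" that are not concepts, so a manual or model-based filtering pass is needed afterwards.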

About the other files, as mentioned in the README:

- cs_edges.csv: This is course prerequisite information. Each line is a comma-separated pair of course IDs, representing that one course is a prerequisite for the other. These are the same course IDs you have in cs_courses.csv.
- cs_preqs.csv: These are concept prerequisite pairs. Each line "ConceptA,ConceptB" represents the prerequisite relationship. ConceptA and ConceptB are concepts from the concept space, that is, from vocab.txt.

All the best!
Sudeshna

On Wed, Dec 23, 2020, 8:28 AM Minh-Hoang DANG wrote:

> I am struggling to build a dataset that is compatible to your requirements.
>
> - How do you generate vocab.txt?
> - How do you generate cs_preq.csv and cs_edges.csv automatically?

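The requirement that every concept in cs_preqs.csv comes from vocab.txt can be checked mechanically. Below is a minimal sketch, assuming vocab.txt has one phrase per line and cs_preqs.csv has one headerless pair per line. The helper name `unknown_concepts` and the sample data are hypothetical, and in-memory strings stand in for the real files.

```python
import csv
import io

def unknown_concepts(preq_lines, vocab):
    """Return concepts used in prerequisite pairs that are missing from `vocab`."""
    missing = set()
    for row in csv.reader(preq_lines):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 comma-separated fields, got {row!r}")
        for concept in row:
            if concept.strip() not in vocab:
                missing.add(concept.strip())
    return sorted(missing)

# Hypothetical stand-ins for vocab.txt and cs_preqs.csv.
vocab = {"binary tree", "recursion", "tree traversal"}
preqs = io.StringIO("recursion,tree traversal\nbinary tree,heap\n")
missing = unknown_concepts(preqs, vocab)
```

With real files, `vocab` could be built from `open("vocab.txt")` by stripping each line, and `preqs` replaced by `open("cs_preqs.csv", newline="")`.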
mhoangvslev commented 3 years ago

Thank you for your reply! The concept words in vocab.txt are extracted by running a BERT/RoBERTa model on each course's summary. Concepts in this case are simply keywords/keyphrases. We extracted word groups of fixed length and used Max Sum Similarity / Maximal Marginal Relevance, with cosine similarity as the distance measure, to diversify the results.

Course Concept 1 Concept 2 Concept 3 Concept 4 Concept 5
HIST 234 disease impact society sars swine flu theory disease development swine flu leading hiv aids
RLST 152 importance new testament testament theological themes early christianity ies christianity analyzing literature theological appropriation new
PSYC 123 encompasses study eating eating affects health eating disorders global influence food agriculture food problems politics
ECON 251 extension economic equilibrium studied extension economic separating financial world financial world rest hedge funds
PSYC 110 study thought behavior love lust hunger tickle course brain break illness apes learn sign
HIST 116 converting british colonists revolution entailed past america victory revolution minds people british colonists american
CLCV 205 translation works modern greek classical period students greek civilization students read original
AMST 246 course contain graphic language users disturbing warning lectures hemingway fitzgerald hemingway fitzgerald faulkner
EEB 122 students beginning study environment discusses major biologists accessible yale yale college undergraduates
ENGL 300 century literary lectures twentieth century literary readings explicate appropriate philosophical social perspectives

And the vocab.txt only contains the concept words.
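
The Maximal Marginal Relevance step mentioned above can be sketched with NumPy alone. This is an illustration under assumptions, not the exact code used here: the function names `cosine_sim` and `mmr`, the `diversity` weight, and the toy 2-D vectors (standing in for BERT/RoBERTa embeddings) are all made up for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mmr(doc_vec, cand_vecs, candidates, top_n=5, diversity=0.5):
    """Greedily pick phrases that are relevant to the document but not redundant."""
    doc_sim = cosine_sim(cand_vecs, doc_vec[None, :])[:, 0]
    cand_sim = cosine_sim(cand_vecs, cand_vecs)
    selected = [int(np.argmax(doc_sim))]  # start with the most relevant phrase
    remaining = [i for i in range(len(candidates)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        # Redundancy: highest similarity to any phrase already selected.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

# Toy vectors for illustration only; real inputs would be model embeddings.
doc = np.array([1.0, 0.0])
phrases = ["eating disorders", "eating disorders global", "food agriculture"]
vecs = np.array([[1.0, 0.10], [1.0, 0.12], [0.6, -0.8]])
picked = mmr(doc, vecs, phrases, top_n=2, diversity=0.5)
```

With `diversity=0.0` the selection falls back to plain relevance ranking and returns the two near-duplicate phrases. Raising `diversity` trades relevance for variety, which is one way to reduce overlapping keyphrases such as those in the table above.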