nusmodifications / nusmods

🏫 Official course planning platform for National University of Singapore.
https://nusmods.com
MIT License
586 stars 319 forks source link

The Exchange Centre: SEP/NOC module mapping tool [1269] #392

Closed ktnc985 closed 7 years ago

ktnc985 commented 7 years ago

NUSAtlas

Introduction:

The objective of this project is to make the module mapping process more convenient for students going for SEP/NOC.

Currently, students who wish to create a module mapping that has not been approved before have to go through a tedious manual mapping process. They need to look through a list of modules offered by partner universities (PUs) and identify a module that is most similar in terms of contents, contact manner, contact hours and assessments to the one offered in NUS.

Our solution is to simplify this process by using natural language processing to generate a similarity report based on the descriptions of modules listed on NUS and the PUs’ websites. Only modules with a similarity percentage of 50% and above will be displayed. Thereby narrowing down the number of modules the student has to look through to find a potential match.

Implementation of solution:

Gathering of data (not fully implemented during orbital phase)

The data required can be split into 3 categories:

  1. NUS module data
  2. PU module data
  3. Pre-approved mappings

We plan to obtain the data on NUS modules through the NUSMods API. For PU module data and data on Pre-approved mappings, the data would ideally be obtained through web crawlers. However, this part of the project will not be implemented during Orbital due to time constraints. In addition, we will only be considering module mapping of SoC modules to modules from NTU for similar reasons. Module data from NTU will be obtained through the Modify API. Pre-approved module mappings will be extracted manually from the SoC website initially. The process will be automated later on if time permits.

Processing of data

To create a basis of comparison between the NUS modules and the PU modules , we plan on using Latent Dirichlet Allocation(LDA) to generate topics based on word frequency from the modules description. The following python packages will be used:

Before the data can be analyzed it must first be cleaned. First, the text needs to be tokenized. This can be done using the nltk tokenization module.

Conjunctions and words like “the” don’t provide any significant meaning are removed from the tokenized text. The stop_words package provides a comprehensive list of these stopwords. With it, the stopwords can be removed by looping through our tokens and comparing each word to the list of stopwords.

Next, different forms of the same word like “run” and “running” are then reduced the their root form through a process called stemming. The Porter Stemmer Algorithm is one of the most widely used method of stemming and will be implemented by importing the Porter Stemmer module from the nltk package.

After the data is cleaned, we then generate a document-term matrix which tells us how frequently each word appears in the document. Once that is done, we can pass the data through the LDA model from the gensim package. The LDA model will output a list of topics. Each topic contains the most probable words to appear along with their probabilities. An example of this will be :

['0.054*fundament + 0.054*concept' , '0.070*problem + 0.070*solv' ]

The topic models of two modules can then be compared to calculate the percentage match between the two models. We will need to do some experimentation to come up with the most suitable method to calculate the percentage match.

Storing of data

We intend to store the data using MongoDB which is packaged with meteorJS. Sample json format:

    [{
      {
        "NUSModuleCode": "CS1010E",
        "PUModuleCode": "FE1008",
        "ModuleName": "Computing",
        "Syllabus": "<insert module description here>"
        “Link”: “http://www.cee.ntu.edu.sg/Students/Undergraduate/Curriculum/Pages/courses/FE1008.aspx”,
        "UniversityName": "Nanyang Technological University",
        “Region”: “Asia”,
        "Similarity": 50,
        "PrevMatch": true
      },
     ...
    }]

UI/UX (implemented using Blaze/MeteorJS)

Users can search for modules by module code or module name and filter the results by region and PUs. After the search, the site will display:

to allow the user to compare and choose the most suitable match.

Timeline:

Week No. Main Task Side Tasks Important deadlines
Week 1 (15th May) Design & Implementation of first iteration of front end UI/UX Pull and format data from NUS mods API and Modify API. Manually extract pre-approved mappings from Soc webpage.
Week 2 (22nd May) Design & Implementation of first iteration of front end UI/UX Pull and format data from NUS mods API and Modify API. Automate extracting pre-approved module mappings from SoC website
Week 3 (29th May) Design & Implementation of first iteration of front end UI/UX Pull and format data from NUS mods API and Modify API. Automate extracting pre-approved module mappings from SoC website Evaluation Milestone 1
Week 4 (5th June) Set up & integrate firebase database Make changes to UI/UX based on user feedback
Week 5 (12th June) Set up & integrate firebase database Make changes to UI/UX based on user feedback
Week 6 (19th June) Implement & Improve on NLP algorithm Make changes to UI/UX based on user feedback. Testing of system with dummy data
Week 7 (26th June) Implement & Improve on NLP algorithm Make changes to UI/UX based on user feedback. Testing of system with dummy data Evaluation Milestone 2
Week 8 (3rd July) Implement & Improve on NLP algorithm Make changes to UI/UX based on user feedback. If NLP algorithm is already producing acceptable results, begin integration into the full system and expand the system to support another faculty
Week 9 (10th July) Implement & Improve on NLP algorithm Minor tweaking of UI/UX. If NLP algorithm is already producing acceptable results, begin integration into the full system and expand the system to support another faculty
Week 10 (17th July) Integration & Testing of full system Minor tweaking of UI/UX. Expand system to support another faculty if time permits
Week 11(24th July) Integration & Testing of full system Minor tweaking of UI/UX. Expand system to support another faculty if time permits Evaluation Milestone 3
ngzhian commented 7 years ago

Cool idea!

This part: "Pre-approved module mappings will be extracted manually from the SoC website." worries me, because it means that every semester someone has to remember to do this. I would put "automate extracting pre-approved module mappings from SoC website" as one of the goal, that will be useful.

How about expanding to other faculties? I understand that web scraping partner university data is a stretch goal (although that will be cool), but I think this system can be designed in an extensible manner so as to support other faculties. Will be great if one of the goal is to support another faculty, say FASS?

I think the JSON format needs reworking:

  1. We already have NUS modules info in another JSON files, do you want to replicate that data in your schema?
  2. Having - in the key might not be the best idea. Also how would you implement search? Say I want to find all modules that CS1010 can be mapped to?
  3. How do you intend to generate your JSON file? Are you going to query the API for every call the client makes, or do you intend to run this as a schedule job? If scheduled, how frequent?
ktnc985 commented 7 years ago

Hi @ngzhian , thanks for the feedback!

Automating the process of extracting pre-approved module mappings is definitely one of the goals we have for this project, but since we plan to start with 1 PU for SoC which has only 13 pre-approved mappings, we decided to manually extract the data at the start. We'll add the automation part into our project outline though!

Expanding to other faculties is also something we do plan on doing as well. If everything goes smoothly, we'll definitely start working on adding in another faculty and maybe even another PU! We'll add this part in our outline as well.

We agree that the JSON part needs to be reworked. You're right that we don't really need to replicate the NUS modules info in our schema, so we'll be taking that part out. We're both relatively new to using databases, so we're still trying to figure out how exactly the search will be implemented. The rough outline of our original plan was to go through the mappings and return those that contained CS1010. We think a better way of doing this is to use CS1010 as a key that gives a list of all the mappings of CS1010. We're also still in the process of figuring out how exactly the JSON file will be generated. We intend to run this as a scheduled job once a month.

li-kai commented 7 years ago

Closing this since Orbital has ended. UI was very nice, and idea is very useful...so useful we want it for nusmods haha.

Project url: https://nusmodulus.herokuapp.com/

P.S. Welcome contributions back any time.