rpi-crisis / scraper

Web scrapers for the RCOS project Correcting Rensselaer's Insufferable SIS (CRISIS)
https://rpicrisis.org

[FEATURE] HASS Pathway Scraper and Documentation #15

Open TrevorBrunette opened 2 years ago

TrevorBrunette commented 2 years ago

We need a HASS Pathway scraper which will collect the HASS pathways from http://catalog.rpi.edu/index.php

THIS IS A TWO PERSON JOB and will require guidance. Please find someone to work closely with on this, and check in with Trevor or Lily periodically and before making any changes to the specifications.

Please comment your code - each method should have a description of what it does and what it returns, and any code that is difficult to understand should be commented to explain what it does.

The program must load this page and ensure that it is on the most recent catalog (of the current year). Then it must load the "Programs" page. From the "Programs" page, it will gather all of the links in the "Integrative Pathway" section. It will then go to each of those links and record the following information in the following output JSON:

{
  "<pathway name>": {
    "description": "<description text>",
    "requirements": "<raw requirements html>"
  },
  "<other pathway name>": {
    "description": "<other description text>",
    "requirements": "<other raw requirements html>"
  }
}

Using this page as an example (http://catalog.rpi.edu/preview_program.php?catoid=22&poid=5545&returnto=542)

The "pathway name" is whatever is in the <h1 id="acalog-page-title">

Economics 

The "description text" is all of the text within each <p> below the <p class="acalog-breadcrumb acalog-highlight-ignore">

Study different types of theories and statistical methods used by economists. Students are prepared to gain a broad understanding of how consumers, firms and governments make decisions, and their implications.
To complete this integrative pathway, students must choose a minimum of 12 credits as described: 

The "raw requirements html" is the content within <div class="custom_leftpad_20">

<div class="acalog-core"><h2><a name="ChooseOneOfTheFollowing"></a><a name="chooseoneofthefollowing" id="core_34991"></a>Choose one of the following:</h2><hr><ul><li class="acalog-course"><span><a href="#" aria-expanded="false" onclick="showCourse('22', '45000',this, 'a:2:{s:8:~location~;s:7:~program~;s:4:~core~;s:5:~34991~;}'); return false;">ECON 1200 - Introductory Economics</a> <em>Credit Hours:</em> <em>4</em></span></li><li class="acalog-course"><span><a href="#" aria-expanded="false" onclick="showCourse('22', '43470',this, 'a:2:{s:8:~location~;s:7:~program~;s:4:~core~;s:5:~34991~;}'); return false;">IHSS 1200 - Principles of Economics</a> <em>Credit Hours:</em> <em>4</em></span></li></ul></div><div class="acalog-core"><h2><a name="ChooseOneOfTheFollowing"></a><a name="chooseoneofthefollowing" id="core_34992"></a>Choose one of the following:</h2><hr><ul>
    <li>2000-level ECON Elective <em>Credit Hours: 4</em></li>
</ul>
<ul><li class="acalog-course"><span><a href="#" aria-expanded="false" onclick="showCourse('22', '43471',this, 'a:2:{s:8:~location~;s:7:~program~;s:4:~core~;s:5:~34992~;}'); return false;">ECON 2010 - Intermediate Microeconomic Theory</a> <em>Credit Hours:</em> <em>4</em></span></li><li class="acalog-course"><span><a href="#" aria-expanded="false" onclick="showCourse('22', '43472',this, 'a:2:{s:8:~location~;s:7:~program~;s:4:~core~;s:5:~34992~;}'); return false;">ECON 2020 - Intermediate Macroeconomic Theory</a> <em>Credit Hours:</em> <em>4</em></span></li></ul></div><div class="acalog-core"><h2><a name="ChooseRemainingCreditsFrom"></a><a name="chooseremainingcreditsfrom" id="core_34993"></a>Choose remaining credits from:</h2><hr><ul>
    <li>Any 4000-level ECON course</li>
</ul>
</div><div class="acalog-core"><h2><a name="CompatibleMinor"></a><a name="compatibleminor" id="core_34994"></a>Compatible minor:</h2><hr><p><a href="preview_program.php?catoid=22&amp;poid=5380">Economics Minor</a><span style="display: none !important">&nbsp;</span>&nbsp;</p>
</div> 
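As a starting point for the pathway name described above, here is a minimal sketch that pulls the text out of <h1 id="acalog-page-title"> using only the Python standard library. The class and function names (TitleParser, extract_pathway_name) are made up for illustration, and the sketch assumes the page contains exactly one such <h1>.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside <h1 id="acalog-page-title">."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h1" and dict(attrs).get("id") == "acalog-page-title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_pathway_name(page_html):
    """Returns the pathway name as a stripped string ('' if not found)."""
    parser = TitleParser()
    parser.feed(page_html)
    parser.close()
    return "".join(parser.title_parts).strip()
```

The description and raw requirements could be collected in the same style, but their surrounding markup should be checked against the live page first.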

After this task is completed, create a function which will parse the "raw requirements html" and turn it into useful information of the form

{
  "req1": {
    "num-required": 1,
    "options": "ECON-1200 or IHSS-1200"
  },
  "req2": {
    "num-required": 1,
    "options": "2000-level ECON Elective or ECON-2010 or ECON-2020"
  },
  "remaining": {
    "options": "Any 4000-level ECON course"
  },
  "other": [
    {
      "Compatible minor:": "Economics Minor"
    }
  ]
}

Please keep in mind that there can be any number of required courses in a req# section, and that there can be any number of req# sections in the HASS Pathway listing. "num-required" is the number of courses required from that section. Since it says "Choose one of the following", we must obtain from that the number 1 for req1, and then do the same process for req2. If you find the use of " or " in between possible options to be insufficient, you can choose another reasonable delimiter such as "|", " | ", or " OR ". If there is a course that is simply required for a pathway, the "num-required" should be set to 0 and the "options" should only list the course id of that course.
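One possible shape for the parser function is sketched below, using only the standard library. It is not a finished implementation: the names (CoreParser, clean_item, parse_requirements) are hypothetical, it assumes every section starts with an <h2> and lists its options in <li> or <p> elements as in the Economics example, and it assumes the count in the heading is spelled out as a word. Sections for a single required course ("num-required": 0) are not handled here and would fall into "other".

```python
import re
from html.parser import HTMLParser

# Assumption: headings spell the count as a word ("Choose one of the following:").
WORD_TO_NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
COURSE_RE = re.compile(r"^([A-Z]{4}) (\d{4}) - ")
CREDIT_HOURS_RE = re.compile(r"\s*Credit Hours:\s*\d+$")


class CoreParser(HTMLParser):
    """Splits the requirements HTML into [heading, [item text, ...]] sections."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._target = None  # "h2", "item", or None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._target, self._buf = "h2", []
        elif tag in ("li", "p") and self.sections:
            self._target, self._buf = "item", []

    def handle_endtag(self, tag):
        # Collapse runs of whitespace (including non-breaking spaces).
        text = " ".join("".join(self._buf).split())
        if tag == "h2" and self._target == "h2":
            self.sections.append([text, []])
            self._target = None
        elif tag in ("li", "p") and self._target == "item":
            if text:
                self.sections[-1][1].append(text)
            self._target = None

    def handle_data(self, data):
        if self._target:
            self._buf.append(data)


def clean_item(text):
    """Drops a trailing 'Credit Hours: N' and shortens courses to 'ECON-1200'."""
    text = CREDIT_HOURS_RE.sub("", text)
    match = COURSE_RE.match(text)
    return f"{match.group(1)}-{match.group(2)}" if match else text


def parse_requirements(raw_html):
    """Returns a dict with req#, remaining, and other keys."""
    parser = CoreParser()
    parser.feed(raw_html)
    parser.close()
    result, other, req_count = {}, [], 0
    for heading, items in parser.sections:
        cleaned = [clean_item(item) for item in items]
        lowered = heading.lower()
        match = re.match(r"choose (\w+) of the following", lowered)
        if match and match.group(1) in WORD_TO_NUM:
            req_count += 1
            result[f"req{req_count}"] = {
                "num-required": WORD_TO_NUM[match.group(1)],
                "options": " or ".join(cleaned),
            }
        elif lowered.startswith("choose remaining credits"):
            result["remaining"] = {"options": " or ".join(cleaned)}
        else:
            other.append({heading: " ".join(cleaned)})
    result["other"] = other
    return result
```

Swapping the " or " delimiter for "|" or " OR " only requires changing the two join calls.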

Also consider the possibility of there not being a "Choose remaining credits from the following:" section, and the number of credits listed in the text "with at least <number> credits at the <level>-level", which occasionally appears. It is not displayed above, but you will want to account for this with a "<level>-credits" field with an integer value in the remaining category (e.g. "4000-credits": 8) to show that 8 credits are required at the 4000 level.
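The credit-count text could be picked up with a regular expression along these lines. This is only a sketch: it assumes the phrase reads like "with at least 8 credits at the 4000-level", with both numbers written as digits, which should be confirmed against real pathway pages. The function name level_credit_fields is hypothetical.

```python
import re

# Assumed phrasing: "with at least 8 credits at the 4000-level"
CREDITS_RE = re.compile(
    r"with at least (\d+) credits at the (\d{4})-level", re.IGNORECASE
)


def level_credit_fields(text):
    """Returns e.g. {"4000-credits": 8}, or {} if the phrase is absent."""
    return {
        f"{level}-credits": int(count)
        for count, level in CREDITS_RE.findall(text)
    }
```

The returned dict can be merged into the "remaining" entry with dict.update, so pathways without the phrase simply gain no extra fields.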

The "other" section is for any remaining category titles and the text within them. Usually this is just "Compatible minor:": "".

After the code for this is complete, the initial output JSON should no longer store the "raw requirements html"; store the output of passing that "raw requirements html" to the parser function instead.
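A small sketch of that final step, assuming the scraper has already produced a dict in the first output format and that the parser function is passed in as a callable. The names build_output, scraped_pages, and parse_fn are placeholders, not part of the spec.

```python
import json


def build_output(scraped_pages, parse_fn):
    """Replaces each pathway's raw requirements html with parse_fn's output.

    scraped_pages: {name: {"description": str, "requirements": raw_html}}
    Returns a new dict in the same shape; the input is left unchanged.
    """
    return {
        name: {
            "description": page["description"],
            "requirements": parse_fn(page["requirements"]),
        }
        for name, page in scraped_pages.items()
    }


def to_json(output):
    """Serializes the final output with readable indentation."""
    return json.dumps(output, indent=2)
```

Passing the parser in as an argument keeps the scraping and parsing halves of the job separately testable, which may help when two people split the work.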