[FEATURE] Catalog Major Scraper

TrevorBrunette commented 2 years ago

Scrape the baccalaureate programs from http://catalog.rpi.edu/content.php?catoid=22&navoid=542. Follow each link to the major program and scrape the description HTML (first innermost ) and the requirements HTML (second innermost ). The output should look like this:

{
  "Baccalaureate": {
    "Architecture": {
      "description": "<raw description HTML>",
      "requirements": "<raw requirements HTML>"
    }
  }
}
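A minimal sketch of the link-collection step, using only the standard library. It assumes that program links on the index page contain `preview_program.php` in their `href` (as the Computer Science URL in this issue does); `ProgramLinkParser` and `find_program_links` are hypothetical names, not existing scraper code. Fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: every program link points at preview_program.php.
PROGRAM_MARKER = "preview_program.php"


class ProgramLinkParser(HTMLParser):
    """Collect program name -> absolute URL for links that look like programs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = {}      # program name -> absolute URL
        self._href = None    # URL of the program <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if PROGRAM_MARKER in href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:
                self.links[name] = self._href
            self._href = None


def find_program_links(index_html, base_url):
    """Return {program name: absolute URL} found in the index page HTML."""
    parser = ProgramLinkParser(base_url)
    parser.feed(index_html)
    parser.close()
    return parser.links
```

Each returned URL can then be fetched and its description and requirements stored under the program name in the `"Baccalaureate"` dictionary.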

TrevorBrunette commented 2 years ago

Now that we have this general scraper, we need to make some modifications. In the raw description HTML, remove the <table class="table_default"> and <p class="acalog-breadcrumb acalog-highlight-ignore"> elements from every description that contains them.
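One possible way to strip those two elements, sketched with the standard library's `html.parser`. `DescriptionCleaner` and `clean_description` are hypothetical names. The sketch assumes the unwanted `<p>` and `<table>` elements are explicitly closed, and it re-emits everything else as it appeared in the input.

```python
from html.parser import HTMLParser

# Elements to drop, as (tag name, class names that must all be present).
UNWANTED = [
    ("table", {"table_default"}),
    ("p", {"acalog-breadcrumb", "acalog-highlight-ignore"}),
]


class DescriptionCleaner(HTMLParser):
    """Re-emit HTML while skipping the unwanted elements and their contents."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self._skip_tag = None   # tag name of the element being skipped
        self._depth = 0         # nesting depth of that tag while skipping

    def _is_unwanted(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        return any(tag == t and required <= classes for t, required in UNWANTED)

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._depth += 1    # nested element with the same tag name
            return
        if self._is_unwanted(tag, attrs):
            self._skip_tag, self._depth = tag, 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are emitted once, unless skipped.
        if self._skip_tag is None and not self._is_unwanted(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._depth -= 1
                if self._depth == 0:
                    self._skip_tag = None
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_tag is None:
            self.out.append(data)

    def handle_entityref(self, name):
        if self._skip_tag is None:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._skip_tag is None:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if self._skip_tag is None:
            self.out.append(f"<!--{data}-->")


def clean_description(raw_html):
    """Return raw_html without the unwanted table and breadcrumb elements."""
    cleaner = DescriptionCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.out)
```

Tracking the nesting depth keeps a `<table>` nested inside the unwanted table from ending the skip too early.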

You will also need to write a parser for the majors. This will be a large assignment. First, get the HTML contents of every section whose header contains "Year" ("First Year", "Second Year", and so on). Create an array with one entry per "Year" section and store each section's content in it. For now, you can simply save the inner HTML content.

After parsing the "Year" sections, parse the additional sections that follow them into a dictionary. Each of these sections has to be opened up to reach the desired HTML. Use the header text as the key. As the value, store the raw HTML inside the section's <div> that comes after the <hr> (which directly follows the header) and before the end of the <div>.
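The split between "Year" sections and everything else could look roughly like this. It is a sketch that assumes the sections have already been extracted, in page order, as `(header text, HTML)` pairs with the `<hr>` handling already done; `split_sections` is a hypothetical name.

```python
def split_sections(sections):
    """Split (header, html) pairs into the "years" list and "other-content" dict.

    A section counts as a year when its header contains "Year".
    """
    years = []
    other_content = {}
    for header, html in sections:
        if "Year" in header:
            years.append(html)
        else:
            other_content[header] = html
    return {"years": years, "other-content": other_content}
```

The returned dictionary has the same `"years"` and `"other-content"` keys as the expected output for a program.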

Sometimes (but not always) there is a section that does not have a header within it. You can recognize one by a <div class="custom_leftpad_20"> as the outermost <div> after the courses. The "Options" section in the page below is an example of this. It will be followed by a <div> containing the raw HTML we want. Content we need can appear in both the first "header" div and the second "content" div, so retrieve the content of the "header" div from after the <hr> to the end of the <div>, and then append the content of the "content" <div> to it.
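The "after the `<hr>`" rule and the merge of the two divs can be sketched as below. Both functions are hypothetical helpers that work on inner HTML strings which are assumed to have been extracted already. A regular expression is used here only to find the first `<hr>` tag, not to parse the HTML in general.

```python
import re

# Matches the first <hr>, <hr/> or <hr /> tag, case-insensitively.
_HR = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)


def content_after_hr(div_inner_html):
    """Return what follows the first <hr>, or the whole string if there is none."""
    parts = _HR.split(div_inner_html, maxsplit=1)
    return parts[1] if len(parts) == 2 else div_inner_html


def merge_headerless(header_div_inner, content_div_inner):
    """Append the "content" div's HTML to the "header" div's post-<hr> HTML."""
    return content_after_hr(header_div_inner) + content_div_inner
```

For the "Options" case, the merged string is what would be stored under the `"Options"` key.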

Using the Computer Science page (http://catalog.rpi.edu/preview_program.php?catoid=22&poid=5366&returnto=542) as an example, we should get:

{
  "Baccalaureate": {
    "Computer Science": {
      "description": "<raw description HTML>",
      "years": [
        "<inner HTML of 'First Year'>",
        "<inner HTML of 'Second Year'>",
        "<inner HTML of 'Third Year'>",
        "<inner HTML of 'Fourth Year'>"
      ],
      "other-content": {
        "Options": "<options HTML><section-after-options HTML>",
        "Capstone": "<capstone HTML>",
        "Transfer Credit Policy": "<transfer-credit-policy HTML>",
        "Footnotes": "<footnotes HTML>"
      }
    }
  }
}

TrevorBrunette commented 2 years ago

More work to be done with parsing

rpi-crisis / scraper

[FEATURE] Catalog Major Scraper #9