otomist / uplanner

Online class planner for UMass Amherst students.

Scrape classes #1

Open otomist opened 6 years ago

otomist commented 6 years ago

I'm going to work on this

kngan43 commented 6 years ago

I was able to scrape all course information from Spire. This is done primarily using two frameworks together:

  1. Selenium
  2. Scrapy

The advantages of using both frameworks are discussed in this link. After logging into Spire, the pages are loaded using JavaScript, so Selenium is used to automate the browser and navigate through the website. Scrapy's built-in support is then used to extract data from the HTML source.

Once we have logged into Spire, navigating through the website is fairly simple, as we only need to find the web element for each navigation button and click on it. Note that a wait function needs to be called after each click, before searching for an element on the next page.

Once we get to the search page, we limit the search by term, major, and university, and start the search. We are then brought to the search results page, which consists of a list of courses, all of their sections, and some basic information about each of those.

In SpireSpider.py, I have scraped the title of each course, without its sections, along with some basic information about the course. In another script, SectionSpider.py, I scraped each section of the courses with its information.

To scrape each course section, we need to click on its corresponding link, get the page source, and then press the view search results button. Note that Selenium ties each web element to the page it was found on within the session, so iterating through a previously collected list of links will not work after the page is refreshed. This means that we need to search for the web element of the next link every time we go back to the search results.
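The loop can be sketched as follows: count the links once, then look the list up again on every pass and pick the link by index, so no element reference is reused across page loads. This is not the code in SectionSpider.py, just an illustration; the function name, the `wait_for` callback, and the locators in the usage comment are all assumptions.

```python
from typing import Callable, List, Tuple

# A locator is a (strategy, value) pair, e.g. (By.CSS_SELECTOR, "a.section").
Locator = Tuple[str, str]


def collect_section_sources(
    driver,
    link_locator: Locator,
    back_locator: Locator,
    wait_for: Callable[[Locator], None],
) -> List[str]:
    """Visit every section link on the results page and return each page source.

    `driver` is anything with Selenium's find_element / find_elements /
    page_source interface. `wait_for` is a caller-supplied function that
    blocks until the given locator is present on the current page.
    """
    sources = []
    link_count = len(driver.find_elements(*link_locator))
    for index in range(link_count):
        # Re-locate the links on every pass: references found before the
        # last page change are stale and cannot be clicked.
        links = driver.find_elements(*link_locator)
        links[index].click()
        wait_for(back_locator)
        sources.append(driver.page_source)
        # Return to the search results before looking up the next link.
        driver.find_element(*back_locator).click()
        wait_for(link_locator)
    return sources


# Hypothetical usage with a real Selenium driver:
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support import expected_conditions as EC
#   from selenium.webdriver.support.ui import WebDriverWait
#   wait = lambda loc: WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located(loc))
#   pages = collect_section_sources(
#       driver, (By.CSS_SELECTOR, "a.section"), (By.ID, "back-button"), wait)
```

Indexing into a freshly fetched list assumes the results page shows the links in the same order each time we come back to it.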