stumash / CoursePlanner

http://conucourseplanner.online
MIT License

Fix course info scraper 84 #91

Closed stumash closed 7 years ago

stumash commented 7 years ago

2 changes to regexes

First change

The course.info.header.rgx was: [A-Z]{4} [0-9]{3}[[:space:]]+?[A-Z][a-z]+., but is now: [A-Z]{4} [0-9]{3}[[:space:]]+?(\\(also listed as [^)]*\\))?[A-Z][a-z]+.

We use this regex to identify the start of a single course's information. Some courses' information starts with something like: COMP 101 (also listed as SOEN 101) Intro. to Programming instead of: COMP 101 Intro. to Programming.
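As a rough illustration of the header change, here is a minimal Python sketch comparing the old and new patterns. It rests on several assumptions: Python's re module has no POSIX classes, so [[:space:]] is written as \s; the doubled backslashes are treated as string escaping and reduced to single ones; and the sample strings and the names OLD_HEADER, NEW_HEADER, plain and cross_listed are made up for the demo. The cross-listed sample has the title directly after the closing parenthesis, because the new pattern as written allows no whitespace at that position.

```python
import re

# Assumption: \s stands in for the POSIX class [[:space:]],
# which Python's re module does not support.
OLD_HEADER = re.compile(r"[A-Z]{4} [0-9]{3}\s+?[A-Z][a-z]+.")
NEW_HEADER = re.compile(
    r"[A-Z]{4} [0-9]{3}\s+?(\(also listed as [^)]*\))?[A-Z][a-z]+."
)

# Hypothetical sample headers, not taken from the scraped data.
plain = "COMP 101 Intro. to Programming"
cross_listed = "COMP 101 (also listed as SOEN 101)Intro. to Programming"

for text in (plain, cross_listed):
    old_hit = OLD_HEADER.search(text) is not None
    new_hit = NEW_HEADER.search(text) is not None
    print(f"{text!r}: old={old_hit}, new={new_hit}")
```

Under these assumptions, the old pattern finds no header in the cross-listed sample, which is consistent with such a course being absorbed into the previous course's description.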

Second change

The course.description.rgx was: .*?(?=(Lecture|Tutorial|Laboratory|\nNOTE|$)), but is now: (.*?)(Lecture|Tutorial|Laboratory|NOTE|$)

The previous regex was essentially broken; the new one does what it was meant to do. The new regex matches everything from the start of the string up to the first occurrence of Lecture, Tutorial, Laboratory, or NOTE, or up to $ (end of string) if none of them appears. The description itself is captured by the first group.
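The difference between the two description patterns can be sketched in Python as follows. This is only an illustration under assumptions: the scraper's actual regex engine and flags are not stated here, the sample text is invented, and the names OLD_DESC, NEW_DESC and sample are hypothetical. It shows one way the old pattern could over-capture: it only stops at NOTE when NOTE follows a newline.

```python
import re

# Patterns as quoted above, compiled with Python's default flags (assumption).
OLD_DESC = re.compile(r".*?(?=(Lecture|Tutorial|Laboratory|\nNOTE|$))")
NEW_DESC = re.compile(r"(.*?)(Lecture|Tutorial|Laboratory|NOTE|$)")

# Hypothetical course text where NOTE is on the same line as the description.
sample = "An overview of computing. NOTE: Not open to majors."

# The old pattern requires a newline before NOTE, so it runs to end of string.
old_description = OLD_DESC.match(sample).group(0)

# The new pattern stops at NOTE; group 1 holds the description only.
new_description = NEW_DESC.match(sample).group(1)

print(repr(old_description))
print(repr(new_description))
```

Calling strip() on the captured group would remove the trailing space left before the stop word.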

resolves #84

stumash commented 7 years ago

Well, the script doesn't output anything new, in the sense that stdout is the same as it was. However, the course-info JSON it creates has some slight differences. There were some courses whose entire info was being absorbed by the description of the previous course. This was fixed by the first change. So there are now a few courses being scraped as separate courses, where before all their info was part of some other course's description. Also, the description property of some courses was including the NOTE/Lecture/Tutorial/Laboratory information, which was fixed by the second change.