stumash / CoursePlanner

http://conucourseplanner.online
MIT License

Course data scraper #25

Closed: stumash closed this issue 7 years ago

stumash commented 7 years ago

The course data is being scraped into JSON files by scrape-course-data.r. Things to note:

JSON structure: If the scraped page doesn't provide a value for one of a course's data fields ("lab.hours", for example), then that field will simply not be a key in the JSON for that course's info.
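To illustrate what that means for anything reading these files, here is a minimal Python sketch. Only the "lab.hours" key comes from the scraper; the "code" key and both sample records are made up for the example. The point is that readers should treat every field as optional and check for the key rather than expect a null or empty value.

```python
import json

# Made-up sample data: only "lab.hours" is a real field name from the scraper,
# the "code" key and the course values are assumptions for illustration.
raw = """[
  {"code": "EXAMPLE 101", "lab.hours": "2"},
  {"code": "EXAMPLE 102"}
]"""

courses = json.loads(raw)


def lab_hours(course):
    """Return the lab hours value, or None when the key was never scraped."""
    return course.get("lab.hours")


for course in courses:
    # A missing field shows up as an absent key, not as null or "".
    print(course["code"], lab_hours(course))
```

Using `dict.get` (or an explicit `"lab.hours" in course` check) avoids a `KeyError` for courses whose page had no lab information.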

Scrape sources incomplete: scrape-course-data.r gets its list of pages to scrape from courseinfo-data-sources.txt, which contains all the info necessary for scraping our data sources. The file is formatted in groups of three lines: the first is the name of the program (e.g. "aero-eng"), the second is the URL that holds the info, and the third is the CSS selector of the element containing all the course info boxes. The problem is that `courseinfo-data-sources.txt` has only one courseinfo three-liner at the moment. We need more.
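For anyone adding entries, the three-line format can be sketched as follows in Python. This is not the R script's actual parsing code. The `Source` type, the `parse_sources` function, the sample URL and the sample selector are all hypothetical, and the sketch assumes blank lines between groups can be ignored.

```python
from typing import List, NamedTuple


class Source(NamedTuple):
    """One three-line group: program name, page URL, CSS selector."""
    program: str
    url: str
    selector: str


def parse_sources(text: str) -> List[Source]:
    # Assumption: blank lines are not significant and can be dropped.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected groups of three lines, got %d lines" % len(lines))
    return [Source(*lines[i:i + 3]) for i in range(0, len(lines), 3)]


# Hypothetical file contents; the URL and selector are placeholders.
sample = """aero-eng
http://example.com/aero-eng-courses
div.course-list
"""

sources = parse_sources(sample)
print(sources[0].program, sources[0].selector)
```

Rejecting a line count that is not a multiple of three catches the most likely mistake when more entries are added by hand, which is a missing or extra line.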

MongoDB storage: We haven't written the Node code to do it yet, but each course info JSON object will be a 'document', and we will have one large 'collection' of these documents called course-info or something similar. The next step is to write these files.
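The document/collection idea can be sketched like this. The plan is to do it in Node, so this Python version only illustrates the shape. It assumes each scraped file is a JSON array of course objects, and `FakeCollection` is a hypothetical stand-in for a real driver collection (for instance a pymongo `Collection`, which also exposes `insert_many`).

```python
import json


def load_documents(json_text):
    """Parse one scraped file, assumed to be a JSON array of course objects."""
    docs = json.loads(json_text)
    if not isinstance(docs, list):
        raise ValueError("expected a JSON array of course documents")
    return docs


def store(collection, docs):
    """Insert every course as its own document; return how many were stored."""
    if docs:  # skip the call entirely when there is nothing to insert
        collection.insert_many(docs)
    return len(docs)


class FakeCollection:
    """Hypothetical in-memory stand-in so the sketch runs without a database."""

    def __init__(self):
        self.documents = []

    def insert_many(self, docs):
        self.documents.extend(docs)


# Made-up sample data; only "lab.hours" is a real field name.
sample = '[{"code": "EXAMPLE 101", "lab.hours": "2"}, {"code": "EXAMPLE 102"}]'

course_info = FakeCollection()
stored = store(course_info, load_documents(sample))
print(stored, "documents stored")
```

Because fields are optional in the scraped JSON, a schemaless document store fits well: documents in the same collection do not need to share the same keys.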

Cron job: The R script needs to run as a cron job.

stumash commented 7 years ago

@PeterGhimself :

About deleting your scraper: I'll do it. It's all in the webscraping/node folder, so I'll just delete that folder and leave the webscraping/r one.

About the two JSON files, aeroeng_document.json and aeroeng_full-course-info.json: <programname>_document.json is the JSON of the info after all the regex parsing (the "finished" JSON), whereas <programname>_full-course-info.json contains JSON objects that each have only one field, the full text string scraped for that course.

If anything I just said is confusing, feel free to ask. Also, I'll be changing the filenames from <programname>_document.json to <programname>_collection.json, since I think that's more appropriate.

PeterGhimself commented 7 years ago

I made a separate folder just for my webscraper under node called courseDataScraper. So you should get rid of specifically webscraping/node/courseDataScraper/dataScraper.js and webscraping/node/courseDataScraper/newDataScraper.js.

Davidster commented 7 years ago

Nice pull request, dude. These changes look good, save for what Peter mentioned, and hopefully the fact that .Rhistory showed up in these diffs was just a one-off issue. I'm gonna go ahead and remove Peter's files.