Closed · stumash closed this 7 years ago
@PeterGhimself:
About deleting your scraper: I'll do it. It's all in that `webscraping/node` folder, so I'll just delete the `webscraping/node` folder and leave the `webscraping/r` one.
About the two JSON files, `aeroeng_document.json` and `aeroeng_full-course-info.json`: `<programname>_document.json` is the JSON of the info after all the regex parsing, the "finished" JSON, whereas `<programname>_full-course-info.json` contains JSON objects which each have only one field, the full text string scraped for that course. If anything is confusing about what I just said, feel free to ask. Also, I'll be changing the filenames from `<programname>_document.json` to `<programname>_collection.json` since I think that's more appropriate.
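To illustrate the difference between the two files (the field names below are hypothetical, except for the one-field shape described above):

```javascript
// One entry in a <programname>_full-course-info.json file: a single
// field (its name is assumed here) holding the full unparsed text
// scraped for one course.
const rawEntry = {
  "full-course-info": "AERO 201 Intro to Aerospace Engineering. 3 credits. Lecture: 3 hours."
};

// The corresponding entry in <programname>_document.json, the "finished"
// JSON after regex parsing. These field names are hypothetical.
const parsedEntry = {
  "program": "aero-eng",
  "course.number": "AERO 201",
  "credits": 3
};

console.log(Object.keys(rawEntry).length); // 1: exactly one field per raw entry
```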
I made a separate folder just for my webscraper under node called `courseDataScraper`, so you should specifically get rid of `webscraping/node/courseDataScraper/dataScraper.js` and `webscraping/node/courseDataScraper/newDataScraper.js`.
Nice pull request dude, these changes look good save for what Peter mentioned, and hopefully the fact that .Rhistory showed up in these diffs was just an intermittent issue. I'm gonna go ahead and remove Peter's files.
The course data is being scraped into JSON files by `scrape-course-data.r`. Things to note:

**JSON structure**
If the scraped page doesn't provide a value for one of a course's data fields, like "lab.hours" for example, then "lab.hours" will simply not be a key in the JSON for that course's info.
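A minimal sketch of that behavior (field names other than "lab.hours" are made up): the page listed credits but no lab hours, so the key is absent rather than null.

```javascript
// Hypothetical scraped course info: the page provided credits but no
// lab hours, so "lab.hours" is simply not a key (note: not null).
const course = {
  "program": "aero-eng",
  "course.number": "AERO 201",
  "credits": 3
};

// Consumers should therefore check for key presence rather than null:
console.log("lab.hours" in course); // false
console.log("credits" in course);   // true
```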
**scrape sources incomplete**
`scrape-course-data.r` gets its list of pages to scrape from `courseinfo-data-sources.txt`, which contains all the info necessary for scraping our data sources. The file is formatted in groups of three lines: the first is the name of the program (e.g. "aero-eng"), the second is the url of the page that holds the info, and the third is the css selector of the element containing all the course info boxes. The problem is that `courseinfo-data-sources.txt` has only one courseinfo three-liner at the moment. We need more.

**mongodb storage**
We haven't written the node code to do it yet, but each course-info JSON will be a 'document' and we will have one large 'collection' of these documents called `course-info` or something. The next step is to write these files.

**cron job**
The R script needs to run as a cron job.
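To make the three-line format of `courseinfo-data-sources.txt` concrete, here's a hedged sketch of a sample entry and how node code could read it into program/url/selector triples. Only the "aero-eng" program name comes from the thread; the url and selector shown are made up.

```javascript
// Hypothetical contents of courseinfo-data-sources.txt: groups of three
// lines (program name, url, css selector). The url and selector below
// are invented for illustration.
const sample = [
  "aero-eng",
  "https://example.edu/calendar/aero-eng",
  "div.course-info-box",
].join("\n");

// Parse the file text into {program, url, selector} triples.
function parseSources(text) {
  const lines = text.split("\n").map(l => l.trim()).filter(l => l !== "");
  const sources = [];
  for (let i = 0; i + 3 <= lines.length; i += 3) {
    sources.push({ program: lines[i], url: lines[i + 1], selector: lines[i + 2] });
  }
  return sources;
}

console.log(parseSources(sample).length); // 1 (only one three-liner so far)
```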
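The mongodb step described above isn't written yet; as a non-authoritative sketch using the official `mongodb` node driver, it could look something like this. The connection string, database name, and file layout are assumptions; only the collection name `course-info` comes from the thread.

```javascript
// Sketch only: assumes the `mongodb` npm package is installed and that
// the connection string and database name below are what we settle on.
async function storeCourses(courses, uri = "mongodb://localhost:27017") {
  const { MongoClient } = require("mongodb"); // official node driver
  const client = new MongoClient(uri);
  await client.connect();
  try {
    // Each course's JSON becomes one 'document' in one large
    // 'collection' of these documents named "course-info".
    const coll = client.db("coursedb").collection("course-info");
    const result = await coll.insertMany(courses);
    return result.insertedCount;
  } finally {
    await client.close();
  }
}

// Usage (not run here): storeCourses(require("./aeroeng_collection.json"));
```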