stev-ou / review_ocr

Using Google's Tesseract OCR to extract data from public PDFs
GNU General Public License v3.0
Scraped Data does not include anything more recent than 2013 #2

Open samjett247 opened 5 years ago

samjett247 commented 5 years ago

For the new instructor chip we built, we need to know how long the prof. has been at OU. But given the current data, we can only know 6 years back in the scraped dataset. Would this be a relatively easy fix to scrape all of the data @jlovoi ?

jlovoi commented 5 years ago

Yeah, I can go ahead and scrape all the data for sure, just hoping that there isn’t anything kooky and wacky and crazy about those PDFs... I guess it wouldn’t be an issue anyway, since it would only need to check for a single instance of the teacher’s name.

On Jun 13, 2019, at 16:05, Sam Jett notifications@github.com wrote:

For the new instructor chip we built, we need to know how long the prof. has been at OU. But given the current data, we can only know 6 years back in the scraped dataset. Would this be a relatively easy fix to scrape all of the data @jlovoi ?

samjett247 commented 5 years ago

@jlovoi Feel free to run this whenever you get a chance. Might be nice to get it done before job start? Just a thought. Let me know if I can help with anything :)

samjett247 commented 5 years ago

@jlovoi Do you think you would be able to run this scraper this weekend to get the larger dataset (from further back) and the entries from Fall 2018 + Spring 2019?

samjett247 commented 5 years ago

Addressed in #5