stev-ou / review_ocr

Using Google's Tesseract OCR to extract data from public PDFs
GNU General Public License v3.0

Scraping from bbdl #5

Closed samjett247 closed 4 years ago

samjett247 commented 5 years ago

Hey Zach and Joe, I had a little extra time these past few weeks and wanted to look into the scraping/OCR process a little further. I was trying to learn more about how Joe wrote it and to get more experience both working with an existing codebase and writing a scraper and parser. Along the way, I figured out and implemented a few optimizations:

I kept a lot of the macrostructure of the program, especially the organization of PDF files and the PDF splitter, the web crawler framework, the error handling during parsing, the multiprocessing for parsing, and the upload to Mongo. I put the data from the run, covering all colleges for the years 2010-2019, into ocr-db-v1 in Mongo.
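
As a rough illustration of the parsing stage described above (parallel workers with per-file error handling), here is a minimal Python sketch. The names `parse_pdf`, `safe_parse`, and `parse_all` are hypothetical and not taken from the repository; the OCR step is stubbed out and the Mongo upload is omitted entirely.

```python
from multiprocessing import Pool


def parse_pdf(path):
    """Hypothetical stand-in for the real OCR/parsing step."""
    if not path.endswith(".pdf"):
        raise ValueError(f"not a PDF: {path}")
    return {"file": path, "status": "parsed"}


def safe_parse(path):
    """Wrap parse_pdf so one bad file does not abort the whole run."""
    try:
        return ("ok", parse_pdf(path))
    except Exception as exc:
        # Record the failure instead of raising inside the worker.
        return ("error", {"file": path, "error": str(exc)})


def parse_all(paths, workers=2):
    """Parse files in parallel; return (parsed documents, failures)."""
    with Pool(processes=workers) as pool:
        results = pool.map(safe_parse, paths)
    parsed = [doc for status, doc in results if status == "ok"]
    failed = [doc for status, doc in results if status == "error"]
    return parsed, failed


if __name__ == "__main__":
    parsed, failed = parse_all(["a.pdf", "b.pdf", "notes.txt"])
    print(len(parsed), "parsed;", len(failed), "failed")
```

Returning a status tuple from each worker keeps failures as data, so the list of parsed documents can be uploaded in one batch afterwards and the failures logged separately.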

I'm planning to phase this dataset into use on the backend over the next week. I expect to run into a few kinks, but hopefully we can get this bigger dataset into use in our app. I've got a few more changes I want to make (modifying the README and adding more documentation), but I wanted to get your review here.

Thanks, SJ

zachschuermann commented 5 years ago

Good for review now? @samjett247

samjett247 commented 5 years ago

Good for review now? @samjett247

I'd say no, @schuermannator. I found a bug in the way I was getting the question ratings, and I need a little more time to go over it. If you've got time to work on something, try setting up GraphQL on the backend. I will have the scraping/parsing done, the data aggregated, and the site running on the more recent reviews by the end of the coming long weekend.

samjett247 commented 5 years ago

Update: I updated the README and added some more documentation. Moving to backend to adjust scraper and

Good to go @schuermannator

samjett247 commented 5 years ago

See the README for stats on scraping efficiency.