Scraping from bbdl - Githubissues

samjett247 commented 5 years ago

Hey Zach and Joe, I had a little extra time these past few weeks and wanted to look into the scraping/OCR process a little further. I was trying to learn more about how Joe wrote it and to get more experience both with working in an existing codebase and with writing a scraper and parser. Along the way, I figured out and implemented a few optimizations:

Using HTML headings (h1, h2, etc.) to find the correct PDFs during web crawling.
Using a text-extraction package called Tika instead of the previous approach of converting the PDFs to images with ImageMagick and then running Tesseract OCR. Tika reads the text straight from the PDFs, which helped eliminate some of the bugs we saw previously, such as the letter "eye" being recognized as a pipe (|).
Using regex heavily within the data parsing to separate and filter strings. Since the text from Tika is ordered differently than the OCR text, I had to rework a lot of the parser, but I think I got everything figured out.
Adding a small evaluation to the parser, which shows we successfully parsed 96.8% of individual pages from the PDFs; many of the pages we couldn't parse were just blank pages.
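
For anyone reading along, here is a rough sketch of what the heading-based PDF discovery in the first item could look like. It is not the code from this PR: it uses only the standard library, and the class name, the helper `pdfs_under`, the sample HTML, and the `example.edu` URLs are all made up for illustration. The idea is to remember the most recent heading while walking the page and keep only the PDF links that appear under a heading matching a keyword.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingPdfFinder(HTMLParser):
    """Collect PDF links together with the heading they appear under."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_heading = ""  # text of the most recent heading seen
        self._in_heading = False
        self.found = []            # list of (heading, absolute_pdf_url)

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # A new heading starts: reset the remembered heading text.
            self._in_heading = True
            self.current_heading = ""
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self.found.append(
                    (self.current_heading.strip(), urljoin(self.base_url, href))
                )

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.current_heading += data


def pdfs_under(html, base_url, keyword):
    """Return absolute PDF URLs listed under headings containing `keyword`."""
    finder = HeadingPdfFinder(base_url)
    finder.feed(html)
    finder.close()
    return [url for heading, url in finder.found
            if keyword.lower() in heading.lower()]


# Hypothetical page content, invented for this example.
SAMPLE_HTML = """
<h2>College of Engineering</h2>
<a href="/evals/engr_2019.pdf">Spring 2019</a>
<h2>Campus Maps</h2>
<a href="/maps/campus.pdf">Map</a>
<a href="/about.html">About</a>
"""

ENGINEERING_PDFS = pdfs_under(SAMPLE_HTML, "https://example.edu", "engineering")
print(ENGINEERING_PDFS)
```

Matching on the heading text rather than on the link text or file name is the design choice being illustrated: unrelated PDFs on the same page (the campus map above) are skipped even though they are valid PDF links.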

I kept a lot of the macrostructure of the program, especially the organization of pdf files and the pdf splitter, the web crawler framework, the error-handling during parsing, the multiprocessing for parsing, and the upload to Mongo. I put the data from the run, including all colleges for years 2010-2019, into ocr-db-v1 in Mongo.
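
To make the parsing side a bit more concrete, here is a minimal sketch of per-page error handling, parallel parsing, and the kind of success-rate evaluation mentioned above. Again, this is an illustration and not the repository's code: the page layout (an "Instructor:" line followed by "Q<n> ... <rating>" lines), the regexes, and every function name are assumptions. The input is assumed to be one string of already-extracted text per PDF page.

```python
import re
from multiprocessing import Pool

# Hypothetical page layout, invented for this sketch.
INSTRUCTOR_RE = re.compile(r"^Instructor:\s*(?P<name>.+?)\s*$", re.MULTILINE)
RATING_RE = re.compile(
    r"^Q(?P<num>\d+)\b.*?(?P<rating>\d\.\d{1,2})\s*$", re.MULTILINE
)


def parse_page(text):
    """Extract the instructor and per-question ratings; raise if missing."""
    instructor = INSTRUCTOR_RE.search(text)
    ratings = {int(m["num"]): float(m["rating"])
               for m in RATING_RE.finditer(text)}
    if instructor is None or not ratings:
        raise ValueError("page does not match the expected layout")
    return {"instructor": instructor["name"], "ratings": ratings}


def safe_parse(text):
    """Never raise: return (record, None) on success, (None, reason) on failure."""
    if not text.strip():
        return None, "blank page"
    try:
        return parse_page(text), None
    except ValueError as exc:
        return None, str(exc)


def summarize(results):
    """Count parsed and blank pages and compute the fraction parsed."""
    parsed = sum(1 for record, _ in results if record is not None)
    blank = sum(1 for _, reason in results if reason == "blank page")
    rate = parsed / len(results) if results else 0.0
    return {"pages": len(results), "parsed": parsed, "blank": blank, "rate": rate}


def parse_all(pages, workers=4):
    """Parse pages in worker processes; a bad page yields a failure tuple
    instead of stopping the run."""
    with Pool(workers) as pool:
        return pool.map(safe_parse, pages)


# Hypothetical sample pages: one well-formed, one blank, one unparseable.
SAMPLE_PAGES = [
    "Instructor: Smith, Jane\n"
    "Q1 The instructor was prepared 4.35\n"
    "Q2 Overall rating 3.9\n",
    "   \n",
    "Table of contents",
]

# Serial run for the small sample; parse_all() is the parallel equivalent.
RESULTS = [safe_parse(page) for page in SAMPLE_PAGES]
SUMMARY = summarize(RESULTS)
print(SUMMARY)
```

On a full run, `parse_all(pages)` would replace the serial list comprehension; it should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Recording a reason for each failure is what lets the summary separate blank pages from genuine parse failures.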

I'm planning to phase this dataset into use on the backend over the next week. I expect to run into a few kinks, but hopefully we can get this bigger dataset into use in our app. I've got a few more changes I want to make (updating the README and adding more documentation), but I wanted to get your review here.

Thanks SJ

zachschuermann commented 5 years ago

Good for review now? @samjett247

samjett247 commented 5 years ago

Good for review now? @samjett247

I'd say no, @schuermannator. I found a bug in the way I was getting the question ratings and I need a little more time to go over it. If you've got time to work on something, try to set up GraphQL on the backend. I will have this scraping/parsing done, the data aggregated, and the site running on the more recent reviews by the end of the coming long weekend.

samjett247 commented 5 years ago

Update: I updated the README and added some more documentation. Moving to backend to adjust scraper and

Good to go @schuermannator

samjett247 commented 5 years ago

See the README for stats on scraping efficiency.

stev-ou / review_ocr

Scraping from bbdl #5