Integrate OCRing where needed

konklone commented 10 years ago

I'm not sure of the best way to detect reports that need OCRing (perhaps a flag set by the scraper), but we should use tesseract to OCR some reports.

I'm motivated by this report by the FBI, where only the cover sheet has text. The FBI, at least, has a clear practice of image-izing redacted documents:

http://www.justice.gov/oig/reports/2014/s140827.pdf

It's a great report, and has been getting news coverage. I did some very brief experimentation with OCR parameters for another project, and rendering pages at 300 dpi in 8-bit grayscale seemed good enough to me.
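
For illustration, here is a rough sketch of what a 300 dpi, 8-bit pipeline could look like. It only builds the command lines and does not run them. It assumes Ghostscript (`gs`) is used to rasterize the PDF, which is one option among several and not something settled in this thread, and that the installed tesseract can read multi-page TIFFs. The function names and the output file naming are hypothetical.

```python
from pathlib import Path


def rasterize_command(pdf_path, tiff_path, dpi=300):
    """Build a Ghostscript command that renders a PDF to a grayscale TIFF.

    Assumption: Ghostscript is the rasterizer. Its "tiffgray" device
    writes 8-bit grayscale output, and -r sets the resolution in dpi.
    """
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",
        f"-r{dpi}",
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def ocr_command(tiff_path, output_base, language="eng"):
    """Build a tesseract command. tesseract appends ".txt" to output_base."""
    return ["tesseract", str(tiff_path), str(output_base), "-l", language]


def ocr_commands_for(pdf_path):
    """Return the two commands, in order, for one report (hypothetical naming)."""
    pdf_path = Path(pdf_path)
    tiff_path = pdf_path.with_suffix(".tiff")
    output_base = pdf_path.with_suffix("")
    return [
        rasterize_command(pdf_path, tiff_path),
        ocr_command(tiff_path, output_base),
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)`. Keeping the command construction separate from execution makes it easy to skip the step entirely when tesseract is not installed.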

divergentdave commented 8 years ago

As seen in 18F's blog today, 18F/doc_processing_toolkit handles both text extraction and OCRing. This could work for our purposes, though we should make it configurable, for those who don't want to set up Apache Tika and the like.

divergentdave commented 8 years ago

Case in point: https://www.si.edu/Content/OIG/Misc/FY16_CSA.pdf has two accessible words in it, "Appendix A".
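
Cases like this one suggest a simple automatic check as an alternative, or a complement, to a scraper-set flag: count the words that plain text extraction finds on each page and flag the report when most pages come back nearly empty. The sketch below assumes the extracted text is already available as one string per page. The function names and both thresholds are guesses for illustration and would need tuning against real reports.

```python
def sparse_pages(page_texts, min_words=20):
    """Return indexes of pages whose extracted text has fewer than min_words words."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.split()) < min_words
    ]


def needs_ocr(page_texts, min_words=20, max_sparse_fraction=0.5):
    """Guess whether a report is mostly page images.

    Assumption: a report needs OCR when more than max_sparse_fraction of
    its pages yield almost no text. A report with no pages at all is
    also flagged, since nothing could be extracted from it.
    """
    if not page_texts:
        return True
    sparse = sparse_pages(page_texts, min_words)
    return len(sparse) / len(page_texts) > max_sparse_fraction
```

With these defaults, a report whose only extracted text is "Appendix A" would be flagged, as would one where only the cover sheet has text, while a report with a few blank or image-only pages among many text pages would not.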

unitedstates / inspectors-general

Integrate OCRing where needed #163