Open konklone opened 10 years ago
As seen in 18F's blog today, 18F/doc_processing_toolkit handles both text extraction and OCRing. This could work for our purposes, though we should make it configurable, for those who don't want to set up Apache Tika and the like.
Case in point: https://www.si.edu/Content/OIG/Misc/FY16_CSA.pdf has two accessible words in it, "Appendix A".
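One possible way to detect cases like this, sketched here as an assumption rather than anything the project already does: count the words that plain text extraction returns and flag the report for OCR when the count per page is very low. The function name `needs_ocr`, its parameters, and the threshold of 20 words per page are all hypothetical and would need tuning against real reports.

```python
import re


def needs_ocr(extracted_text, page_count, min_words_per_page=20):
    """Guess whether a report needs OCR from how little text was extracted.

    Hypothetical helper: the threshold is an illustrative default, not a
    measured value.
    """
    # Count runs of two or more letters, so stray single characters
    # and punctuation left over from extraction are ignored.
    words = re.findall(r"[A-Za-z]{2,}", extracted_text)
    # Guard against a missing or zero page count.
    pages = max(page_count, 1)
    return len(words) / pages < min_words_per_page
```

A scraper-set flag could still override this check for agencies known to publish image-only documents.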
I'm not sure of the best path for detecting reports that need OCRing (perhaps through a flag set by the scraper), but we should have tesseract for OCRing of some reports.
I'm motivated by this report by the FBI, where only the cover sheet has text. The FBI, at least, has a clear practice of image-izing redacted documents:
http://www.justice.gov/oig/reports/2014/s140827.pdf
It's a great report, and has been getting news coverage. I did some very brief experimentation with OCR parameters for another project, and rendering pages at 300 dpi with 8-bit depth seemed good enough to me.
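As a rough sketch of what that could look like, the snippet below renders each page to a 300 dpi, 8-bit grayscale PNG and passes it to tesseract. The choice of `pdftoppm` (from poppler-utils) for rendering is an assumption, since the tools used in that earlier experiment aren't named here, and the function names and the `page` file prefix are hypothetical. Both command-line tools would need to be installed separately.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, out_prefix, dpi=300):
    # pdftoppm flags: -r sets the resolution, -gray produces 8-bit
    # grayscale output, and -png selects PNG as the image format.
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png",
            str(pdf_path), str(out_prefix)]


def ocr_command(image_path, out_base):
    # tesseract writes its recognized text to <out_base>.txt
    return ["tesseract", str(image_path), str(out_base)]


def ocr_pdf(pdf_path, work_dir):
    """Render every page of a PDF to an image, OCR it, and join the text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"  # hypothetical prefix for page images
    subprocess.run(rasterize_command(pdf_path, prefix), check=True)

    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        out_base = image.with_suffix("")
        subprocess.run(ocr_command(image, out_base), check=True)
        texts.append(Path(str(out_base) + ".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Keeping the command construction in small functions would make the resolution and tool choice easy to expose as configuration later.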