Open robinhouston opened 12 years ago
More recent blog post: http://www.konradvoelkel.com/2013/03/scan-to-pdfa/
The Norwegian public archives currently publish many documents as pictures of paper pages in PDF format. OCR will be needed to make the content of documents like <URL https://www.mimesbronn.no/request/18/response/65/attach/html/2/attachment.pdf.html > available to those subscribing to keywords.
Free OCR API and Online OCR https://ocr.space/ (via https://addons.mozilla.org/en-us/firefox/addon/copyfish-ocr-software/)
FYI - we already have docsplit and tesseract running on the servers too, for https://www.patentoppositions.org/
<URL: http://www.tobias-elze.de/pdfsandwich/ > could be another useful alternative. I have not tested it myself, but it claims to add a hidden text layer to the original PDF using OCR, similar to what some scanners do.
-- Happy hacking Petter Reinholdtsen
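For illustration, the "hidden text layer" idea can be sketched with tesseract, which is mentioned above as already running on the servers. This is only a minimal sketch under some assumptions: the function names are hypothetical, the scanned page has already been rasterized to an image (tesseract reads images, not PDFs), and the `pdf` output config is available in the installed tesseract version. It is not how WDTK or pdfsandwich actually do it.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base, lang="eng"):
    """Return the argv list for a tesseract run that writes <output_base>.pdf.

    The trailing "pdf" config asks tesseract for a PDF containing the page
    image plus an invisible, searchable text layer.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run tesseract (hypothetical helper) and return the expected output path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    cmd = build_tesseract_pdf_command(image_path, output_base, lang)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(cmd, check=True)
    return Path(str(output_base) + ".pdf")
```

A call such as `ocr_to_searchable_pdf("page-1.png", "page-1", lang="nor")` would, assuming the Norwegian language data is installed, leave a searchable `page-1.pdf` next to the input. Multi-page scans would need a rasterize-and-merge step around this, which is roughly what the tools linked in this thread automate.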
[Imported from https://github.com/mysociety/whatdotheyknow/issues/28, reported by samsmith]
The shell script and code seem to be here: http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/
There are many other ways to do it, but sending a scan of a printed document is extremely common in responses, and currently that content is not exposed to either WDTK search or search engines.
Later comment by samsmith: this may be better as a services.mySociety.org type thing, as then it could be used for all the "a copy has been placed in the library" documents that TWFY sees.