OCR all pdfs that come in to make scans searchable

mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction

https://alaveteli.org


OCR all pdfs that come in to make scans searchable #377

Open robinhouston opened 12 years ago

robinhouston commented 12 years ago

[Imported from https://github.com/mysociety/whatdotheyknow/issues/28, reported by samsmith]

The shell script and code seem to be here: http://blog.konradvoelkel.de/2010/01/linux-ocr-and-pdf-problem-solved/

There are many other ways to do it, but responses very commonly contain a scan of a printed document, and that content is currently not exposed to either WDTK search or external search engines.
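One way to find the attachments this affects is to check how much text can already be extracted from each page. The following is a minimal sketch, not anything Alaveteli currently does: it assumes the `pdftotext` tool from poppler-utils is installed, relies on `pdftotext` ending each page with a form feed character, and uses a hypothetical threshold of 20 non-whitespace characters per page. The function names are made up for illustration.

```python
import subprocess
from typing import List


def extract_text(pdf_path: str) -> str:
    """Return the text layer of a PDF using pdftotext ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def pages_needing_ocr(extracted_text: str, min_chars: int = 20) -> List[int]:
    """Return 1-based numbers of pages with almost no extractable text.

    Assumes pages are separated by form feeds, as in pdftotext output.
    min_chars is an arbitrary, hypothetical threshold.
    """
    pages = extracted_text.split("\f")
    # A trailing form feed leaves one empty element that is not a real page.
    if len(pages) > 1 and pages[-1].strip() == "":
        pages.pop()
    needing_ocr = []
    for number, page in enumerate(pages, start=1):
        # Count non-whitespace characters only.
        visible = len("".join(page.split()))
        if visible < min_chars:
            needing_ocr.append(number)
    return needing_ocr
```

Calling `pages_needing_ocr(extract_text("attachment.pdf"))` would then list the pages that are probably scans, so OCR only has to run where it is needed.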

Later comment by samsmith: this may be better as a services.mySociety.org-type service, since it could then also be used for all the "a copy has been placed in the library" documents that TWFY sees.

crowbot commented 9 years ago

More recent blog post: http://www.konradvoelkel.com/2013/03/scan-to-pdfa/

petterreinholdtsen commented 9 years ago

The Norwegian public archives currently publish many paper documents as pictures in PDF format. OCR will be needed to make the content of documents like <URL: https://www.mimesbronn.no/request/18/response/65/attach/html/2/attachment.pdf.html > available to those subscribing to keywords.

garethrees commented 7 years ago

Free OCR API and Online OCR https://ocr.space/ (via https://addons.mozilla.org/en-us/firefox/addon/copyfish-ocr-software/)

stevenday commented 7 years ago

FYI - we already have docsplit and tesseract running on the servers too, for https://www.patentoppositions.org/

petterreinholdtsen commented 7 years ago

<URL: http://www.tobias-elze.de/pdfsandwich/ > could be another useful alternative. I have not tested it myself, but it claims to add a hidden text layer to the original PDF using OCR, similar to what some scanners do.

-- Happy hacking Petter Reinholdtsen
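For anyone wanting to try the pdfsandwich route, here is a rough, untested sketch of how it could be driven from a script. It assumes `pdfsandwich` is on the PATH and that its `-lang` and `-o` options behave as its documentation describes; the helper names and the `_ocr.pdf` output naming are illustrative choices, not part of Alaveteli.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_copy_path(input_pdf):
    """Hypothetical naming scheme: response.pdf -> response_ocr.pdf."""
    path = Path(input_pdf)
    return path.with_name(path.stem + "_ocr.pdf")


def pdfsandwich_command(input_pdf, output_pdf, lang="eng"):
    """Build the command line; lang is a tesseract language code."""
    return ["pdfsandwich", "-lang", lang, "-o", str(output_pdf), str(input_pdf)]


def add_text_layer(input_pdf, lang="eng"):
    """Write a searchable copy next to the original and return its path."""
    if shutil.which("pdfsandwich") is None:
        raise RuntimeError("pdfsandwich is not installed")
    output_pdf = ocr_copy_path(input_pdf)
    subprocess.run(pdfsandwich_command(input_pdf, output_pdf, lang), check=True)
    return output_pdf
```

Writing to a separate file keeps the original attachment untouched, so the OCR result can be indexed for search without replacing what the authority actually sent. For Norwegian documents, `add_text_layer("attachment.pdf", lang="nor")` should work if the matching tesseract language data is installed.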