robbi5 / kleineanfragen

Collecting kleine Anfragen from Parlamentsdokumentationssystemen for easy search- and linkability
https://kleineanfragen.de
MIT License

tika vs pdftotext #101

Closed michamilz closed 7 years ago

michamilz commented 8 years ago

Could you please explain why you chose tika for text extraction from PDF files? pdftotext is a lot faster.

robbi5 commented 8 years ago

Currently, simply because tika has an HTTP API available. We're using http://givemetext.okfnlabs.org as a hosted tika instance. I also experimented with pdftotext/pdftohtml, especially for table recognition (see #96), but tika also includes tesseract for OCR-ing the scanned PDFs we get from some federal states.
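For illustration, a request against a tika HTTP API could look roughly like the sketch below. It assumes a stock tika-server running locally on its default port 9998 and uses that server's `PUT /tika` endpoint. The URL and the helper names `build_tika_request` and `extract_text` are made up for this example, and the hosted instance mentioned above may expose different paths.

```python
import urllib.request

# Assumption: a stock tika-server started locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request that asks the server for plain-text output."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={
            "Accept": "text/plain",             # request extracted text, not XHTML
            "Content-Type": "application/pdf",  # tell the server what is being sent
        },
    )


def extract_text(pdf_bytes, url=TIKA_URL):
    """Send the PDF to the server and return the extracted text."""
    request = build_tika_request(pdf_bytes, url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Because the extraction happens behind a plain HTTP call, the application itself does not need a PDF toolchain installed, which is the convenience described above.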

michamilz commented 8 years ago

Thank you. Good to know about this service.

For a similar website I use a multistep extraction. The first try is always pdftotext. If pdftotext returns an empty string, I run tesseract. Files from Word and Excel are converted using uniconv before running pdftotext. This works well for almost all files.
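A minimal sketch of such a fallback chain, assuming the `pdftotext`, `pdftoppm` and `tesseract` command-line tools are installed. The function names are hypothetical and this is not the actual code behind either website. Since tesseract works on images rather than PDFs, the OCR step here first rasterizes the pages with `pdftoppm`, which is an assumption about how that step might be done. The Word/Excel conversion step is left out.

```python
import glob
import os
import subprocess
import tempfile


def pdftotext_extract(path):
    """Fast path: read the embedded text layer; "-" writes to stdout."""
    result = subprocess.run(
        ["pdftotext", path, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout


def tesseract_extract(path):
    """Slow path: render each page to PNG, then OCR the images."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = os.path.join(tmp, "page")
        subprocess.run(["pdftoppm", "-r", "300", "-png", path, prefix], check=True)
        texts = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order
        for image in sorted(glob.glob(prefix + "*.png")):
            result = subprocess.run(
                ["tesseract", image, "stdout"],
                capture_output=True, text=True, check=True,
            )
            texts.append(result.stdout)
    return "\n".join(texts)


def extract_text(path, extractors=(pdftotext_extract, tesseract_extract)):
    """Return the output of the first extractor that yields non-blank text."""
    for extract in extractors:
        text = extract(path)
        if text.strip():  # whitespace and form feeds alone count as empty
            return text
    return ""
```

Passing the extractors as a parameter keeps the order explicit and lets the fallback logic be exercised with stand-in functions, without the external tools installed.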