opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0

Correctly recognized OCR data missing from search index #473

Open playbackandrewind opened 9 months ago

playbackandrewind commented 9 months ago

I have some .pdf files where OCR recognition of the graphics works perfectly, and the recognized text is displayed correctly in the OCR tab of the search results, but I cannot find this text or its contents in the search index itself.

Does anyone have an idea why the OCR text does not appear in the search index?

The extracted text tab only contains very poorly recognized text, e.g. "tems Ltg Am Rohiance 3 5S300 WetterCar" "Invoice 12345 6 AV"

In the OCR tab the text is correctly recognized: "Car Systems Ltd Am Rohlande 3 58300 Wetter" "Invoice 123456 /W"

A search for "123456", for example, returns no results. I'm a bit at a loss right now.
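One way to narrow this down is to query Solr directly instead of the search UI, to see whether the term is missing from the index itself or only from the rendered results. Below is a minimal sketch, assuming Solr on its default port 8983 and a core named "opensemanticsearch" (the core name may differ in your installation) and that the Python `requests` package is available.

```python
# Minimal sketch: ask Solr directly whether "123456" is in the index at all.
# Assumes Solr on localhost:8983 and a core named "opensemanticsearch";
# adjust both to match your installation.
import requests

SOLR_SELECT = "http://localhost:8983/solr/opensemanticsearch/select"  # assumed core name

resp = requests.get(
    SOLR_SELECT,
    params={"q": "123456", "rows": 5, "wt": "json"},
    timeout=30,
)
resp.raise_for_status()

hits = resp.json()["response"]["numFound"]
print("documents matching '123456':", hits)  # 0 here means the OCR text never reached the index
```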

mosea3 commented 9 months ago

Hi there,

OSS takes the filename and metadata directly into the index, but leaves the OCR data to be added later; that part is done by Apache Tika. Try indexing single files manually from the command line and see whether Tika returns an HTTP error 500. In my experience the service hangs when it has to process too much at a time, and when disk space runs low it also stops adding OCR.

Furthermore, there is a parameter somewhere to disable double OCR: if you have a better calibrated OCR solution beforehand, OSS will then take the text from the original OCRed PDF. By default it uses Google Tesseract with the English language model, so make sure you set the OCR language to the language of your document content.

I use Chronoscan with a mix of Tesseract and Nuance. It avoids unnecessary tokenization (the extra spaces).

Best regards
Andy
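As a rough illustration of the "index a single file and check Tika" suggestion above: the sketch below sends one PDF straight to a Tika server's REST endpoint and reports the HTTP status, so a 500 (or a timeout) shows up immediately. It assumes a Tika server is listening on localhost:9998 (Tika's default port) and that the Python `requests` package is installed; the file path and the `X-Tika-OCRLanguage` value ("deu") are placeholders to adapt to your setup and document language.

```python
# Minimal sketch: push one PDF through a Tika server and check for HTTP 500.
# Assumes a Tika server on localhost:9998 (Tika's default port); adjust the
# URL, file path and OCR language to your installation.
import requests

TIKA_URL = "http://localhost:9998/tika"     # assumed default Tika server endpoint
PDF_PATH = "/path/to/problem-document.pdf"  # placeholder: one of the affected files

with open(PDF_PATH, "rb") as f:
    resp = requests.put(
        TIKA_URL,
        data=f,
        headers={
            "Accept": "text/plain",             # ask Tika for plain extracted text
            "Content-Type": "application/pdf",
            "X-Tika-OCRLanguage": "deu",        # example Tesseract language model, adjust to your content
        },
        timeout=300,                            # OCR on large PDFs can be slow
    )

print("HTTP status:", resp.status_code)  # 500 here points at Tika/Tesseract, not at the index
print(resp.text[:1000])                  # first part of the extracted text for a quick sanity check
```

If this call already fails or returns the badly recognized text, the problem sits in the extraction/OCR step rather than in Solr indexing.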