nextcloud / files_fulltextsearch

🔍 Index the content of your files
GNU Affero General Public License v3.0

PDF text extraction not very reliable #29

Open janLo opened 5 years ago

janLo commented 5 years ago

I have a lot of PDF scans created with the "Scanbot" app. This app puts the OCRed text behind the actual scanned letters. This causes the text extraction of files_fulltextsearch to insert a lot of spaces where there are none (there are also no stray spaces when the text is copied out of the PDF viewer integrated in NC).

The result is that FTS is pretty useless, as words are only found if you know in advance where the text extraction inserted such spaces.
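The failure mode can be sketched with a minimal whitespace tokenizer standing in for the actual Elasticsearch analyzer (the word "Kindergarten" and the split points are made-up examples, not taken from a real scan):

```python
# Illustration only: spurious spaces inserted during PDF text extraction
# split a word into fragments, so a query for the original word no longer
# matches any indexed token.
def tokenize(text):
    """Minimal whitespace tokenizer, standing in for the index analyzer."""
    return text.lower().split()

extracted = "Kinder gar ten"   # hypothetical OCR layer with stray spaces
query = "kindergarten"

index = set(tokenize(extracted))
print(query in index)  # → False: only "kinder", "gar", "ten" were indexed
```

A search only succeeds if the query happens to match one of the fragments, which is exactly the "know in advance where the spaces are" problem described above.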

Sadly I wasn't able to find the place in the app where the extraction is done to investigate this further.

janLo commented 5 years ago

After a bit of digging it seems to me that FTS relies entirely on the text extraction of the ingest-attachment plugin of Elasticsearch. This uses Apache Tika for the job, which might actually be configurable through the AverageCharTolerance and SpacingTolerance properties.

It even has an interface where the configuration can be read from a property file: https://github.com/apache/tika/blob/cd51d93355bfadd39012f4fd99654ac9d94450dd/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java#L155
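Assuming the property keys mirror the field names that PDFParserConfig reads (an assumption from skimming the linked file, not a verified reference), such a properties file might look like this; the values shown are the documented PDFBox defaults, which one would lower or raise to tune space insertion:

```properties
# Hypothetical overrides for Tika's PDFParser.properties.
# Key names are assumed to match the fields read by PDFParserConfig.
# averageCharTolerance: glyph-spacing threshold used to infer word breaks.
averageCharTolerance=0.3
# spacingTolerance: lower values make Tika insert fewer spaces between
# closely spaced glyphs, which is what the OCR layer here would need.
spacingTolerance=0.5
```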

But sadly the Elasticsearch plugin does not use this and relies only on the default values: https://github.com/elastic/elasticsearch/blob/master/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java#L82

So the only options I see are to report an issue to Elasticsearch or to change the text extraction to something that can be influenced by NC itself.

janLo commented 5 years ago

FYI: https://github.com/elastic/elasticsearch/issues/36890