openkm / document-management-system

OpenKM is an Open Source Document Management System
https://www.openkm.com/
GNU General Public License v2.0

PDF Text extraction fails on 6.3.12 #348

Closed aldemira closed 2 years ago

aldemira commented 2 years ago

I just reverted to 6.3.9 and it works flawlessly. On 6.3.12 I tried rebuilding the indexes etc., but I see errors saying that text extraction failed, so search doesn't return anything at all. 6.3.9 works fine.

monkiki commented 2 years ago

I've checked and it works fine. So, please provide a sample PDF to test.

aldemira commented 2 years ago

OK, let's do this: I can't do a fresh install of 6.3.12 right now, so I'll be closing this issue. Whenever I can, I'll install a fresh copy and test it. Thanks.

aldemira commented 2 years ago

Sorry, I have to reopen this issue now. I've just installed 6.3.12 from scratch (with docker-compose), and here are the logs I'm getting:

2022-09-27 11:25:00,105 [Thread-181] INFO c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=d5a22248-29ae-4d42-aadc-551b810049e4, docPath=/okm:root/Video/intro-linux.pdf, docVerUuid=22de99e1-cb51-42e4-a67f-ff3da8064686, date=Tue Sep 27 11:22:47 UTC 2022}
2022-09-27 11:25:00,854 [Thread-181] WARN c.o.extractor.CuneiformTextExtractor - Undefined OCR application
2022-09-27 11:25:00,855 [Thread-181] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Video/intro-linux.pdf': Too few text extracted
2022-09-27 11:30:00,067 [Thread-208] INFO com.openkm.core.UserMailImporter - User mail importer activated
2022-09-27 11:30:00,085 [Thread-209] INFO c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=8f5e2b68-cbd1-45b6-a5f9-68fa46855fce, docPath=/okm:root/14F-Intro to Python-3.3.pdf, docVerUuid=e08e3c13-43ab-449d-9b9a-1a3fa891f6ed, date=Tue Sep 27 11:27:57 UTC 2022}
2022-09-27 11:30:00,088 [Thread-209] WARN c.o.extractor.CuneiformTextExtractor - Undefined OCR application
2022-09-27 11:30:00,089 [Thread-209] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/14F-Intro to Python-3.3.pdf': Too few text extracted

The files I've tested are:

https://www.tug.ca/tec/Sessions/Handouts/PDF/14F-Intro%20to%20Python-3.3.pdf
https://tldp.org/LDP/intro-linux/intro-linux.pdf

6.3.9 doesn't have this problem.
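
For anyone triaging a longer log of this kind, a small script can list which documents failed and why. The following is a minimal sketch, not part of OpenKM: it assumes one log entry per line and matches only the NodeDocumentDAO warning text quoted in the log above. The function name failed_extractions and the sample list are made up for illustration.

```python
import re

# Matches the NodeDocumentDAO warning wording seen in the log above.
# Assumption: other OpenKM versions may phrase this message differently.
FAILURE_RE = re.compile(
    r"There was a problem extracting text from '(?P<path>.+?)': (?P<reason>.+)$"
)


def failed_extractions(log_lines):
    """Return (document path, reason) pairs for each extraction warning."""
    failures = []
    for line in log_lines:
        match = FAILURE_RE.search(line)
        if match:
            failures.append((match.group("path"), match.group("reason").strip()))
    return failures


# Sample lines copied from the log in this thread, one entry per line.
sample = [
    "2022-09-27 11:25:00,854 [Thread-181] WARN c.o.extractor.CuneiformTextExtractor - Undefined OCR application",
    "2022-09-27 11:25:00,855 [Thread-181] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/Video/intro-linux.pdf': Too few text extracted",
    "2022-09-27 11:30:00,089 [Thread-209] WARN com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/14F-Intro to Python-3.3.pdf': Too few text extracted",
]

for path, reason in failed_extractions(sample):
    print(f"{path}: {reason}")
```

To run it against a real log, pass an open file object instead of the sample list; the log file location depends on the installation.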

aldemira commented 2 years ago

I kinda feel ashamed, but I think I forgot to delete the local volume (tomcat), which was the issue this time. So I reinstalled again, and now search and text extraction work. Sorry for spamming your inbox (yet again).

monkiki commented 2 years ago

Anyway, if you have this kind of problem again, check the list of text extractors, because you may have collisions.

Best regards.
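
The thread does not spell out what a collision looks like. One plausible reading is that two registered extractors claim the same MIME type, so the wrong one may end up handling a PDF. The sketch below illustrates that check under this assumption. The registry dictionary is hypothetical: the extractor names and MIME types are illustrative and are not read from an OpenKM installation, and find_collisions is not an OpenKM function.

```python
from collections import defaultdict


def find_collisions(extractors):
    """Return MIME types claimed by more than one extractor.

    `extractors` maps an extractor name to the MIME types it handles.
    """
    claimed = defaultdict(list)
    for name, mime_types in extractors.items():
        for mime in mime_types:
            claimed[mime].append(name)
    # Keep only the MIME types with at least two claimants.
    return {mime: names for mime, names in claimed.items() if len(names) > 1}


# Hypothetical registry, for illustration only.
registered = {
    "PdfTextExtractor": ["application/pdf"],
    "CuneiformTextExtractor": ["image/tiff", "image/png", "application/pdf"],
    "PlainTextExtractor": ["text/plain"],
}

for mime, names in find_collisions(registered).items():
    print(f"{mime} is claimed by: {', '.join(names)}")
```

In a real installation the mapping would have to be filled in by hand from the configured extractor list and each extractor's supported types.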