openkm / document-management-system

OpenKM is an Open Source Document Management System
https://www.openkm.com/
GNU General Public License v2.0

OCR not working correctly in OpenKM 6.3.11 #303

Closed: TheKvist closed this issue 3 years ago

TheKvist commented 3 years ago

Hello there,

Up front: I'm a complete noob when it comes to both OpenKM and OCR, and I'm currently just experimenting, trying to get things to work based on the OpenKM Docker image. According to the docs, Tesseract seems to be the recommended OCR engine, so, knowing nothing about this topic, I chose to stick with it. However, I have not been able to get OCR to work.

After more testing, I have confirmed that the issue occurs only with openkm/opence:6.3.11; versions 6.3.9 and 6.3.8 work fine. The affected environments are:

OS: Windows 10 using Ubuntu 18.04 in WSL2
Docker Desktop: 3.5.2 (66501)
Docker Engine: 20.10.7
Docker Compose: 1.29.2
OpenKM: 6.3.11

and

OS: Debian 10 Buster
Docker Engine: 20.10.7, build f0df350
Docker Compose: 1.29.2, build 5becea4c
OpenKM: 6.3.11

The 6.3.11 Docker image has Tesseract version 4.0.0-beta.1 pre-installed, and running the binary from a bash shell inside the container works as intended. As the docs instruct, I set system.ocr to /usr/bin/tesseract ${fileIn} ${fileOut}.

[screenshot]

After configuring this, I went to Administration > Utilities > Check text extraction and uploaded an image that I had verified Tesseract can process correctly (specifically, this one). The utility finishes almost immediately with an empty result and reports that it used com.openkm.extractor.AbbyTextExtractor, even though this extractor is not configured anywhere.

Here are screenshots of the result and of registered.text.extractors, with AbbyTextExtractor nowhere to be found.

Test result: [screenshot]

registered.text.extractors: [screenshot]

To my dismay, the logs stay completely silent when testing the extraction. When I upload the image in question as a document, all I get is:

openkm_1  | 2021-08-30 09:25:00,068 [Thread-387] INFO  c.o.extractor.TextExtractorWorker - processSerial.Working on {docUuid=104695f5-d697-48f6-9eb5-3d305f6491b4, docPath=/okm:root/test/text-recognized-eng.png, docVerUuid=2cc47a6a-8782-4663-b32c-77c1ea110f3e, date=Mon Aug 30 09:24:16 UTC 2021}
openkm_1  | 2021-08-30 09:25:00,751 [Thread-387] WARN  com.openkm.dao.NodeDocumentDAO - There was a problem extracting text from '/okm:root/test/text-recognized-eng.png': Too few text extracted

Unfortunately, this is not enough detail to work out what exactly the problem is. My strongest guess right now is that it happens because OpenKM only has a Tesseract3TextExtractor and no Tesseract4TextExtractor. As said, this is merely a guess, and I'm not even sure it makes a difference, since the configured system.ocr command should still work with Tesseract 4.

So basically, I have three questions:

  1. Why does OpenKM think it should use AbbyTextExtractor in the first place, where is this configured?
  2. How can I tell OpenKM not to use it?
  3. How can I get OCR using Tesseract to work?

Thank you so much!

TheKvist commented 3 years ago

Some more info: while trying to get logging to work, I recreated the whole container, including the database, after which, to my surprise, OpenKM chose the Tesseract3TextExtractor. I then recreated the whole thing again to see whether I could reproduce it, and suddenly OpenKM selected the BarcodeTextExtractor. Recreating everything once more had no effect; it always chooses this extractor now.

Moreover, the docs on logging are incorrect. For 6.3 CE, they state that we need to create a /opt/tomcat/conf/log4j.properties, but creating such a file has no effect at all. There is also no "automatic reload" of the configuration at any point after editing that file.

After investigating the container logs, I found a message saying it was loading a logback configuration, which, again according to the docs, is only available from 6.4.28 onward, a version I don't even use. However, after editing /opt/tomcat/logback.xml, logging changed as expected, so 6.3 CE seems to use logback, not log4j.

With DEBUG logging enabled for the whole com.openkm package, testing text extraction gave me the following lines:

openkm_1  | 2021-08-30 10:39:05,783 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.s.a.CheckTextExtractionServlet - doPost(SecurityContextHolderAwareRequestWrapper[ org.springframework.security.web.context.HttpSessionSecurityContextRepository$Servlet3SaveToSessionRequestWrapper@22ea75ca], org.springframework.security.web.context.HttpSessionSecurityContextRepository$SaveToSessionResponseWrapper@208875a3)
openkm_1  | 2021-08-30 10:39:05,790 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:39:05,791 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.core.Config - getManager(classpath://com.openkm.extractor.**)
openkm_1  | 2021-08-30 10:39:07,061 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.HTMLTextExtractor)
openkm_1  | 2021-08-30 10:39:07,066 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsWordTextExtractor)
openkm_1  | 2021-08-30 10:39:07,067 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.NativeMsExcelTextExtractor)
openkm_1  | 2021-08-30 10:39:07,069 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.ExifTextExtractor)
openkm_1  | 2021-08-30 10:39:07,070 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.OpenOfficeTextExtractor)
openkm_1  | 2021-08-30 10:39:07,072 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsExcelTextExtractor)
openkm_1  | 2021-08-30 10:39:07,073 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsPowerPointTextExtractor)
openkm_1  | 2021-08-30 10:39:07,074 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.SourceCodeTextExtractor)
openkm_1  | 2021-08-30 10:39:07,075 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.BarcodeTextExtractor)
openkm_1  | 2021-08-30 10:39:07,076 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.CuneiformTextExtractor)
openkm_1  | 2021-08-30 10:39:07,077 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.XMLTextExtractor)
openkm_1  | 2021-08-30 10:39:07,079 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.RTFTextExtractor)
openkm_1  | 2021-08-30 10:39:07,080 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.PlainTextExtractor)
openkm_1  | 2021-08-30 10:39:07,081 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.PdfTextExtractor)
openkm_1  | 2021-08-30 10:39:07,084 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.Tesseract2TextExtractor)
openkm_1  | 2021-08-30 10:39:07,084 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.Tesseract3TextExtractor)
openkm_1  | 2021-08-30 10:39:07,086 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsOffice2007TextExtractor)
openkm_1  | 2021-08-30 10:39:07,086 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.OOTextExtractor)
openkm_1  | 2021-08-30 10:39:07,087 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.AudioTextExtractor)
openkm_1  | 2021-08-30 10:39:07,088 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.AbbyTextExtractor)
openkm_1  | 2021-08-30 10:39:07,089 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.GenericDAO - findByPk(com.openkm.extractor.MsOutlookTextExtractor)
openkm_1  | 2021-08-30 10:39:07,090 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.BarcodeTextExtractor
openkm_1  | 2021-08-30 10:39:07,090 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:39:07,090 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.BarcodeTextExtractor
openkm_1  | 2021-08-30 10:39:07,325 [http-nio-0.0.0.0-8080-exec-12] WARN  c.o.extractor.BarcodeTextExtractor - Failed to extract barcode text
openkm_1  | com.google.zxing.NotFoundException: null

So for some reason it seems to choose the BarcodeTextExtractor now before it even gets to consider any of the Tesseract*TextExtractor classes.

TheKvist commented 3 years ago

Another update: I keep recreating the setup to see what happens. Now, after landing on Tesseract3TextExtractor once more, it has changed to CuneiformTextExtractor with this log:

openkm_1  | 2021-08-30 10:59:52,617 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.s.a.CheckTextExtractionServlet - doPost(SecurityContextHolderAwareRequestWrapper[ org.springframework.security.web.context.HttpSessionSecurityContextRepository$Servlet3SaveToSessionRequestWrapper@4ac02e4f], org.springframework.security.web.context.HttpSessionSecurityContextRepository$SaveToSessionResponseWrapper@1c03a69b)
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.CuneiformTextExtractor
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - findExtractors(false)
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.RegisteredExtractors - Text extractor for 'image/png' found: class com.openkm.extractor.CuneiformTextExtractor
openkm_1  | 2021-08-30 10:59:52,619 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.MimeTypeDAO - findByName(image/png)
openkm_1  | 2021-08-30 10:59:52,623 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.dao.MimeTypeDAO - findByName: {id=19, name=image/png, description=PNG, search=true, imageMime=image/gif, imageContent=[BIG], extensions=[png]}
openkm_1  | 2021-08-30 10:59:52,624 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - runCmd(/usr/bin/tesseract /opt/tomcat/temp/okm3900154784953022994.png /opt/tomcat/temp/okm2471433750520939535.txt)
openkm_1  | 2021-08-30 10:59:52,625 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - runCmdImpl([/usr/bin/tesseract, /opt/tomcat/temp/okm3900154784953022994.png, /opt/tomcat/temp/okm2471433750520939535.txt], 300000)
openkm_1  | 2021-08-30 10:59:53,195 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - Normal program termination
openkm_1  | 2021-08-30 10:59:53,196 [http-nio-0.0.0.0-8080-exec-12] DEBUG com.openkm.util.ExecutionUtils - Elapse time: 00:00:00
openkm_1  | 2021-08-30 10:59:53,196 [http-nio-0.0.0.0-8080-exec-12] DEBUG c.o.extractor.CuneiformTextExtractor - TEXT:

Again, CuneiformTextExtractor is not even visible in registered.text.extractors, and I have not the slightest idea where OpenKM gets these classes from. Unlike before with the BarcodeTextExtractor, where it listed a whole bunch of extractors, it now only seems to know this one class.
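
The runCmd and runCmdImpl lines in the log above suggest that the system.ocr template is expanded with the temporary file paths and then split into an argument list. A minimal sketch of that idea follows. It is not OpenKM's ExecutionUtils; the class name OcrCommandSketch, the method expand, and the whitespace-based splitting are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; this is not OpenKM's ExecutionUtils. */
final class OcrCommandSketch {

    /**
     * Substitutes the ${fileIn} and ${fileOut} placeholders literally,
     * then splits the resulting command line on whitespace.
     */
    static List<String> expand(String template, String fileIn, String fileOut) {
        String cmd = template
                .replace("${fileIn}", fileIn)    // literal replacement, no regex
                .replace("${fileOut}", fileOut);
        return new ArrayList<>(Arrays.asList(cmd.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        // Hypothetical temp file names; the real ones are generated by OpenKM.
        List<String> argv = expand(
                "/usr/bin/tesseract ${fileIn} ${fileOut}",
                "/opt/tomcat/temp/in.png",
                "/opt/tomcat/temp/out.txt");
        System.out.println(argv);
    }
}
```

Splitting on whitespace like this would break for paths containing spaces, which is one reason the sketch should not be read as the real implementation.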

TheKvist commented 3 years ago

Okay, so after conducting some more tests, I am absolutely convinced this is a bug. OpenKM almost reliably cycles Tesseract3, Abby, Cuneiform, and Barcode between new setups.

In the Wiki it's stated that the registered.text.extractors property needs to be modified. However, as demonstrated before, none of the extractors OpenKM selects, except for the Tesseract3TextExtractor which would be the correct one, is registered there. Just to make sure, I've deleted everything except Tesseract from that list, but OpenKM simply ignores this property and chooses whatever it wants.

I've found that the system.ocr property has nothing to do with it. Rather, it looks like OpenKM is "rolling a die" on startup to decide which extractor to use, and then sticks with that choice until it's completely recreated from scratch.

I've created a demo repository, to be found here, with which you can test this. The README explains how to quickly nuke the setup and create a new one, after which OpenKM is almost guaranteed to have chosen a different TextExtractor, with the BarcodeTextExtractor being the most common.

I'll check out the source later today and have a go at debugging the RegisteredExtractors to see if I can find why that's happening.

darkman97i commented 3 years ago

Here is the code of the current text extractors: https://github.com/openkm/document-management-system/tree/master/src/main/java/com/openkm/extractor

The main class that processes each extractor based on document type is RegisteredExtractors. There you can see the warning: https://github.com/openkm/document-management-system/blob/eca35fc077d54f1941cf8244d1d6641e7ef4ed7e/src/main/java/com/openkm/extractor/RegisteredExtractors.java#L122 (it is logged because the extracted text is shorter than 16 characters).
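
The length rule described above can be sketched as follows. This is only an illustration of the 16-character threshold mentioned in the comment, not the code from RegisteredExtractors; the class name, the constant name, and the decision to trim whitespace before counting are assumptions.

```java
/** Illustrative sketch only; this is not OpenKM's RegisteredExtractors. */
final class TextLengthCheckSketch {

    // Threshold taken from the comment above; the constant name is hypothetical.
    static final int MIN_EXTRACTED_CHARS = 16;

    /** Returns true when the extracted text is missing or shorter than the threshold. */
    static boolean tooFewText(String extracted) {
        // Trimming before counting is an assumption of this sketch.
        return extracted == null || extracted.trim().length() < MIN_EXTRACTED_CHARS;
    }

    public static void main(String[] args) {
        // An empty OCR result, as produced when the wrong extractor is selected.
        System.out.println(tooFewText(""));
        // A result long enough to pass the check.
        System.out.println(tooFewText("This text was recognized by OCR."));
    }
}
```

Under this reading, an extractor that returns nothing for the image (for example a barcode reader on a page of text) would trigger the "Too few text extracted" warning without any ERROR being logged.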

TheKvist commented 3 years ago

Thanks for getting in touch,

don't worry, you don't need to explain to me what a WARN is :) I posted it there because I expected to see some ERROR, but since I didn't, I simply wanted to copy what exactly I was seeing so you knew what I was talking about.

Nevertheless, thank you for explaining the underlying problem with the ambiguity of extractors to me, and thank you even more for pointing towards the plugin section as that solved my problem/answered my initial question. Disabling Abby, Cuneiform, and Barcode in the TextExtractor plugin, of course, did the trick for 6.3.11 there.
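
The ambiguity can be pictured with a small sketch. It assumes, without confirmation from the OpenKM source, that the first enabled extractor claiming a MIME type wins and that the discovery order can differ between fresh setups. The class ExtractorSelectionSketch, the method pick, and both orderings are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Illustrative sketch only; this is not OpenKM's RegisteredExtractors. */
final class ExtractorSelectionSketch {

    /**
     * Returns the first extractor in discovery order that is not disabled.
     * Every name in the list is assumed to claim the same MIME type (image/png).
     */
    static Optional<String> pick(List<String> discoveryOrder, Set<String> disabled) {
        return discoveryOrder.stream()
                .filter(name -> !disabled.contains(name))
                .findFirst();
    }

    public static void main(String[] args) {
        // Two hypothetical discovery orders, as might result from two fresh setups.
        List<String> setupA = Arrays.asList("BarcodeTextExtractor", "Tesseract3TextExtractor",
                "AbbyTextExtractor", "CuneiformTextExtractor");
        List<String> setupB = Arrays.asList("CuneiformTextExtractor", "AbbyTextExtractor",
                "Tesseract3TextExtractor", "BarcodeTextExtractor");

        // Nothing disabled: the winner depends on the discovery order.
        Set<String> none = Collections.emptySet();
        System.out.println(pick(setupA, none).orElse("none"));
        System.out.println(pick(setupB, none).orElse("none"));

        // Competing extractors disabled: the result no longer depends on the order.
        Set<String> disabled = new HashSet<>(Arrays.asList(
                "AbbyTextExtractor", "CuneiformTextExtractor", "BarcodeTextExtractor"));
        System.out.println(pick(setupA, disabled).orElse("none"));
        System.out.println(pick(setupB, disabled).orElse("none"));
    }
}
```

With the three competing extractors disabled, only Tesseract3TextExtractor remains as a candidate for image/png in this model, which matches the behavior observed on 6.3.11 after the change.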

I feel this info should absolutely go into the docs about setting up OCR, don't you think so as well? I spent the better part of a day finding out what was going on and wasn't sure whether it was me or OpenKM who had lost their mind.

So thanks for clearing that up for me, again, it wasn't at all about the warning, but about the unpredictable behavior making no sense to me :)

I'll close the issue, but this really needs to be documented.

darkman97i commented 2 years ago

Hi @TheKvist

Following your suggestion, I have updated the documentation at https://docs.openkm.com/kcenter/view/okm-6.3-com/configuring-ocr-engine.html and added a new section at https://docs.openkm.com/kcenter/view/okm-6.3-com/plugins.html (I hope it is clearer now).