tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

PDF output mangling image for TIFF input #535

Closed. jbreiden closed this issue 7 years ago

jbreiden commented 7 years ago

This means api->GetInputImage() is giving us a processed image.

test.tif.zip test.pdf

jbreiden commented 7 years ago

Emergency workaround while I go hunt down root cause.

--- tesseract/api/pdfrenderer.cpp   2016-11-21 08:45:47.000000000 -0800
+++ tesseract/api/pdfrenderer.cpp   2016-12-05 14:15:42.000000000 -0800
@@ -841,8 +841,8 @@
 bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
   size_t n;
   char buf[kBasicBufSize];
-  Pix *pix = api->GetInputImage();
   char *filename = (char *)api->GetInputName();
+  Pix *pix = pixRead(filename);
   int ppi = api->GetSourceYResolution();
   if (!pix || ppi <= 0)
     return false;

jbreiden commented 7 years ago

This change also fixes it, at the cost of extra memory. It probably leaks, too.

--- tesseract/api/baseapi.cpp   2016-12-05 08:51:32.000000000 -0800
+++ tesseract/api/baseapi.cpp   2016-12-05 14:47:16.000000000 -0800
@@ -523,7 +523,7 @@
   if (InternalSetImage()) {
     thresholder_->SetImage(imagedata, width, height,
                            bytes_per_pixel, bytes_per_line);
-    SetInputImage(thresholder_->GetPixRect());
+    SetInputImage(pixCopy(NULL, thresholder_->GetPixRect()));
   }
 }

@@ -545,7 +545,7 @@
 void TessBaseAPI::SetImage(Pix* pix) {
   if (InternalSetImage()) {
     thresholder_->SetImage(pix);
-    SetInputImage(thresholder_->GetPixRect());
+    SetInputImage(pixCopy(NULL, thresholder_->GetPixRect()));
   }
 }

jbreiden commented 7 years ago

This one is probably best.

--- tesseract/ccmain/thresholder.cpp    2016-03-11 14:29:36.000000000 -0800
+++ tesseract/ccmain/thresholder.cpp    2016-12-05 15:00:46.000000000 -0800
@@ -225,7 +225,7 @@
 Pix* ImageThresholder::GetPixRect() {
   if (IsFullImage()) {
     // Just clone the whole thing.
-    return pixClone(pix_);
+    return pixCopy(NULL, pix_);
   } else {
     // Crop to the given rectangle.
     Box* box = boxCreate(rect_left_, rect_top_, rect_width_, rect_height_);
@@ -322,4 +322,3 @@
 }

 }  // namespace tesseract.
-

jbreiden commented 7 years ago

This bug happens when:

For example, the attached test image is TIFF G4. Converting it to an identical-looking LZW-compressed grayscale TIFF does not trigger the bug.

jbreiden commented 7 years ago

Ray found the exact spot. This is the final answer.

--- tesseract/ccmain/thresholder.cpp    2016-03-11 14:29:36.000000000 -0800
+++ tesseract/ccmain/thresholder.cpp    2016-12-05 15:27:45.000000000 -0800
@@ -181,8 +181,9 @@
 // Caller must use pixDestroy to free the created Pix.
 void ImageThresholder::ThresholdToPix(PageSegMode pageseg_mode, Pix** pix) {
   if (pix_channels_ == 0) {
-    // We have a binary image, so it just has to be cloned.
-    *pix = GetPixRect();
+    // We have a binary image, so it just has to be copied.
+    // Don't clone or you'll mess up api->GetInputImage()
+    *pix = pixCopy(NULL, GetPixRect());
   } else {
     OtsuThresholdRectToPix(pix_, pix);
   }
@@ -322,4 +323,3 @@
 }

 }  // namespace tesseract.
-
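
To illustrate why cloning "messes up" api->GetInputImage(), here is a minimal, self-contained sketch. It does not use Leptonica: ToyPix, ToyClone, ToyCopy, ToyDestroy, InvertInPlace and InputSurvives are hypothetical stand-ins, and the inversion is only a placeholder for whatever in-place processing happens to the thresholded image later. The point is that a clone is another handle to the same pixels, while a copy owns its own buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Leptonica's Pix: pixel data plus a reference count.
struct ToyPix {
  std::vector<uint8_t> data;
  int refcount = 1;
};

// Models pixClone(): same object, one more reference.
ToyPix* ToyClone(ToyPix* p) {
  ++p->refcount;
  return p;
}

// Models pixCopy(NULL, p): a new object with its own pixel buffer.
ToyPix* ToyCopy(const ToyPix* p) {
  ToyPix* q = new ToyPix;
  q->data = p->data;
  return q;
}

// Models pixDestroy(): drops one reference, frees on the last one,
// and nulls the caller's handle.
void ToyDestroy(ToyPix** p) {
  if (p == nullptr || *p == nullptr) return;
  if (--(*p)->refcount == 0) delete *p;
  *p = nullptr;
}

// Placeholder for any later in-place processing of the working image.
void InvertInPlace(ToyPix* p) {
  for (uint8_t& b : p->data) b = static_cast<uint8_t>(~b);
}

// Returns true if the "input image" is unchanged after the working image,
// obtained by copy or by clone, has been processed in place.
bool InputSurvives(bool use_copy) {
  ToyPix* input = new ToyPix;
  input->data = {0x00, 0xFF, 0x0F};
  const std::vector<uint8_t> original = input->data;

  ToyPix* working = use_copy ? ToyCopy(input) : ToyClone(input);
  InvertInPlace(working);

  const bool intact = (input->data == original);
  ToyDestroy(&working);
  ToyDestroy(&input);
  return intact;
}
```

In this model InputSurvives(false) reports a damaged input and InputSurvives(true) reports an intact one, which mirrors the clone-to-copy change in the patch above.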

jbreiden commented 7 years ago

Note that this bug affects all versions of Tesseract capable of producing PDF output, both 3.0.x and 4.x.

jbreiden commented 7 years ago

... And the code above is leaky. Ray is doing the final final final version right now.
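
One plausible reading of the leak, sketched with a toy model rather than Leptonica: GetPixRect() hands back a new reference, and passing it straight into a copy call means nobody ever releases it. ToyPix, ToyClone, ToyCopy, ToyDestroy, LeakyThreshold, BalancedThreshold and RefcountAfter are hypothetical names for illustration only, and this is not necessarily what the final commit does.

```cpp
#include <cassert>

// Hypothetical stand-in for a reference-counted Pix (pixel data omitted).
struct ToyPix {
  int refcount = 1;
};

// Models a function that returns a new reference to the same image.
ToyPix* ToyClone(ToyPix* p) {
  ++p->refcount;
  return p;
}

// Models a deep copy: a fresh object with its own reference count.
ToyPix* ToyCopy(const ToyPix*) {
  return new ToyPix;
}

// Drops one reference, frees on the last one, and nulls the handle.
void ToyDestroy(ToyPix** p) {
  if (p == nullptr || *p == nullptr) return;
  if (--(*p)->refcount == 0) delete *p;
  *p = nullptr;
}

// Leaky shape: the reference returned by ToyClone is never released.
ToyPix* LeakyThreshold(ToyPix* source) {
  return ToyCopy(ToyClone(source));
}

// Balanced shape: hold the temporary reference, copy it, then release it.
ToyPix* BalancedThreshold(ToyPix* source) {
  ToyPix* rect = ToyClone(source);
  ToyPix* out = ToyCopy(rect);
  ToyDestroy(&rect);
  return out;
}

// Returns the source's reference count after one threshold call.
// The source lives on the stack and its own reference is never dropped,
// so ToyDestroy never tries to delete it.
int RefcountAfter(bool balanced) {
  ToyPix source;
  ToyPix* out = balanced ? BalancedThreshold(&source) : LeakyThreshold(&source);
  const int count = source.refcount;
  ToyDestroy(&out);
  return count;
}
```

A count that stays above 1 after the call means a reference was left dangling, so the underlying image could never be freed.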

theraysmith commented 7 years ago

Fixed in commit 7744da9..025689f.