tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Invisible glyph bounds at wrong positions in PDF #2879

Closed THausherr closed 1 month ago

THausherr commented 4 years ago

Environment

Call:

"C:\Program Files\Tesseract-OCR\tesseract" scan.tif scan-ocr pdf

Current Behavior:

text bounds are not identical to visible glyphs in Adobe Reader. Example:

(screenshot)

Expected Behavior:

text bounds should be identical to visible glyphs in Adobe Reader. In the graphic, the blue color should cover the "n".

Suggested Fix:

I suspect that the /W array is missing from the font dictionary (screenshot), so Adobe will use the /DW 500 entry (screenshot from the PDF 32000 specification).

scan-ocr.pdf scan.tif.zip

jbreiden commented 4 years ago

Interesting suggestion. If correct, why would it show up as an n - 1 problem in highlighting?

THausherr commented 4 years ago

Sorry, I don't understand what you mean. My argument is that the highlight widths don't match. Adobe gets these from the font data, and widths are different in a proportional font. And it isn't just the "n". When trying to highlight the "I" it looks like this: (screenshot)

jbreiden commented 4 years ago

The glyphless font deliberately uses equal width for every character. I stretch the word using Tz in the PDF to make it fit. So I expect word highlighting to look correct, but not character highlighting within a word. This design was chosen to maximize compatibility across all the scripts supported by Tesseract while minimizing complexity.
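A minimal sketch of the stretch calculation described above, assuming the glyphless font advances every character by 500 units per 1000-unit em (the /DW 500 entry mentioned earlier) and that Tz is expressed in percent. `HorizontalStretchPercent` and its parameters are hypothetical names for illustration, not Tesseract's API:

```cpp
#include <cassert>

// Assumption: every glyph advances 1000 / kCharWidth = 500 units,
// i.e. half the font size, before Tz is applied.
constexpr int kCharWidth = 2;

// Hypothetical helper: returns the Tz operand (in percent) that makes
// `char_count` fixed-width characters set at `font_size` points span
// exactly `word_width` points.
double HorizontalStretchPercent(double word_width, double font_size,
                                int char_count) {
  assert(font_size > 0.0 && char_count > 0);
  return kCharWidth * 100.0 * word_width / (font_size * char_count);
}
```

For a 12-character word at 11 pt the unscaled width is 12 × 5.5 = 66 pt, so a 66 pt word box gives Tz 100. Every character in the word gets the same share of the box, which is why only whole-word highlighting is expected to line up.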

THausherr commented 4 years ago

I had a look with the glyph contour display of PDFBox and there it matches the word bounds: (screenshot)

So maybe Adobe is to blame, but users will of course see this differently :-(

THausherr commented 4 years ago

I think I found a bit more... "Introduction" has 12 characters but looks like this in the PDF content stream:

    1 0 0 1 77.76 738.16 Tm /f-0-0 11 Tf 107.076 Tz [ <0049006E00740072006F00640075006300740069006F006E0020> ] TJ

This is 13 characters. The last one (0020) is a space. This space is positioned over the final "n".
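The character count can be checked mechanically. The sketch below assumes 2-byte codes (four hex digits each), as in the content stream quoted above; `DecodeHexCodes` is a made-up helper name, and the angle brackets must be stripped before calling it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Splits the hex payload of a PDF hex string into 16-bit codes.
// Assumption: every code is exactly four hex digits long.
std::vector<std::uint16_t> DecodeHexCodes(const std::string &hex) {
  assert(hex.size() % 4 == 0);
  std::vector<std::uint16_t> codes;
  for (std::size_t i = 0; i < hex.size(); i += 4) {
    // Parse one group of four hex digits as a base-16 number.
    codes.push_back(static_cast<std::uint16_t>(
        std::stoul(hex.substr(i, 4), nullptr, 16)));
  }
  return codes;
}
```

Applied to the string above, this yields 13 codes, the last of which is 0x0020.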

THausherr commented 4 years ago

When removing "3 Tr" so that the "invisible" font gets visible, it looks like this: (screenshot) This is really 13 characters. For some reason, Adobe doesn't want to mark the final space.

THausherr commented 4 years ago

I just see that the PDFBox screenshot shows it too: "ISO" has 4 characters, "32000" has 6 characters.

Maybe the original idea was to put the space there for text extraction? However, it isn't needed: good text extractors "imagine" the space from the position differences.

If the space character is needed, then it should be positioned over the actual space.

amitdo commented 4 years ago

https://github.com/tesseract-ocr/tesseract/issues/1900

THausherr commented 4 years ago

Thanks, after reading that one, I think this issue is also somewhat duplicate of https://github.com/jbarlow83/OCRmyPDF/issues/450 .

amitdo commented 4 years ago

You should check the bounding box of the whole word 'Introduction' with the hocr format. Does it also end before the last glyph?

jbreiden commented 4 years ago

Tesseract's recognizer just finds words, and doesn't tell us anything about spaces. Which makes sense: how would an OCR program know if there is one space, two spaces, etc? We add the space in during PDF generation to help some viewers with copy-paste; otherwise it is common for words to run together. Apple's viewer is notorious for this. I'm a little reluctant to put a space outside the word bounding box: there is no guarantee there will be room for it, and I don't really want the PDF output module to get into the layout analysis game. One possibility might be to play with the font such that U+0020 gets zero (or non-zero) width, while every other character maintains the same fixed width we've always had. Then adjust the Tz word stretch appropriately.

https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp#L471

I haven't touched the font in a while, so not sure how easy it is to make a change like this. If you want to play with this yourself, I recommend using the program "ttx" from fonttools to transform the font into an XML file. Edit the file, then transform it back. I have a feeling it won't be trivial but it might be possible. See also the design discussion at the top of pdfrenderer.cpp, which explains how everything works.

THausherr commented 4 years ago

Yeah I understand that this feature was implemented to "help" low quality text extractors.

How about making the feature configurable for PDF? IMHO the majority user expectation is whatever Adobe does, that is the gold standard.

Zero width space also sounds like an interesting idea to explore. You probably have to add appropriate /W entries.

(The reason I created this issue: we're using a commercial OCR tool on a fast-growing project. The OCR is fine, but licensing is a pain, it doesn't use all CPU cores, logging is almost nonexistent, and the whole thing is a black box. So I was thinking about replacing it with Tesseract, but before we discuss this with the client I need to be sure that the client, and its clients too, would be satisfied.)

THausherr commented 4 years ago

@amitdo The bounding box is correct:

   <div class='ocr_carea' id='block_1_2' title="bbox 324 400 643 442">
    <p class='ocr_par' id='par_1_2' lang='eng' title="bbox 324 400 643 442">
     <span class='ocr_line' id='line_1_2' title="bbox 324 400 643 442; baseline 0 -1; x_size 47.393444; x_descenders 6.3934426; x_ascenders 11">
      <span class='ocrx_word' id='word_1_3' title='bbox 324 400 643 442; x_wconf 95'>Introduction</span>
     </span>
    </p>
   </div>

amitdo commented 4 years ago

Adobe Acrobat is not as popular as it used to be 10 years ago.

Default PDF viewers:

So most users will use the OS's or browser's built-in PDF viewer, which is not Adobe's viewer.

The best solution is to find a method that will work on all these viewers, without a special parameter for a specific viewer.

amitdo commented 4 years ago

I tested your pdf file with Chromium (pdfium), Firefox (pdf.js) and Evince (poppler).

The word bounding boxes look very good when the page is viewed with pdfium/pdf.js.

Poppler suffers from the same issue you raised above combined with a 'zebra effect'.

THausherr commented 4 years ago

With PDF.js on Firefox, double click marks the whole word; when I mark the final "n", I get a space.

With Chrome, double click shows the same effect as with Adobe Reader.

With MS Edge, same effect as with PDF.js.

jbreiden commented 4 years ago

I took a look at the code. It looks like one can pretty easily remap U+0020 to an alternate glyph in the cidtogidmap. It's been five years since the last significant change, and my memory is terrible, but I'm confident we currently map everything down to a single "glyph" in the font. That slightly misleading code at line 549 of pdfrenderer.cpp is just filling out the 2 byte entries one byte at a time.
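The byte layout discussed above can be sketched like this. It assumes a full map with one 2-byte, high-byte-first entry for each of the 65536 possible CIDs; `BuildCidToGidMap` and `RemapCid` are hypothetical helpers for illustration, not functions from pdfrenderer.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: the map covers every 2-byte CID (0x0000..0xFFFF).
constexpr std::size_t kCidCount = 1 << 16;

// Builds a CIDToGIDMap in which every CID maps to `default_gid`.
// Entry for CID c lives in bytes 2*c (high byte) and 2*c + 1 (low byte).
std::vector<unsigned char> BuildCidToGidMap(unsigned default_gid) {
  assert(default_gid <= 0xFFFF);
  std::vector<unsigned char> map(2 * kCidCount);
  for (std::size_t cid = 0; cid < kCidCount; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(default_gid >> 8);
    map[2 * cid + 1] = static_cast<unsigned char>(default_gid & 0xFF);
  }
  return map;
}

// Points a single CID at a different glyph index.
void RemapCid(std::vector<unsigned char> &map, unsigned cid, unsigned gid) {
  assert(2 * static_cast<std::size_t>(cid) + 1 < map.size());
  assert(gid <= 0xFFFF);
  map[2 * cid] = static_cast<unsigned char>(gid >> 8);
  map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);
}
```

With a default GID of 1 the result is the alternating 0x00, 0x01 byte pattern that the existing loop produces, and `RemapCid(map, 0x0020, gid)` touches only bytes 64 and 65.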

So then there's the question of adding another glyph to the font. The design notes from Ken say we've got an unused glyph at index 0. Unused because it gives heartburn to the Adobe parser. And then one at index one which is used everywhere. It's not quite trivial, but I don't yet see any reason we can't add another entry at index 2 that is identical or near to the entry in index 1. This means transforming the font to XML using ttx from fonttools, doing some careful copy pasting, transforming it back, and hoping nothing too scary happens.

Next there is the question of assigning the zero width (or near zero width) to just that new entry. As of right now, I'm not sure exactly how to do that. But I think Tilman's suggestion of adding a /W array to the /CIDFont dictionary is the first thing to try. (Currently line 526 in pdfrenderer.cpp). There's probably a spot inside the font as well to specify width, that we'll want to also set, for consistency, compatibility, and minimal confusion.

Finally, I already mentioned that the bounding box stretch can be computed without considering the U+0020, which is basically removing line 471 from pdfrenderer.cpp. After that - if it works at all - then just compatibility testing with various renderers.

I really don't know if this will work or not, but there's a chance, and it's my best suggestion for what to try. Might make sense to contact Ken Sharp and see if he has an opinion on the topic. Tilman, I know it's a lot of work but if you want to try this, you will probably get it done significantly faster than me. (Unlike 5 years ago, my day job does not currently intersect with PDF. That doesn't totally stop me, but it does slow things down quite a lot.)

THausherr commented 4 years ago

Thanks for the nice comment; my problem is that I haven't done C/C++ for almost 10 years except maintenance of my existing software. I don't even have a dev system up that supports current language standards, so I would have to install / understand / learn that first. However, I'll keep this issue in mind when I have more time at work (because this is a work issue).

amitdo commented 4 years ago

@jbarlow83, maybe you can help us here.

jbreiden commented 4 years ago

I'll spend a little time right now and see what I can do.

jbreiden commented 4 years ago

I tried the simplest thing possible, leaving the font alone and trying to use that glyph at index 0. I expected Adobe Reader to completely choke, and Pdfium/Chrome to work great. Instead, my ancient copy of Adobe Reader 9.5.5 (i.e. the one for Linux) works fine. However, Pdfium/Chrome is highlighting beyond the end of the word. That's what you would expect if Pdfium was ignoring the zero width on index 0.

--- pdfrenderer.cpp.orig    2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 11:18:40.578544848 -0800
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"

zvezdochiot commented 4 years ago

Alternative: (tesseract hocr) + (hocr-pdf (https://github.com/ImageProcessing-ElectronicPublications/hocr-tools)).

jbreiden3 commented 4 years ago

Tried modifying the font to add a specific entry for U+0020. Same results, Adobe good, pdfium bad. This is the point where I pause, and people take a look for mistakes. If nobody finds anything, the next step is probably asking for help. That's Ken Sharp about the overall approach & especially the font, and Pdfium folks to help debug why the /W entry does not appear to be honored.

--- pdfrenderer.cpp.orig    2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 12:00:57.961541649 -0800
@@ -468,7 +468,6 @@
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch =
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"
+    "  /W [ 1 [500 1] ]\n"
     ">>\n"
     "endobj\n";
   AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
   const std::unique_ptr<unsigned char[]> cidtogidmap(
       new unsigned char[kCIDToGIDMapSize]);
   for (int i = 0; i < kCIDToGIDMapSize; i++) {
-    cidtogidmap[i] = (i % 2) ? 1 : 0;
+    cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
   }
+  const int kSpaceCID = 20;
+  cidtogidmap[kSpaceCID * 2] = 0x00;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0x02;
   size_t len;
   unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
   stream.str("");

debug.pdf font.zip

jbarlow83 commented 4 years ago

@amitdo I will look.

I'd consider using a separate Tz for the trailing space rather than modifying the font.

1.0 Tz [ <0049006E00740072006F00640075006300740069006F006E> ] TJ 0.001 Tz [ <0020> ] TJ

Seems like it would be simpler and less reliant on fonts being parsed correctly.

However I do think some artifact of the glyphlessfont is causing trouble, since using a hidden Arial (e.g. the hOCR transform method) does not have these problems for the same content stream.

THausherr commented 4 years ago

The /W entry as it is now (screenshot) means CID 1 has a width of 500 and CID 2 has a width of 1. I assume that all others have the default width (500). If you wanted to change the width of the space, then you should have done something for CID 32.
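To make that reading concrete, here is a small sketch of a width lookup over the `c [w1 w2 ...]` form of /W, falling back to /DW for any CID not listed. It deliberately ignores the other `c_first c_last w` form, and `WidthRun` and `GlyphWidth` are made-up names, not part of any PDF library:

```cpp
#include <cassert>
#include <vector>

// One "c [w1 w2 ... wn]" group from a /W array: widths for the
// consecutive CIDs first_cid, first_cid + 1, ...
struct WidthRun {
  unsigned first_cid;
  std::vector<int> widths;
};

// Returns the width for `cid`, or `default_width` (/DW) if no run covers it.
int GlyphWidth(const std::vector<WidthRun> &w, int default_width,
               unsigned cid) {
  assert(cid <= 0xFFFF);  // assumption: 2-byte codes, as in this font
  for (const WidthRun &run : w) {
    if (cid >= run.first_cid && cid - run.first_cid < run.widths.size()) {
      return run.widths[cid - run.first_cid];
    }
  }
  return default_width;
}
```

Under this reading, `/W [ 1 [500 1] ]` leaves CID 32 at the default 500, whereas `/W [ 32 [0] ]` gives the space a width of 0 and leaves everything else at 500.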

jbreiden3 commented 4 years ago

You are correct. Result works on both Acroread & Pdfium. File attached and ready for compatibility testing. If nobody finds trouble, I'm comfortable submitting. This variant makes no changes to the font, and sets the width of space to zero.

--- pdfrenderer.cpp.orig    2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 13:26:33.743553816 -0800
@@ -468,7 +468,6 @@
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch =
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"
+    "  /W [ 32 [0] ]\n"
     ">>\n"
     "endobj\n";
   AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
   const std::unique_ptr<unsigned char[]> cidtogidmap(
       new unsigned char[kCIDToGIDMapSize]);
   for (int i = 0; i < kCIDToGIDMapSize; i++) {
-    cidtogidmap[i] = (i % 2) ? 1 : 0;
+    cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
   }
+  const int kSpaceCID = 0x0020;
+  cidtogidmap[kSpaceCID * 2] = 0x00;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0x00;
   size_t len;
   unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
   stream.str("");

testme1.pdf

jbreiden3 commented 4 years ago

@jbarlow83 The problem with hidden Arial is coverage. Tesseract supports the entire basic multilingual plane and beyond. The glyphless font is equally happy with Cherokee and English.

amitdo commented 4 years ago

Chromium, Evince - the page looks good. Firefox - no effect, the issue still exists.

THausherr commented 4 years ago

Thank you, I just tested with "testme1.pdf" and double clicking "Introduction" on Firefox works fine. But double clicking "digital" highlights at the wrong place, and "enable" even more so, though this may be a Firefox bug. I think I reported such a bug myself a long time ago. PDFBox shows correct glyph bounds.

Chrome and Edge work fine.

However PDFBox reports a warning "No glyph for code 32 (CID 0020) in font GlyphLessFont", which didn't happen with the original file. But this is just a minor inconvenience.

jbreiden3 commented 4 years ago

Firefox has never worked well with Tesseract PDF. Does this change make it worse?

https://github.com/mozilla/pdf.js/issues/6509

https://github.com/mozilla/pdf.js/issues/6863

I suppose we should also check samples in vertical Japanese, right-to-left Arabic, and bidirectional text to see if there are any regressions. Plus PDF parsers running on non-Linux operating systems.

amitdo commented 4 years ago

> Firefox has never worked well with Tesseract PDF. Does this change make it worse?

No, at least not with the tested document.

amitdo commented 3 years ago

When testing the PDF file produced by Tesseract with a PDF viewer, the tester should test:

bbqf commented 3 years ago

@amitdo Looking for a solution to this problem I found this issue, so I applied the patch above and there it is, the solution is working. All the tests you described are positive, at least on the couple of documents I checked it on: a simple document as well as a form with tricky formatting. In all of the cases the text is extracted correctly. The way the text is overlaid is better than without the patch. It's not perfect though: I found a couple of places where it still misses the glyph boundaries (mostly around the letter "i"), but overall it's a huge improvement. I tested it with Acrobat as well as with the Chrome built-in viewer on Windows.

amitdo commented 3 years ago

@bbqf

Thanks for reporting.

amitdo commented 3 years ago

We still didn't hear from Mac users. How well does this patch work with macOS Preview?

jbarlow83 commented 3 years ago

This patch unfortunately does not improve results on macOS Preview (Preview 10.1, macOS 10.14.6). Assuming I compared the right files. I did not apply the patch.

Visually, it's better:

Without patch (scan-ocr.pdf):

image

With patch (testme1.pdf):

image

However it removes spaces from the copy and paste text:

Without patch (scan-ocr.pdf):

programming language, PDF is based ona structured binary file format that is optimized for high performance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactive viewing and document interchange.

With patch (testme1.pdf):

programminglanguage,PDFisbasedona structuredbinaryfileformatthatisoptimizedforhighperformance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of thepagecontentitselfbutareusefulforinteractiveviewinganddocumentinterchange.

I compared the previously uploaded files indicated above without applying the patch.

egorpugin commented 3 years ago

Definitely looks like an off-by-one bug.

amitdo commented 3 years ago

Maybe it does not like the zero width space, and it will honor a 1 unit width.

shadylpstan commented 2 years ago

Hi, do we have any update on this issue?

bbqf commented 1 year ago

I'd love to contribute and finally get the fix released, but I have no access to a Mac, and as I reported earlier, the fix works for me on Windows. Is there a cross-platform way to test it? I am fine with Linux/Docker/VMs but I can't help with Mac.

jbarlow83 commented 1 year ago

@bbqf It's possible to set up a VM with a macOS guest on Windows or Linux, e.g. https://www.makeuseof.com/tag/macos-windows-10-virtual-machine/ I can often be persuaded to test new files and I have access to all platforms.

amitdo commented 1 year ago

@jbarlow83,

https://github.com/tesseract-ocr/tesseract/issues/2879#issuecomment-583892594

Can you please try to implement your suggestion and test it?

arifd commented 1 year ago

Hello!

I would like to implement this fix. Since we feed the PDFs Tesseract generates into Poppler, it's not a problem if it breaks behaviour on other PDF renderers.

But I don't want to maintain a fork of Tesseract and have to compile it myself. So my idea was to extract the essence of the fix and apply it after the fact to the PDFs that Tesseract generates. However, I am not having success. Perhaps someone can advise me on exactly the mutation I need to carry out on the PDF in order to benefit from this fix?

I have tried: (code examples are in Rust)

Neither option results in any visible difference in the PDF for me. Can anyone advise?

jbarlow83 commented 1 year ago

At this point I believe improvements would come from having Tesseract generate tagged PDFs with structural markup that indicates word boundaries. @arifd that would mean implementing section 14.8 of the PDF RM.

westner commented 5 months ago

Any chance that this issue will be fixed sometime, after 3 years?

egorpugin commented 5 months ago

Try this file. scan.pdf

If word boundaries look "kinda" ok, I'll commit this off-by-one fix. PDF viewers do not highlight the space, so we should not add it to the word.

index 6b9e248d..fb4bcd2f 100644
--- a/src/api/pdfrenderer.cpp
+++ b/src/api/pdfrenderer.cpp
@@ -466,7 +466,6 @@ char *TessPDFRenderer::GetPDFTextObjects(TessBaseAPI *api, double width, double
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch = kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));

westner commented 5 months ago

In my opinion it looks good now.

THausherr commented 5 months ago

It's an improvement.

amitdo commented 5 months ago

This patch was rejected before. See #3139.

jbarlow83 commented 5 months ago

It works significantly better to output the word and space separately, and use horizontal scaling to calculate the width of the space so it falls exactly between the end of the current word and the beginning of the next word.

The "words mixed together" issue happens because the actual position of the space will overlap the word boxes instead of being between them, especially if the word is particularly wide or narrow (wwwwwww vs iiiiii). So the PDF renderers are accurately reporting what they "see".
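A rough sketch of that idea, under the same fixed-width assumption as the glyphless font (500 units per character). This is only an illustration of the approach described here, not OCRmyPDF's or Tesseract's code; `WordSpaceScaling` and `ScaleWordAndSpace` are hypothetical names, and all coordinates are in points along the baseline:

```cpp
#include <cassert>

constexpr int kCharWidth = 2;  // assumption: 1000 / 2 = 500-unit advance

struct WordSpaceScaling {
  double word_tz;   // Tz (percent) for the word's own characters
  double space_tz;  // Tz (percent) for the single space that follows
  bool emit_space;  // false when the next word leaves no room for a space
};

// Scales the word to fill its own box, then scales one space so that it
// fills exactly the gap between this word and the next one.
WordSpaceScaling ScaleWordAndSpace(double word_left, double word_right,
                                   double next_left, double font_size,
                                   int char_count) {
  assert(font_size > 0.0 && char_count > 0 && word_right > word_left);
  auto tz = [font_size](double width, int count) {
    return kCharWidth * 100.0 * width / (font_size * count);
  };
  const double gap = next_left - word_right;
  WordSpaceScaling s;
  s.word_tz = tz(word_right - word_left, char_count);
  s.emit_space = gap > 0.0;  // overlapping boxes: skip the space entirely
  s.space_tz = s.emit_space ? tz(gap, 1) : 0.0;
  return s;
}
```

The word and the space would then be written with separate Tz operators, so the space never sits on top of either word box.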

I implemented this in OCRmyPDF's hOCR based renderer. I won't have time to add it to Tesseract for a few months but that is how to move forward.

The next thing to do, I believe, is to add a double-width character and a negative-displacement character to the GlyphlessFont, to better handle Asian and RTL scripts respectively.