Closed THausherr closed 1 month ago
Interesting suggestion. If correct, why would it show up as an n - 1 problem in highlighting?
Sorry, I don't understand what you mean. My argument is that the highlight widths don't match. Adobe gets these from the font data, and widths differ in a proportional font. And it isn't just the "n". When trying to highlight the "I" it looks like this:
The glyphless font deliberately uses an equal width for every character. I stretch the word using Tz in the PDF to make it fit. So I expect word highlighting to look correct, but not character highlighting within a word. This design was chosen to maximize compatibility across all the scripts supported by Tesseract while minimizing complexity.
I had a look with the glyph contour display of PDFBox and there it matches the word bounds:
So maybe Adobe is to blame, but users will of course see this differently :-(
I think I found a bit more... "Introduction" has 12 characters but looks like this in the PDF content stream:
1 0 0 1 77.76 738.16 Tm /f-0-0 11 Tf 107.076 Tz [ <0049006E00740072006F00640075006300740069006F006E0020> ] TJ
this is 13 characters. The last one (0020) is a space. This space is positioned over the final "n".
When removing "3 Tr" so that the "invisible" font gets visible, it looks like this:
This is really 13 characters. For some reason, Adobe doesn't want to mark the final space.
I just noticed that the PDFBox screenshot shows it too: "ISO" has 4 characters, "32000" has 6 characters.
Maybe the original idea was to put the space there for text extraction? However, it isn't needed; good text extractors "imagine" the space from the position differences.
If the space character is needed, then it should be positioned over the actual space.
Thanks, after reading that one, I think this issue is also somewhat duplicate of https://github.com/jbarlow83/OCRmyPDF/issues/450 .
You should check the bounding box of the whole word 'Introduction' with the hocr format. Does it also end before the last glyph?
Tesseract's recognizer just finds words, and doesn't tell us anything about spaces. Which makes sense: how would an OCR program know if there is one space, two spaces, etc.? We add the space during PDF generation to help some viewers with copy-paste; otherwise it is common for words to run together. Apple's viewer is notorious for this. I'm a little reluctant to put a space outside the word bounding box: there is no guarantee there will be room for it, and I don't really want the PDF output module to get into the layout analysis game. One possibility might be to play with the font so that U+0020 gets zero (or near-zero) width, while every other character keeps the same fixed width we've always had. Then adjust the Tz word stretch appropriately.
https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp#L471
I haven't touched the font in a while, so not sure how easy it is to make a change like this. If you want to play with this yourself, I recommend using the program "ttx" from fonttools to transform the font into an XML file. Edit the file, then transform it back. I have a feeling it won't be trivial but it might be possible. See also the design discussion at the top of pdfrenderer.cpp, which explains how everything works.
Yeah, I understand that this feature was implemented to "help" low-quality text extractors.
How about making the feature configurable for PDF? IMHO the majority user expectation is that whatever Adobe does is the gold standard.
A zero-width space also sounds like an interesting idea to explore. You probably have to add appropriate /W entries.
(The reason I created this issue: we're using a commercial OCR tool on a project that is growing fast. The OCR quality is fine, but licensing is a pain, it doesn't use all CPU cores, and the logging is almost non-existent; the whole thing is a black box. So I was thinking about replacing it with Tesseract, but before we discuss this with the client I need to be sure that the client would be satisfied, and its clients too.)
@amitdo The bounding box is correct:
<div class='ocr_carea' id='block_1_2' title="bbox 324 400 643 442">
<p class='ocr_par' id='par_1_2' lang='eng' title="bbox 324 400 643 442">
<span class='ocr_line' id='line_1_2' title="bbox 324 400 643 442; baseline 0 -1; x_size 47.393444; x_descenders 6.3934426; x_ascenders 11">
<span class='ocrx_word' id='word_1_3' title='bbox 324 400 643 442; x_wconf 95'>Introduction</span>
</span>
</p>
</div>
Adobe Acrobat is not as popular as it used to be 10 years ago.
Default PDF viewers:
So most users will use the OS's or browser's built-in PDF viewer, which is not Adobe's viewer.
The best solution is to find a method that works in all these viewers, without a special parameter for a specific viewer.
I tested your PDF file with Chromium (pdfium), Firefox (pdf.js) and Evince (poppler).
The word bounding boxes look very good when the page is viewed with pdfium/pdf.js.
Poppler suffers from the same issue you raised above combined with a 'zebra effect'.
With PDF.js on firefox, double click marks the whole word, when I mark the final "n", I get a space.
With Chrome, double click shows the same effect as with Adobe Reader.
With MS Edge, same effect as with PDF.js.
I took a look at the code. It looks like one can pretty easily remap U+0020 to an alternate glyph in the cidtogidmap. It's been five years since the last significant change, and my memory is terrible, but I'm confident we currently map everything down to a single "glyph" in the font. That slightly misleading code at line 549 of pdfrenderer.cpp is just filling out the 2-byte entries one byte at a time.
So then there's the question of adding another glyph to the font. The design notes from Ken say we've got an unused glyph at index 0, unused because it gives heartburn to the Adobe parser, and then one at index 1 which is used everywhere. It's not quite trivial, but I don't yet see any reason we can't add another entry at index 2 that is identical or nearly identical to the entry at index 1. This means transforming the font to XML using ttx from fonttools, doing some careful copy-pasting, transforming it back, and hoping nothing too scary happens.
Next there is the question of assigning the zero width (or near-zero width) to just that new entry. As of right now, I'm not sure exactly how to do that, but I think Tilman's suggestion of adding a /W array to the /CIDFont dictionary is the first thing to try (currently line 526 in pdfrenderer.cpp). There's probably a spot inside the font as well to specify the width, which we'll also want to set for consistency, compatibility, and minimal confusion.
Finally, I already mentioned that the bounding box stretch can be computed without considering the U+0020, which basically means removing line 471 from pdfrenderer.cpp. After that, if it works at all, comes compatibility testing with various renderers.
I really don't know if this will work or not, but there's a chance, and it's my best suggestion for what to try. It might make sense to contact Ken Sharp and see if he has an opinion on the topic. Tilman, I know it's a lot of work, but if you want to try this, you will probably get it done significantly faster than me. (Unlike 5 years ago, my day job does not currently intersect with PDF. That doesn't totally stop me, but it does slow things down quite a lot.)
Thanks for the nice comment; my problem is that I haven't done C/C++ for almost 10 years except for maintenance of my existing software. I don't even have a dev system up that supports current language standards, so I would have to install / understand / learn that first. However, I'll keep this issue in mind when I have more time at work (because this is a work issue).
@jbarlow83, maybe you can help us here.
I'll spend a little time right now and see what I can do.
I tried the simplest thing possible, leaving the font alone and trying to use that glyph at index 0. I expected Adobe Reader to completely choke, and Pdfium/Chrome to work great. Instead, my ancient copy of Adobe Reader 9.5.5 (i.e. the one for Linux) works fine. However, Pdfium/Chrome is highlighting beyond the end of the word. That's what you would expect if Pdfium were ignoring the zero width on index 0.
--- pdfrenderer.cpp.orig	2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp	2020-02-09 11:18:40.578544848 -0800
@@ -535,6 +536,7 @@
 " /Subtype /CIDFontType2\n"
 " /Type /Font\n"
 " /DW " << (1000 / kCharWidth) << "\n"
Alternative: (tesseract hocr) + (hocr-pdf (https://github.com/ImageProcessing-ElectronicPublications/hocr-tools)).
Tried modifying the font to add a specific entry for U+0020. Same results: Adobe good, pdfium bad. This is the point where I pause and people take a look for mistakes. If nobody finds anything, the next step is probably asking for help: Ken Sharp about the overall approach and especially the font, and the Pdfium folks to help debug why the /W entry does not appear to be honored.
--- pdfrenderer.cpp.orig 2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 12:00:57.961541649 -0800
@@ -468,7 +468,6 @@
} while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
if (res_it->IsAtBeginningOf(RIL_WORD)) {
pdf_word += "0020";
- pdf_word_len++;
}
if (word_length > 0 && pdf_word_len > 0) {
double h_stretch =
@@ -535,6 +536,7 @@
" /Subtype /CIDFontType2\n"
" /Type /Font\n"
" /DW " << (1000 / kCharWidth) << "\n"
+ " /W [ 1 [500 1] ]\n"
">>\n"
"endobj\n";
AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
const std::unique_ptr<unsigned char[]> cidtogidmap(
new unsigned char[kCIDToGIDMapSize]);
for (int i = 0; i < kCIDToGIDMapSize; i++) {
- cidtogidmap[i] = (i % 2) ? 1 : 0;
+ cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
}
+ const int kSpaceCID = 20;
+ cidtogidmap[kSpaceCID * 2] = 0x00;
+ cidtogidmap[kSpaceCID * 2 + 1] = 0x02;
size_t len;
unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
stream.str("");
@amitdo I will look.
I'd consider using a separate Tz for the trailing space rather than modifying the font.
1.0 Tz [ <0049006E00740072006F00640075006300740069006F006E> ] TJ 0.001 Tz [ <0020> ] TJ
Seems like it would be simpler and less reliant on fonts being parsed correctly.
However, I do think some artifact of the glyphless font is causing trouble, since using a hidden Arial (e.g. the hOCR transform method) does not have these problems for the same content stream.
The /W entry as it is now means CID 1 has a width of 500 and CID 2 has a width of 1. I assume that all others have the default width (500). If you wanted to change the width of the space, you should have done something for CID 32.
You are correct. The result works on both Acroread & Pdfium. File attached and ready for compatibility testing. If nobody finds trouble, I'm comfortable submitting. This variant makes no changes to the font and sets the width of the space to zero.
--- pdfrenderer.cpp.orig 2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp 2020-02-09 13:26:33.743553816 -0800
@@ -468,7 +468,6 @@
} while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
if (res_it->IsAtBeginningOf(RIL_WORD)) {
pdf_word += "0020";
- pdf_word_len++;
}
if (word_length > 0 && pdf_word_len > 0) {
double h_stretch =
@@ -535,6 +536,7 @@
" /Subtype /CIDFontType2\n"
" /Type /Font\n"
" /DW " << (1000 / kCharWidth) << "\n"
+ " /W [ 32 [0] ]\n"
">>\n"
"endobj\n";
AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
const std::unique_ptr<unsigned char[]> cidtogidmap(
new unsigned char[kCIDToGIDMapSize]);
for (int i = 0; i < kCIDToGIDMapSize; i++) {
- cidtogidmap[i] = (i % 2) ? 1 : 0;
+ cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
}
+ const int kSpaceCID = 0x0020;
+ cidtogidmap[kSpaceCID * 2] = 0x00;
+ cidtogidmap[kSpaceCID * 2 + 1] = 0x00;
size_t len;
unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
stream.str("");
@jbarlow83 The problem with hidden Arial is coverage. Tesseract supports the entire basic multilingual plane and beyond. The glyphless font is equally happy with Cherokee and English.
Chromium, Evince - the page looks good. Firefox - no effect, the issue still exists.
Thank you, I just tested with "testme1.pdf" and double-clicking "Introduction" in Firefox works fine. But doing the same for "digital" highlights the wrong place, even more so for "enable", though this may be a Firefox bug; I think I reported such a bug myself a long time ago. PDFBox shows correct glyph bounds.
Chrome and Edge work fine.
However, PDFBox reports the warning "No glyph for code 32 (CID 0020) in font GlyphLessFont", which didn't happen with the original file. But this is just a minor inconvenience.
Firefox has never worked well with Tesseract PDF. Does this change make it worse?
https://github.com/mozilla/pdf.js/issues/6509
https://github.com/mozilla/pdf.js/issues/6863
I suppose we should also check samples in vertical Japanese, right-to-left Arabic, and bidirectional text to see if there are any regressions. Plus PDF parsers running on non-Linux operating systems.
Firefox has never worked well with Tesseract PDF. Does this change make it worse?
No, at least not with the tested document.
When testing the PDF file produced by Tesseract with a PDF viewer, the tester should test:
<space>
word).

@amitdo Looking for a solution to this problem I found this issue, so I applied the patch above, and the solution is working. All the tests you described are positive, at least on the couple of documents I checked: a simple document as well as a form with tricky formatting. In all cases the text is extracted correctly. The way the text is overlaid is better than without the patch. It's not perfect though; I found a couple of places where it still misses the glyph boundaries (mostly around the letter "i"), but overall it's a huge improvement. I tested it with Acrobat as well as with the Chrome built-in viewer on Windows.
@bbqf
Thanks for reporting.
We still haven't heard from Mac users. How well does this patch work with macOS Preview?
This patch unfortunately does not improve results on macOS Preview (Preview 10.1, macOS 10.14.6), assuming I compared the right files; I did not apply the patch myself.
Visually, it's better:
Without patch (scan-ocr.pdf):
With patch (testme1.pdf):
However, it removes the spaces from copy-and-pasted text:
Without patch (scan-ocr.pdf):
programming language, PDF is based ona structured binary file format that is optimized for high performance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of the page content itself but are useful for interactive viewing and document interchange.
With patch (testme1.pdf):
programminglanguage,PDFisbasedona structuredbinaryfileformatthatisoptimizedforhighperformance in interactive viewing. PDF also includes objects, such as annotations and hypertext links, that are not part of thepagecontentitselfbutareusefulforinteractiveviewinganddocumentinterchange.
I compared the previously uploaded files indicated above without applying the patch.
Definitely looks like a one-off bug.
Maybe it does not like the zero-width space and will honor a 1-unit width.
Hi, do we have any update on this issue?
I'd love to contribute and finally get the fix released, but I have no access to a Mac, and as I reported earlier, the fix works for me on Windows. Is there a cross-platform way to test it? I am fine with Linux/Docker/VMs, but I can't help with Mac.
@bbqf It's possible to set up a macOS guest VM on Windows or Linux, e.g. https://www.makeuseof.com/tag/macos-windows-10-virtual-machine/ I can often be persuaded to test new files, and I have access to all platforms.
@jbarlow83,
https://github.com/tesseract-ocr/tesseract/issues/2879#issuecomment-583892594
Can you please try to implement your suggestion and test it?
Hello!
I would like to implement this fix. Since we feed the PDFs Tesseract generates into Poppler, it's not a problem if it breaks behaviour on other PDF renderers.
But I don't want to maintain a fork of Tesseract and have to compile it myself, so my idea was to extract the essence of the fix and apply it after the fact to the PDFs that Tesseract generates. However, I am not having success. Perhaps someone can advise me on exactly the mutation I need to carry out on the PDF in order to benefit from this fix?
I have tried: (code examples are in rust)
for operation in &mut content.operations {
    // Only the text-showing operators carry glyph strings.
    if matches!(operation.operator.as_ref(), "Tj" | "TJ") {
        for operand in operation.operands.iter_mut() {
            if let Object::Array(arr) = operand {
                for obj in arr {
                    // TJ arrays mix strings and kerning numbers, so don't
                    // unwrap: skip anything that isn't a string.
                    if let Ok(bytes) = obj.as_str_mut() {
                        // Drop a trailing UTF-16BE space (0x00 0x20).
                        if bytes.ends_with(&[0x00, 0x20]) {
                            bytes.truncate(bytes.len() - 2);
                        }
                    }
                }
            }
        }
    }
}
// Give CID 32 (the space) a zero width via /W [ 32 [0] ]; all other
// CIDs keep the /DW default. Note that per the PDF spec, /W is only
// honored on the descendant CIDFontType2 dictionary, not on the
// Type 0 font that references it.
let fonts = doc
    .objects
    .iter()
    .filter_map(|(id, obj)| (obj.type_name() == Ok("Font")).then_some(id.to_owned()))
    .collect::<Vec<_>>();
for font in fonts {
    if let Ok(font) = doc.get_dictionary_mut(font) {
        let cid = Object::Integer(32);
        let widths = Object::Array(vec![Object::Integer(0)]);
        font.set(b"W".to_vec(), Object::Array(vec![cid, widths]));
    }
}
(Admittedly, I'm not sure what this part of the diff is doing.)
Neither options result in any visible difference in the PDF for me. Can anyone advise?
At this point I believe improvements would come from having Tesseract generate tagged PDFs with structural markup that indicates word boundaries. @arifd that would mean implementing section 14.8 of the PDF reference manual.
Any chance that this issue will be fixed sometime, after 3 years?
Try this file. scan.pdf
If word boundaries look "kinda" OK, I'll commit this one-off fix. PDF viewers do not highlight the space, so we should not add it to the word.
index 6b9e248d..fb4bcd2f 100644
--- a/src/api/pdfrenderer.cpp
+++ b/src/api/pdfrenderer.cpp
@@ -466,7 +466,6 @@ char *TessPDFRenderer::GetPDFTextObjects(TessBaseAPI *api, double width, double
} while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
if (res_it->IsAtBeginningOf(RIL_WORD)) {
pdf_word += "0020";
- pdf_word_len++;
}
if (word_length > 0 && pdf_word_len > 0) {
double h_stretch = kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
In my opinion it looks good now.
It's an improvement.
This patch was rejected before. See #3139.
It works significantly better to output the word and the space separately, and use horizontal scaling to calculate the width of the space so it falls exactly between the end of the current word and the beginning of the next word.
The "words mixed together" issue happens because the actual position of the space will overlap the word boxes instead of being between them, especially if the word is particularly wide or narrow ("wwwwwww" vs. "iiiiii"). So the PDF renderers are accurately reporting what they "see".
I implemented this in OCRmyPDF's hOCR-based renderer. I won't have time to add it to Tesseract for a few months, but that is how to move forward.
The other thing to do next, I believe, is to add a double-width character and a negative-displacement character to the GlyphlessFont, to better handle Asian and RTL scripts respectively.
Environment
Call:
"C:\Program Files\Tesseract-OCR\tesseract" scan.tif scan-ocr pdf
Current Behavior:
Text bounds are not identical to the visible glyphs in Adobe Reader. Example:
Expected Behavior:
Text bounds should be identical to the visible glyphs in Adobe Reader. In the graphic, the blue color should cover the "n".
Suggested Fix:
I suspect that the /W array is missing from the font dictionary:
So Adobe will use the /DW 500 entry (screenshot from the PDF 32000 specification):
![grafik](https://user-images.githubusercontent.com/6665575/73935364-a406ba80-48e0-11ea-9b76-f5481696c09c.png)
scan-ocr.pdf scan.tif.zip