tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

intraword spacing for slightly better pdf copy-paste performance #1900

Closed jbreiden closed 5 years ago

jbreiden commented 6 years ago

Environment

Tesseract Version: 4.x
Platform: all

Current Behavior:

Copy-paste from Tesseract PDF files works only so-so, especially in pdf.js and Apple Preview. Some words get merged together.

Expected Behavior:

More consistent performance.

Suggested Fix:

Add interword spaces, as @jbarlow83 does in OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF/pull/225

--- tesseract/api/pdfrenderer.cpp   2017-07-14 07:32:13.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2018-09-06 14:45:26.000000000 -0700
@@ -471,6 +471,10 @@
       }
       res_it->Next(RIL_SYMBOL);
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
+    if (res_it->IsAtBeginningOf(RIL_WORD)) {
+      pdf_word += "0020";
+      pdf_word_len++;
+    }
     if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
       double h_stretch =
           kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));

This is still at the idea stage. I still need to look at compatibility. Here are some examples if anyone wants to compare on their favorite PDF viewer. I will only use this if people report an improvement somewhere and no regression anywhere.
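To see why the extra glyph matters beyond the inserted character itself, note that the patched lines change pdf_word_len, which feeds the horizontal-stretch expression shown in the diff. Below is a minimal sketch of that arithmetic. It assumes kCharWidth is 2 and that prec() rounds to three decimal places; both are assumptions about the surrounding pdfrenderer.cpp code, which is not shown in this thread. The function name HorizontalStretch is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumption: stands in for the constant of the same name in pdfrenderer.cpp.
constexpr int kCharWidth = 2;

// Assumption: stands in for prec(); rounds to three decimal places.
double prec(double x) {
  return std::round(x * 1000.0) / 1000.0;
}

// Hypothetical helper mirroring the h_stretch expression from the diff.
//   word_length:  width of the word box on the page
//   fontsize:     font size, in the same units as word_length
//   pdf_word_len: number of code points counted for the word
double HorizontalStretch(double word_length, double fontsize, int pdf_word_len) {
  assert(fontsize > 0 && pdf_word_len > 0);
  return kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
}
```

Under these assumptions, a 5-glyph word with word_length 60 and fontsize 12 gives HorizontalStretch(60, 12, 5) = 200. Once the patch appends the space, the count becomes 6 and the value drops to 166.666, so each glyph, including the space, gets a smaller share of the same word box.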

samples.zip

Shreeshrii commented 6 years ago

hin.pdf - Windows 10

| Application | Control | Experiment | Comment |
| --- | --- | --- | --- |
| Chrome | correct display | correct display | both same |
| Microsoft Edge | गंभीर-हढ मुद्रा में | गंभी र-हढ मुद्रा में | Control OK (1 line in error, extra space before every EOL, in experiment) |
| Adobe Reader DC | गंभीर-हढ मुद्रा में | गंभी र-हढ मुद्रा में | Control OK (1 line in error in Experiment) |
| Opera | गंभीर-हढ मुद्रा में | गंभी र-हढ मुद्रा में | Control OK (2 lines in error in Experiment) |
| Foxit Reader 9.0 | गंभीर-हढ मुद्रा में | गंभी र-हढ मुद्रा में | Control OK (2 lines in error in Experiment) |
| Chrome pdf.js extension | only 4-5 spaces | all correct spaces | EXPERIMENT BETTER - no new line in either |

So, as expected, most of the change is in pdf.js.

CONTROL

1 पितानेविवाहकी।हो गईउद्विग्नवहसोचा,मैं कहचुकीहूं,पति,किसीव्यक्तिको,क्योंन भलेकहाहोखेलहो खेलमें ।पतितो वे बनचुकेकरसकतीअबकैसेमैं विवाहअन्यसे ?रखदियेआन्दोलितविच।रसमक्षमाता-पिताके ।मुस्कराकरबोलीमांभोलीहै बेटीतूछोडबाल-विचारोंको,बननहींनादानअब,हँसेगीदुनियाहोगीसिद्धमूर्खतासमाज-रिव्तेदारों पर।बोलीश्रीमतीगंभीर-हढमुद्रामेंमतकरोमां ! बाध्यमुझकरनेदूसराविवाह,रहेगापतिवहीमेराधारचुकीहूं जिसेमैं ¦वात्सल्यकी बेडी१२४

EXPERIMENT

1 पिताने विवाह की । हो गई उद्विग्न वह सोचा, मैं कह चुकी हूं, पति, किसी व्यक्ति को, क्यों न भले कहा हो खेल हो खेल में । पति तो वे बन चुके कर सकती अब कैसे मैं विवाह अन्य से ? रख दिये आन्दोलित विच।र समक्ष माता-पिता के । मुस्करा कर बोली मां भोली है बेटी तू छोड बाल-विचारों को, बन नहीं नादान अब, हँसेगी दुनिया होगी सिद्ध मूर्खता समाज-रिव्तेदा रों पर । बोली श्रीमती गंभी र-हढ मुद्रा में मत करो मां ! बाध्य मुझ करने दूसरा विवाह, रहेगा पति वही मेरा धार चुकी हूं जिसे मैं ¦ वात्सल्य की बेडी १२४

Shreeshrii commented 6 years ago

2.pdf - Windows 10 - PDF viewer extension in Chrome using pdf.js

chrome://extensions/?options=oemmndcbldboiebfnladdacbdfmadadm

CONTROL

1 EXPERIENCEANDRELATIVITY59componentsof therelativemotionsof thefixed—starswithrespectto theearthonthecolourofthelightreachingus fromthem.Thelattereffectmanifestsitselfin a slightdisplacementof thespectrallinesof thelighttransmittedto us froma fixedstar,as comparedwiththepositionof thesamespectrallineswhentheyareproducedbyaterrestrialsourceoflight(Dopplerprinciple).TheexperimentalargumentsinfavouroftheMaxwell—Lorentztheory,whichareat the;santetimeargumentsin favourof thetheoryof rela—tivity,aretoonumerousto be setforthhere.Inrealitytheylimitthetheoreticalpossibilitiestosuchan extent,thatno othertheorythanthatofMaxwellandLorentzhasbeenableto holditsownwhentestedby experience.Buttherearetwoclassesof experimentalfactshithertoobtainedwhichcanbe representedin theMaxwell—Lorentztheoryonlybytheintroductionof anauxiliaryhypothesis,whichin itselfwithoutmakinguseof thetheoryof relativity—appearsextraneous.It is knownthatcathoderaysandtheso—calledB—raysemittedbyradioactivesubstancesconsistof negativelyelectrifiedparticles(electrons)ofverysmallinertiaandlargevelocity.Byexamin—ingthedeflectionof theseraysundertheinfluenceof electricandmagneticfields,wecanstudythelawof motionof theseparticlesveryexactly.

EXPERIMENT

1 EXPERIENCE AND RELATIVITY 59 components of the relative motions of the fixed — stars with respect to the earth on the colour of the light reaching us from them. The latter effect manifests itself in a slight displacement of the spectral lines of the light transmitted to us from a fixed star, as compared with the position of the same spectral lines when they are produced by a terrestrial source of light (Doppler principle). The experimental arguments in favour of the Maxwell—Lorentz theory, which are at the; sante time arguments in favour of the theory of rela— tivity, are too numerous to be set forth here. In reality they limit the theoretical possibilities to such an extent, that no other theory than that of Maxwell and Lorentz has been able to hold its own when tested by experience. But there are two classes of experimental facts hitherto obtained which can be represented in the Maxwell—Lorentz theory only by the introduction of an auxiliary hypothesis, which in itself without making use of the theory of relativity — appears extraneous. It is known that cathode rays and the so—called B—rays emitted by radioactive substances consist of negatively electrified particles (electrons) of very small inertia and large velocity. By examin— ing the deflection of these rays under the influence of electric and magnetic fields, we can study the law of motion of these particles very exactly.

Jmuccigr commented 5 years ago

I checked the document 2.pdf in macOS Preview (v. 10.0), Adobe Acrobat Reader DC (Build 19.10.20069.311970), and Chromium (v. 71.0.3578.98, which seems to use the built-in PDF plugin), all on High Sierra (10.13.6).

In all three, the experiment showed the text-box clipping already mentioned elsewhere, where the selected text extends beyond the colored box, especially on the right (end) side of the word. For Adobe and Chromium, text copied from each version was identical. For Preview the control text showed several places where spaces were missing. The experiment did not.

Preview also removed line endings that the other two left in. Whether or not that's a good thing might be a matter of personal preference. (It did a pretty good job at detecting the paragraphs, but left in the end-of-line hyphenation with trailing space.)

Preview also shows in both cases an odd effect when I did a "select all." See the image. This didn't seem to affect the text output.

[Image: screenshot 2019-01-30 08 33 20]

Jmuccigr commented 5 years ago

So?

jbarlow83 commented 5 years ago

PDF RM 14.8.2.5 seems to justify the insertion of spaces (U+0020) because it says:

> The conforming reader does not need to guess about word breaks based on information such as glyph positioning on the page, font changes, or glyph sizes.

> In particular, a SPACE (U+0020) or other word-breaking character is still needed even if a word break happens to fall at the end of a show string.

pdfrenderer uses a show string for each word ([ <...utf16be encoded...> ] TJ) so this implies there should be a space.

> Some conforming readers may identify words by simply separating them at every SPACE character. Others may be slightly more sophisticated and treat punctuation marks such as hyphens or em dashes as word separators as well. Still others may identify possible line-break opportunities by using an algorithm similar to the one in Unicode Standard Annex #29, Text Boundaries, available from the Unicode Consortium (see the Bibliography).

Without spaces we are asking the PDF reader to guess about word breaks.
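As a rough illustration of what such a show string looks like with the word break included, the sketch below hex-encodes a word as UTF-16BE and optionally appends U+0020. The helper names Utf16BeHex and ShowString are hypothetical and are not taken from pdfrenderer.cpp; only the `[ <...> ] TJ` shape comes from the comment above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: hex-encode Unicode code points as UTF-16BE,
// emitting a surrogate pair for code points above U+FFFF.
std::string Utf16BeHex(const std::u32string& text) {
  std::string out;
  char buf[9];  // room for two 4-digit code units plus the terminator
  for (char32_t cp : text) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
      std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(cp));
    } else {
      std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
      unsigned high = 0xD800 + (v >> 10);    // high surrogate
      unsigned low = 0xDC00 + (v & 0x3FF);   // low surrogate
      std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
    }
    out += buf;
  }
  return out;
}

// Hypothetical helper: build one show string for a word, optionally
// ending it with an explicit SPACE (U+0020) as the word break.
std::string ShowString(const std::u32string& word, bool add_space) {
  std::string hex = Utf16BeHex(word);
  if (add_space) hex += "0020";
  return "[ <" + hex + "> ] TJ";
}
```

With this sketch, ShowString(U"the", true) yields `[ <0074006800650020> ] TJ`, so a reader that splits on SPACE finds the break in the text itself rather than inferring it from glyph positions.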

If I understand correctly the code above places the space before a word, not after? Maybe it's better to place the space after the word, unless the word is last on the line.

Jmuccigr commented 5 years ago

Any progress on this?

Shreeshrii commented 5 years ago

> If I understand correctly the code above places the space before a word, not after? Maybe it's better to place the space after the word, unless the word is last on the line.

@jbarlow83 It would be great if you can submit a PR.

jbreiden commented 5 years ago

My intent was the thing you expect - just spaces between words in a line. Nothing before first word, nothing after last.
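That stated intent can be sketched independently of Tesseract's iterator API. The function below is a hypothetical illustration, not the patch itself: it takes the words of one line as already hex-encoded strings and appends "0020" to every word except the last, so nothing precedes the first word and nothing follows the last.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: add a hex-encoded SPACE (U+0020) after every word
// of a line except the final one. Each element is one word's hex string.
std::vector<std::string> AddInterwordSpaces(std::vector<std::string> line_words) {
  // "i + 1 < size()" skips the last word and is safe for an empty line.
  for (std::size_t i = 0; i + 1 < line_words.size(); ++i) {
    assert(!line_words[i].empty());  // a word should carry at least one code point
    line_words[i] += "0020";
  }
  return line_words;
}
```

A single-word line comes back unchanged, which matches the "nothing after last" rule.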

Shreeshrii commented 5 years ago

@jbreiden I suggest then that we should apply your patch.

dohoho commented 2 years ago

Hi @jbreiden, I applied your patch for this issue, but I only incremented pdf_word_len and did not append the extra space to each word, like this:

if (res_it->IsAtBeginningOf(RIL_WORD)) {
  // pdf_word += "0020";
  pdf_word_len++;
}

With this change the copy-paste problem was fixed for me, so I think the root cause is the Tz (horizontal scaling) value.
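One way to read this result, offered as a hypothesis rather than a confirmed diagnosis: counting one more code point than is emitted lowers Tz while the string keeps its original length, so the invisible text ends short of the word box and leaves a gap that a viewer's spacing heuristic can pick up. The sketch below works through that arithmetic. It assumes each glyph of the invisible font advances by fontsize / kCharWidth with kCharWidth = 2, which is an assumption about Tesseract's glyphless font, and it omits the rounding step. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumption: same role as the constant of this name in pdfrenderer.cpp.
constexpr int kCharWidth = 2;

// Tz percentage as in the diff's h_stretch expression, without rounding.
double TzPercent(double word_length, double fontsize, int counted_len) {
  assert(fontsize > 0 && counted_len > 0);
  return kCharWidth * 100.0 * word_length / (fontsize * counted_len);
}

// Width covered by the glyphs that are actually emitted.
//   emitted_glyphs: code points written into the show string
//   counted_len:    value of pdf_word_len used to compute Tz
double RenderedWidth(double word_length, double fontsize,
                     int emitted_glyphs, int counted_len) {
  double advance = fontsize / kCharWidth;  // assumed unscaled glyph advance
  return emitted_glyphs * advance *
         TzPercent(word_length, fontsize, counted_len) / 100.0;
}
```

For a 5-glyph word with word_length 60 and fontsize 12, RenderedWidth(60, 12, 5, 5) is 60, so the text fills the box. With the counter incremented but no space emitted, RenderedWidth(60, 12, 5, 6) is 50, leaving a gap one glyph slot wide. With the original patch, RenderedWidth(60, 12, 6, 6) is 60 again, because the space glyph occupies that slot.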