tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Arabic language (right to left in writing) stored (left to right) after create PDF Searchable #238

Open tbadran opened 8 years ago

tbadran commented 8 years ago

I tested the latest release, 3.05, on Windows to OCR an Arabic document into a searchable PDF. When I select text in the output PDF, it turns out to be stored reversed (left to right), although the letters should be stored right to left.

For example, the original Arabic text مرحبا is stored in the PDF text layer as ابحرم.

roozgar commented 8 years ago

​please put your sample file and the command you used for ocr job​

tbadran commented 8 years ago

This is the command:

tesseract c:\temp\test_ara.jpg -l ara -psm 3 c:\temp\test_ara pdf

Files are attached (source JPG and output PDF)

test_ara test_ara.pdf

Please check: the original word أنحاء appears inside the PDF as ءاحنا.

tbadran commented 8 years ago

The command and samples are now attached to the previous comment.

amitdo commented 8 years ago

Which program are you using to view the PDF?

amitdo commented 8 years ago

It does not look reversed with the Chrome PDF viewer, just not very accurate...

roozgar commented 8 years ago

@amitdo Is there any way to get better accuracy for Arabic before the switch to the new engine? With Tesseract I currently get about 100% accuracy for English, but only about 30-40% for Arabic. For comparison, I checked Google Drive OCR for Arabic and it seems to give 100% results for the same image.

Can we work on the language data to get better results?

tbadran commented 8 years ago

I am using Adobe Reader. But please note that words are not reversed while viewing the PDF, because it contains the original image with a text layer. I mean that when you copy the text layer and paste it into any text editor it will be reversed, so I can't search for the text inside the PDF because it is stored reversed inside the text layer!

tbadran commented 8 years ago

This is a serious issue with the PDF output feature for Arabic and similar languages that are written from right to left.

amitdo commented 8 years ago

@roozgar

It seems that Ray is planning to release a new version of Tesseract soon, which will include a new OCR engine based on LSTM.

With LSTM, OCR for printed Arabic (not real handwriting) can reach 95% character accuracy.

"Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks" http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.447.4577&rep=rep1&type=pdf

amitdo commented 8 years ago

I checked google drive ocr for Arabic and i see it have 100 results for same image..

Neither you nor I know what programs they are using to do OCR there...

amitdo commented 8 years ago

@tbadran

But please note that words are not reversed while viewing the PDF, because it contains the original image with a text layer. I mean that when you copy the text layer and paste it into any text editor it will be reversed, so I can't search for the text inside the PDF because it is stored reversed inside the text layer!

Yes, I know...

Here is a copy of the invisible text layer (copied & pasted):

مداها ينم همهما اللغة العريية لغة جهد مه مسنره هي انحاء العالم

Using Chromium (Google browser) PDF viewer under Linux.

Your original jpg image: test_ara

jbreiden commented 8 years ago

I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken.

jbreiden commented 8 years ago

A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred.

jbreiden commented 8 years ago

Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere.

amitdo commented 8 years ago

@roozgar

You can try training Tesseract using the regular engine. Use the wiki and see #169. I really don't know how good the result will be for Arabic.

Like jbreiden said, the timeline could change...

tbadran commented 8 years ago

Please note that I am testing with the Windows binaries downloaded from http://domasofan.spdns.eu/tesseract/ and I am using Windows 10 with Acrobat Pro 11 to view the output PDF file.

tbadran commented 8 years ago

I have tested several different sample files, not only the sample uploaded above, and every time I get the same issue in the output PDF on Windows 10 + Acrobat Pro 11.

tfmorris commented 8 years ago

On OS X, I'm seeing the opposite of earlier reports:

tfmorris commented 8 years ago

Adobe Acrobat:

امهمه مني اهادم ةييرعلا ةغللا . هم دهج ةغل ملاعلا ءاحنا يه هرنسم

Google Chrome

مداها ينم همهما اللغة العريية لغة جهد مه مسنره هي انحاء العالم

amitdo commented 8 years ago

Tom,

Look at the original jpg. Lines 2 and 4 in Google Chrome look quite similar to lines 2 and 3 in the original jpg. First word in line 3 in the original jpg became first word in line 3 in Google Chrome. Clearly, that's the 'good' output...

amitdo commented 8 years ago

Again, in Google Chromium. If I mark the first two lines in the PDF + first word in line 3, copy the (invisible) text, paste it to a text file, mark the second to last word in line 3 in the PDF, copy the (invisible) text, paste it to the text file, I get:

مداها ينم همهما اللغة العريية لغة مسنره هي انحاء العالم

jbreiden commented 8 years ago

I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order.

jbreiden commented 8 years ago

There are two things I can think of doing. One is to give up and write Arabic backwards (which I really hate!). The other is to put an entry in the PDF metadata, Catalog/ViewerPreferences/Direction. Will continue thinking about this, slowly.
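As a rough illustration of the second option, the sketch below builds a catalog object that carries a `/ViewerPreferences` dictionary with a `/Direction` entry (the PDF 1.7 specification defines the values `/L2R` and `/R2L`). `CatalogWithDirection` and its parameters are hypothetical names, not Tesseract code, and whether any viewer takes this entry into account for copy-paste order is not established in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Tesseract): builds a PDF catalog object
// that declares the predominant reading order through /ViewerPreferences.
// PDF 1.7 defines /L2R and /R2L as the values of /Direction.
std::string CatalogWithDirection(long catalog_obj, long pages_obj, bool rtl) {
  assert(catalog_obj > 0 && pages_obj > 0);  // PDF object numbers start at 1
  std::string out;
  out += std::to_string(catalog_obj) + " 0 obj\n";
  out += "<<\n";
  out += "  /Type /Catalog\n";
  out += "  /Pages " + std::to_string(pages_obj) + " 0 R\n";
  out += std::string("  /ViewerPreferences << /Direction ") +
         (rtl ? "/R2L" : "/L2R") + " >>\n";
  out += ">>\n";
  out += "endobj\n";
  return out;
}
```

Calling `CatalogWithDirection(1, 2, true)` yields an object whose dictionary contains `/Direction /R2L`; the object numbers used here are placeholders.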

amitdo commented 8 years ago

@jbreiden I didn't understand you. In one comment you talk about Hebrew and in another you refer only to Arabic. Is Hebrew displayed correctly in Adobe Reader?

amitdo commented 8 years ago

Please make sure that any change you make does not cause a regression with the Chrome PDF viewer or OS X Preview. Thanks for your work!

jbreiden commented 8 years ago

@amitdo Hebrew has the exact same problem as Arabic.

amitdo commented 8 years ago

Maybe explicitly using Unicode bidi control characters can help?

jbreiden commented 8 years ago

That's another possibility, thanks for the suggestion.

amitdo commented 8 years ago

@jbreiden, any progress? Which approach did you choose? Personally, I care about our Hebrew support.

jbreiden commented 8 years ago

I am taking a look at this today. With current code, copy-paste works from Chrome, fails from Adobe Reader. Destination is gEdit. All tests are on Linux. I see no difference in Adobe Reader if I insert U+2067 RIGHT-TO-LEFT ISOLATE (RLI) at the beginning of each word, and U+2069 POP DIRECTIONAL ISOLATE (PDI) at the end of each word. It's possible that my copy of Adobe Reader is too old to understand these control characters. Or that I am using them wrong. Too early to tell.
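The experiment described above can be sketched in isolation as follows. This is a minimal illustration that assumes the word is already available as a sequence of Unicode code points; `WrapInRtlIsolate` is a hypothetical name and does not come from the Tesseract source.

```cpp
#include <vector>

constexpr char32_t kRLI = 0x2067;  // RIGHT-TO-LEFT ISOLATE
constexpr char32_t kPDI = 0x2069;  // POP DIRECTIONAL ISOLATE

// Hypothetical helper: returns a copy of `word` with RLI placed before the
// first code point and PDI after the last one. The word itself stays in
// reading (logical) order.
std::vector<char32_t> WrapInRtlIsolate(const std::vector<char32_t>& word) {
  std::vector<char32_t> out;
  out.reserve(word.size() + 2);
  out.push_back(kRLI);
  out.insert(out.end(), word.begin(), word.end());
  out.push_back(kPDI);
  return out;
}
```

Because the controls are ordinary code points, they would pass through the same UTF-16 encoding step as the letters. As reported above, this had no visible effect in the tested copy of Adobe Reader.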


jbreiden commented 8 years ago

The PDF 1.7 specification suggests using a left-to-right transformation matrix (Tm) while giving each character a negative width. A very crude experiment along these lines gives good results with Adobe Reader, but it messes up cosmetic highlighting in Chrome, and copy-paste is wrong with Evince. Please note that font metrics are inconsistent in this experiment.

In writing systems that are read from right to left (such as Arabic or Hebrew), one might expect that the glyphs in a font would have their origins at the lower right and their widths (rightward horizontal displacements) specified as negative.
[ .. then continues into a horrendous discussion of writing everything backwards ... ]

--- tesseract/api/pdfrenderer.cpp   2016-07-06 13:19:57.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2016-07-06 15:35:12.000000000 -0700
@@ -246,6 +246,7 @@
 void AffineMatrix(int writing_direction,
                   int line_x1, int line_y1, int line_x2, int line_y2,
                   double *a, double *b, double *c, double *d) {
+  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
   double theta = atan2(static_cast<double>(line_y1 - line_y2),
                        static_cast<double>(line_x2 - line_x1));
   *a = cos(theta);
@@ -527,7 +528,7 @@
                "endobj\n",
                5L,         // CIDToGIDMap
                7L,         // Font descriptor
-               1000 / kCharWidth);
+               - 1000 / kCharWidth);
   if (n >= sizeof(buf)) return false;
   AppendPDFObject(buf);

Chrome is unhappy f

heb.pdf

amitdo commented 8 years ago

@jbreiden The PDF 1.7 spec refers to:

Unicode Standard Annex #9, The Bidirectional Algorithm, Version 4.0.0

http://www.unicode.org/reports/tr9/tr9-11.html

Support for RLI and PDI has been added in Unicode 6.3. http://www.unicode.org/reports/tr9/tr9-29.html

jbreiden commented 8 years ago

I tried the other control characters, U+202B RIGHT-TO-LEFT EMBEDDING and U+202E RIGHT-TO-LEFT OVERRIDE. Even when sprinkled all over the place, neither had any effect with Adobe Reader 9. We still get incorrect copy-paste.

--- tesseract/api/pdfrenderer.cpp   2016-07-06 13:19:57.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2016-07-07 10:55:41.000000000 -0700
@@ -410,6 +410,9 @@
     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
     STRING pdf_word("");
     int pdf_word_len = 0;
+    pdf_word += "<202E>";
+    pdf_word_len++;
     do {
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != '\0') {

heb.pdf

jbreiden commented 8 years ago

Filed feature request with Adobe to recognize -1 0 0 1 X Y Tm. No idea if they will consider it.
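For reference, the `-1 0 0 1 X Y Tm` form is what a rotation matrix reflected over the Y axis reduces to for a horizontal baseline. The sketch below loosely follows the `AffineMatrix` logic shown in the patches in this thread, but `TextMatrixOperator`, its parameters, and the `%g` formatting are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// Normalizes negative zero so it prints as "0" rather than "-0".
static double NoNegativeZero(double v) { return v == 0.0 ? 0.0 : v; }

// Hypothetical helper: returns the PDF operator "a b c d x y Tm" for a
// baseline running from (line_x1, line_y1) to (line_x2, line_y2) in image
// coordinates. For right-to-left text the rotation is reflected over the
// Y axis by negating a and b.
std::string TextMatrixOperator(bool rtl, int line_x1, int line_y1,
                               int line_x2, int line_y2, double x, double y) {
  double theta = std::atan2(static_cast<double>(line_y1 - line_y2),
                            static_cast<double>(line_x2 - line_x1));
  double a = std::cos(theta);
  double b = std::sin(theta);
  double c = -std::sin(theta);
  double d = std::cos(theta);
  if (rtl) {
    a = -a;
    b = -b;
  }
  char buf[160];
  int n = std::snprintf(buf, sizeof(buf), "%g %g %g %g %g %g Tm",
                        NoNegativeZero(a), NoNegativeZero(b),
                        NoNegativeZero(c), NoNegativeZero(d), x, y);
  assert(n > 0 && n < static_cast<int>(sizeof(buf)));  // output not truncated
  return std::string(buf);
}
```

With a horizontal baseline, for example `TextMatrixOperator(true, 0, 100, 200, 100, 72, 700)`, the result is `-1 0 0 1 72 700 Tm`; the same call with `rtl` set to `false` gives `1 0 0 1 72 700 Tm`.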

Shreeshrii commented 7 years ago

There are a number of issues relating to RTL and Arabic. Can they all be labelled 'Arabic' so that they are easy to find and duplicate issues are not created?

https://github.com/tesseract-ocr/tesseract/issues?q=Arabic+in%3Atitle%2Cbody gives a list of the same.

amitdo commented 7 years ago

Hi @Shreeshrii!

Let's see...

#169: This is not an Arabic-specific issue, but an RTL issue. The reported issue was solved.

#212: A question, not an issue.

#238: PDF issue related to RTL. Not an Arabic-specific issue.

#294: 'Moved' to the tesseract-ocr/langdata issue reports.

#302: Seems to be solved.

#325: Original issue was solved.

#361: A broad complaint about bad RTL support.

#410: Not Arabic-specific. Can't be solved.

As said before, once the new LSTM code finally lands in Tesseract's public GitHub repo, the OCR accuracy for Arabic and Persian will be dramatically improved. Cube's code will be removed, so any issue with it will be irrelevant.

My conclusion: #238 is the only one in the list we should monitor.

The big question left is when we will see the Tesseract 4.0 code. Unfortunately, Ray has not yet shared any planned date with the Tesseract community :(

zdenop commented 7 years ago

Ray shared that he would like to have a public alpha version by the end of September.

stweil commented 7 years ago

That's good news. I promise that we'll give it a try as soon as it is available.

amitdo commented 7 years ago

@stweil,

we'll give it a try...

'We'? The @UB-Mannheim team I guess... :)

Shreeshrii commented 7 years ago

Thanks.


jbreiden commented 7 years ago

I'm currently in discussion with some Adobe folks about this topic.

mehmetaltuntas commented 7 years ago

hi, where can i get the arabic tessdata files? also, where do we get all other language files? thanks

Shreeshrii commented 7 years ago

ara.* from https://github.com/tesseract-ocr/tessdata (Version 3.02)

https://github.com/tesseract-ocr/langdata/tree/master/ara (Version 3.04)

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/tessdata

Download all ara.* Files for Arabic

Other language data files are also in same repository


Shreeshrii commented 7 years ago

The tesseract/langdata/ara repo has the 3.04 source files for Arabic language data.

The Arabic traineddata is based on the cube engine and is the 3.02 version.


amitdo commented 7 years ago

@jbreiden Did you find a solution?

roozgar commented 7 years ago

​is there any milestone to drop cube completely!?​

amitdo commented 7 years ago

​is there any milestone to drop cube completely!?​

This issue is not caused by cube.

See https://github.com/tesseract-ocr/tesseract/issues/40#issuecomment-263039665

jbreiden commented 7 years ago

The Adobe folks suggested a few things to try, none of which worked so far. Still open and (relatively) active.

jbreiden commented 7 years ago

Okay, this bug has been open forever. As mentioned before, most PDF files deal with right-to-left (RTL) languages like Hebrew and Arabic by laying out the characters from left to right (LTR) but doing it backwards. This offends my programming sensibilities on many levels, and I've resisted this approach. But maybe it is time to swallow pride and wallow in the mud. Here are a few examples from the test suite. How is compatibility for search and copy-paste?

Arabic ara.pdf

Single word Hebrew simplest.pdf

Hebrew + English heb_mivne.pdf

Hebrew + English, tilted heb-tilt.pdf

English (should be no change from what we do now) 2.pdf
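The "write it backwards" step can be sketched on its own as below. This is a simplified stand-in, not the actual patch: `AppendUtf16BeHex` and `EncodeWordForPdf` are hypothetical names, and the hex-encoded UTF-16BE output format is an assumption about how the word would be placed in a PDF hex string.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Appends one code point as hex-encoded UTF-16BE, using a surrogate pair
// above U+FFFF. Returns false for values that are not valid Unicode scalar
// values (surrogates or anything above U+10FFFF).
bool AppendUtf16BeHex(char32_t cp, std::string* out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
  char buf[9];
  if (cp < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(cp));
  } else {
    unsigned v = static_cast<unsigned>(cp) - 0x10000;
    std::snprintf(buf, sizeof(buf), "%04X%04X",
                  0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
  }
  *out += buf;
  return true;
}

// Hypothetical stand-in for the patched loop: takes a word's code points in
// reading order, reverses them when the word is right-to-left, and encodes
// the result. Invalid code points are skipped.
std::string EncodeWordForPdf(std::vector<char32_t> codepoints, bool rtl) {
  if (rtl) std::reverse(codepoints.begin(), codepoints.end());
  std::string hex;
  for (char32_t cp : codepoints) AppendUtf16BeHex(cp, &hex);
  return hex;
}
```

For the word مرحبا from the original report (U+0645 U+0631 U+062D U+0628 U+0627 in reading order), `EncodeWordForPdf(..., true)` produces `06270628062D06310645`, that is, the letters in visual left-to-right order.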

--- tesseract/api/pdfrenderer.cpp   2017-03-31 14:35:03.000000000 -0700
+++ tesseract/api/pdfrenderer.cpp   2017-04-21 10:16:23.000000000 -0700
@@ -225,14 +225,10 @@
 // left-to-right no matter what the reading order is. We need the
 // word baseline in reading order, so we do that conversion here. Returns
 // the word's baseline origin and length.
-void GetWordBaseline(int writing_direction, int ppi, int height,
+void GetWordBaseline(int ppi, int height,
                      int word_x1, int word_y1, int word_x2, int word_y2,
                      int line_x1, int line_y1, int line_x2, int line_y2,
                      double *x0, double *y0, double *length) {
-  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
-    Swap(&word_x1, &word_x2);
-    Swap(&word_y1, &word_y2);
-  }
   double word_length;
   double x, y;
   {
@@ -260,15 +256,12 @@
 }

 // Compute coefficients for an affine matrix describing the rotation
-// of the text. If the text is right-to-left such as Arabic or Hebrew,
-// we reflect over the Y-axis. This matrix will set the coordinate
+// of the text. This matrix will set the coordinate
 // system for placing text in the PDF file.
 //
-//                           RTL
-// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
-// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
-void AffineMatrix(int writing_direction,
-                  int line_x1, int line_y1, int line_x2, int line_y2,
+// [ x' ] = [ a b ][ x ] = [ cos sin ][ x ]
+// [ y' ]   [ c d ][ y ]   [-sin cos ][ y ]
+void AffineMatrix(int line_x1, int line_y1, int line_x2, int line_y2,
                   double *a, double *b, double *c, double *d) {
   double theta = atan2(static_cast<double>(line_y1 - line_y2),
                        static_cast<double>(line_x2 - line_x1));
@@ -276,17 +269,6 @@
   *b = sin(theta);
   *c = -sin(theta);
   *d = cos(theta);
-  switch(writing_direction) {
-    case WRITING_DIRECTION_RIGHT_TO_LEFT:
-      *a = -*a;
-      *b = -*b;
-      break;
-    case WRITING_DIRECTION_TOP_TO_BOTTOM:
-      // TODO(jbreiden) Consider using the vertical PDF writing mode.
-      break;
-    default:
-      break;
-  }
 }

 // There are some really awkward PDF viewers in the wild, such as
@@ -407,15 +389,14 @@
     {
       int word_x1, word_y1, word_x2, word_y2;
       res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
-      GetWordBaseline(writing_direction, ppi, height,
+      GetWordBaseline(ppi, height,
                       word_x1, word_y1, word_x2, word_y2,
                       line_x1, line_y1, line_x2, line_y2,
                       &x, &y, &word_length);
     }

     if (writing_direction != old_writing_direction || new_block) {
-      AffineMatrix(writing_direction,
-                   line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
+      AffineMatrix(line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
       pdf_str.add_str_double(" ", prec(a));  // . This affine matrix
       pdf_str.add_str_double(" ", prec(b));  // . sets the coordinate
       pdf_str.add_str_double(" ", prec(c));  // . system for all
@@ -459,23 +440,34 @@
     bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
     STRING pdf_word("");
     int pdf_word_len = 0;
+    GenericVector<int> unicodes;
+
+    // Gather up unicode codepoints for the word
     do {
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != '\0') {
-        GenericVector<int> unicodes;
         UNICHAR::UTF8ToUnicode(grapheme, &unicodes);
-        char utf16[kMaxBytesPerCodepoint];
-        for (int i = 0; i < unicodes.length(); i++) {
-          int code = unicodes[i];
-          if (CodepointToUtf16be(code, utf16)) {
-            pdf_word += utf16;
-            pdf_word_len++;
-          }
-        }
       }
       delete []grapheme;
       res_it->Next(RIL_SYMBOL);
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
+
+
+    // Use primitive "write it backwards" approach for RTL languages
+    if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
+      unicodes.reverse();
+    }
+
+    // Write out the word the way PDF likes it
+    char utf16[kMaxBytesPerCodepoint];
+    for (int i = 0; i < unicodes.length(); i++) {
+      int codepoint = unicodes[i];
+      if (CodepointToUtf16be(codepoint, utf16)) {
+        pdf_word += utf16;
+        pdf_word_len++;
+      }
+    }
+
     if (word_length > 0 && pdf_word_len > 0 && fontsize > 0) {
       double h_stretch =
           kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));