linmaogithub / pdfium

Automatically exported from code.google.com/p/pdfium

trouble extracting 90 degree rotated text #199

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Open fails.pdf and works.pdf in chrome
2. Select all text using ctrl-a

What is the expected output? What do you see instead?

Expected: both files show a small highlighted area (see works.png)
Actual: fails.pdf does not highlight anything (see fails.png)

These files contain a single line of rotated text. However, the text in 
fails.pdf is exactly vertical, while the text in works.pdf is ALMOST vertical.

   0 -1 1 0 200 500 Tm            # fails.pdf
   0.001 -1 1 0 200 500 Tm        # works.pdf

What version of the product are you using? On what operating system?

 Google Chrome 44.0.2403.157 on Ubuntu Linux

 I don't know the PDFium version I used for programmatic text extraction, but it was updated in the last week.

Please provide any additional information below.

I've done some experimentation to try to isolate what is going on. 

You'll notice that there is a weird invisible font in the PDF. If I take out 
that special font, the problem doesn't reproduce. (see does-not-reproduce.pdf)

I've also compared shorter and longer lines of text. Shorter text requires more 
tilt to work. This gives some insight as to what might be going on.

  0.001 -1 1 0 200 500 Tm 187.2 Tz [ (Hello) ] TJ    # Works
  0.001 -1 1 0 200 500 Tm 187.2 Tz [ (Hi) ] TJ       # Fails
  0.002 -1 1 0 200 500 Tm 187.2 Tz [ (Hi) ] TJ       # Works

I have also run some manual text extraction commands with PDFium and they have 
similar results. This suggests Chrome is innocent and the problem reproduces 
directly in PDFium.

    FPDF_InitLibrary();
    FPDF_DOCUMENT pdf = FPDF_LoadDocument(pdf_filename.c_str(), NULL);
    FPDF_PAGE page = FPDF_LoadPage(pdf, 0);
    FPDF_TEXTPAGE textpage = FPDFText_LoadPage(page);
    int chars_count = FPDFText_CountChars(textpage);
    for (int i = 0; i < chars_count; i++) {  // valid indices are 0..chars_count-1
       unsigned int c = FPDFText_GetUnicode(textpage, i);
       fprintf(stderr, "%u ", c);  // print each code point, space-separated
    }
    FPDFText_ClosePage(textpage);
    FPDF_ClosePage(page);
    FPDF_CloseDocument(pdf);
    FPDF_DestroyLibrary();

Finally, poppler has no trouble extracting the text, using the command pdftotext.

Why does this matter? There are OCR programs that are working with vertical 
text (see ja-vert.pdf) and are producing PDF results that do not work well with 
PDFium.  As someone who works on these OCR programs, I have 100% control over 
how the PDF is being produced, so if you think that the PDF itself is at fault, 
please let me know. Otherwise, please see if PDFium can be improved.

Original issue reported on code.google.com by breidenb...@gmail.com on 4 Sep 2015 at 12:01

GoogleCodeExporter commented 8 years ago

Original comment by breidenb...@gmail.com on 4 Sep 2015 at 12:02

Attachments:

GoogleCodeExporter commented 8 years ago
Here are the works/fails screenshots. (Ignore the incorrect one in the previous 
comment.)

Original comment by breidenb...@gmail.com on 4 Sep 2015 at 12:06

Attachments:

GoogleCodeExporter commented 8 years ago

Original comment by thestig@chromium.org on 4 Sep 2015 at 11:22

GoogleCodeExporter commented 8 years ago
FPDFText_CountChars() is returning 0 in the failure case.

Original comment by jbrei...@google.com on 4 Sep 2015 at 5:35

GoogleCodeExporter commented 8 years ago
I think the problem is in this file, which is full of thresholds like 0.01 and 
0.001 that are probably getting confused by rotated text.

https://pdfium.googlesource.com/pdfium/+/master/core/src/fpdftext/fpdf_text_int.cpp

Original comment by jbrei...@google.com on 4 Sep 2015 at 5:42

GoogleCodeExporter commented 8 years ago
I've isolated the problem down to the thresholding at the beginning of both 
ProcessTextObject() methods. They are at lines 1480 and 1246 of 
fpdf_text_int.cpp.

 if (FXSYS_fabs(pTextObj->m_Right - pTextObj->m_Left) < 0.01f) {
    return;
 }
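
To illustrate why this check could explain the observations above, here is a 
minimal sketch. It is not PDFium code. It assumes that m_Right - m_Left is the 
horizontal extent of the text object's bounding box after the Tm matrix is 
applied, and that the invisible font has zero-height glyph boxes. The names 
Matrix, HorizontalExtent and WouldBeSkipped, and the widths used in the note 
below, are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// PDF matrix [a b c d e f]: x' = a*x + c*y + e, y' = b*x + d*y + f.
struct Matrix {
  double a, b, c, d, e, f;
};

// Horizontal extent of the untransformed box (0,0)-(w,h) after applying m.
// Only the x' coordinates of the four corners matter for this check.
double HorizontalExtent(const Matrix& m, double w, double h) {
  const double xs[4] = {
      m.e,                      // corner (0, 0)
      m.a * w + m.e,            // corner (w, 0)
      m.c * h + m.e,            // corner (0, h)
      m.a * w + m.c * h + m.e,  // corner (w, h)
  };
  const auto mm = std::minmax_element(xs, xs + 4);
  return *mm.second - *mm.first;
}

// Mirrors the shape of the quoted check: objects narrower than 0.01 are
// dropped. ASSUMPTION: the real m_Left/m_Right are computed comparably.
bool WouldBeSkipped(const Matrix& m, double w, double h) {
  return std::fabs(HorizontalExtent(m, w, h)) < 0.01;
}
```

Under these assumptions the extent of a zero-height box is |a| * w. With 
"0 -1 1 0 200 500 Tm" it is exactly 0, so the object is skipped. With a = 0.001, 
a hypothetical advance of 20 units gives 0.02 (kept) while 8 units gives 0.008 
(skipped), and raising a to 0.002 gives 0.016 (kept). That matches the 
Hello/Hi pattern. A font with a nonzero glyph height h contributes |c| * h to 
the extent, which would explain why removing the invisible font hides the 
problem.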

Original comment by jbrei...@google.com on 4 Sep 2015 at 6:05

GoogleCodeExporter commented 8 years ago
This patch does the trick for my document, but I don't know if this will expose 
us to trouble from malicious documents.

Original comment by jbrei...@google.com on 4 Sep 2015 at 6:53

Attachments:

GoogleCodeExporter commented 8 years ago
Ready for consideration by PDFium maintainers.

Original comment by jbrei...@google.com on 8 Sep 2015 at 7:25

GoogleCodeExporter commented 8 years ago
Whoops, missed your patch. Please use codereview.chromium.org to upload the 
patch, rather than attaching it here.

Original comment by thestig@chromium.org on 9 Sep 2015 at 7:34

GoogleCodeExporter commented 8 years ago
I am attaching another test image that contains two single-character
text objects. I hope that this is useful for testing. 

BT
3 Tr -0 -1 1 -0 93.296 451.268 Tm /f-0-0 27 Tf 136.15 Tz [ <006D> ] TJ 
ET
BT
3 Tr 1 0 0 1 406.648 15.211 Tm /f-0-0 27 Tf 114.554 Tz [ <0074> ] TJ 
ET

Original comment by jbrei...@google.com on 11 Sep 2015 at 8:24

Attachments: