py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Line returns missing in text_extraction() #2138

Closed pubpub-zz closed 1 year ago

pubpub-zz commented 1 year ago

PDF file: https://github.com/py-pdf/pypdf/files/12483807/AEO.1172.pdf

Can you also test the page.extract_text() function? It always seems to join the lines of a multi-line sentence without a space. See the first page of my attached file.

Originally posted by @yonglee7015 in https://github.com/py-pdf/pypdf/discussions/2135#discussioncomment-6872585

pubpub-zz commented 1 year ago

I've removed the whitespace tag: this issue deals with line returns.

MartinThoma commented 1 year ago

Whitespace includes newlines. I just edited the description of the tag to make that explicit.

To me, those space/newline issues look related: I think we touch similar parts of the code, and the types of issues users have are similar. Am I wrong about that?

pubpub-zz commented 1 year ago

The issue comes from cm being modified at the "same time" as Tm:

      q
        1 0 0 1 2.125 0 cm
        0 g
        BT
          /F3 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (Company:) ] TJ
        ET
      Q
      q
        1 0 0 1 83.125 0 cm
        0 g
        BT
          /F1 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (AMERICAN EAGLE OUTFITTERS) ] TJ
        ET
      Q
      q
        1 0 0 1 2.125 13.85 cm
        0 g
        BT
          /F3 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (Division / Dept:) ] TJ
        ET
      Q
      q
        1 0 0 1 83.125 13.85 cm
        0 g
        BT
          /F1 8 Tf
          1 0 0 -1 0 8.969 Tm
          [ (50 / 170) ] TJ
        ET
      Q

In order to get the actual text position, we need to compare tm.cm to tm_prev.cm_prev (cm_prev is currently not saved). The big point is about the change merged from #2060: we are passing tm_prev together with cm_matrix, which is not consistent.
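To illustrate why Tm alone is not enough here, this is a minimal sketch that combines Tm with cm for the four blocks in the content stream above. The helper names (`multiply`, `text_origins`) are hypothetical and are not part of pypdf; the matrices use the PDF `[a, b, c, d, e, f]` convention, and the combined position is taken as Tm x cm.

```python
def multiply(m1, m2):
    """Return m1 x m2 for PDF matrices written as [a, b, c, d, e, f]."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


# Tm is identical in all four blocks of the stream above.
TM = [1, 0, 0, -1, 0, 8.969]

# (text, cm) pairs copied from the stream above.
BLOCKS = [
    ("Company:", [1, 0, 0, 1, 2.125, 0]),
    ("AMERICAN EAGLE OUTFITTERS", [1, 0, 0, 1, 83.125, 0]),
    ("Division / Dept:", [1, 0, 0, 1, 2.125, 13.85]),
    ("50 / 170", [1, 0, 0, 1, 83.125, 13.85]),
]


def text_origins(blocks, tm=TM):
    """Return (text, x, y) using the combined matrix Tm x cm."""
    origins = []
    for text, cm in blocks:
        combined = multiply(tm, cm)
        origins.append((text, combined[4], combined[5]))
    return origins


for text, x, y in text_origins(BLOCKS):
    print(f"{text!r}: x={x:.3f}, y={y:.3f}")
```

Looking at Tm alone, every block sits at y = 8.969, so no line return is visible. Once cm is applied, the last two blocks land 13.85 units further along y than the first two, which is the line change that needs to be detected.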

pubpub-zz commented 1 year ago

@yonglee7015, the PR is now OK if you want to test it.

yonglee7015 commented 1 year ago

Yes, how can I test it?

MartinThoma commented 1 year ago

https://github.com/py-pdf/pypdf/pull/2142 is the PR

  1. Get the git repository: git clone https://github.com/pubpub-zz/PyPDF2.git pypdf-pubpub
  2. Go into the directory: cd pypdf-pubpub
  3. Checkout the branch: git checkout iss2138
  4. Install that version: pip install -e .
  5. Run your code with that version. Make sure you are really using that version and not, for example, a different environment with another pypdf installed
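For step 5, one way to check which copy of the package Python actually picks up is sketched below. The helper `describe_install` is a hypothetical name; it only relies on the standard library (`importlib.util.find_spec` and `importlib.metadata.version`), and it assumes the package is importable under the name `pypdf`.

```python
import importlib.metadata
import importlib.util


def describe_install(package="pypdf"):
    """Return version and file location of an importable package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None:
        # The package cannot be imported from this environment at all.
        return None
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no installed distribution metadata under that name.
        version = None
    return {"version": version, "location": spec.origin}


print(describe_install())
```

With an editable install (`pip install -e .`), the reported location should point into the cloned directory rather than into `site-packages`.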

I think I should document this somewhere :thinking:

pubpub-zz commented 1 year ago

@MartinThoma does this trick work? pip install git+https://github.com/pubpub-zz/PyPDF2.git@iss2138

MartinThoma commented 1 year ago

Yes! I completely forgot about that!

MartinThoma commented 1 year ago

By the way: Could you please rename it from PyPDF2 to pypdf? It might be confusing to others if they see PyPDF2.

pubpub-zz commented 1 year ago

Oops, I didn't know how to do it before. @yonglee7015, the instructions should now be: pip install git+https://github.com/pubpub-zz/pypdf.git@iss2138

yonglee7015 commented 1 year ago

Thanks 😊 I will test it.

yonglee7015 commented 1 year ago

Hi @pubpub-zz, yes, it works. Thanks for your help.

Can you also test this PDF file? On page 3, you will find that the order of the extracted text is not correct.

1. The first line of text in the PDF ends up as the last line of the output text.

2. The order of the text in the table is not correct.

Can you fix this?

I tried another library, tika-python. Its order for the text in the table is correct, but the first line also ends up as the last line of the output text, as with yours.

test.pdf

pubpub-zz commented 1 year ago

You have reached the limit of pypdf's current implementation: strings are extracted in the order in which they were "inserted" into the document. When you print a document, it is printed from top to bottom, but a PDF works more like a 2D plotter, which can draw the top left, then the bottom right, before reaching the middle. extract_text() returns the text in the order it is plotted, so the order is not guaranteed. It is much more difficult in your case as you are working on documents.

Sorry, there is no solution for the moment with pypdf. 😞

yonglee7015 commented 1 year ago

Oh, no. That's a pity. Thank you.

stefan6419846 commented 1 year ago

I have not tested it, but shouldn't a visitor be able to fix the order on the user side in this case? https://pypdf.readthedocs.io/en/latest/user/extract-text.html#using-a-visitor

pubpub-zz commented 1 year ago

It is more complex: you need to know whether there are columns, what their coordinates are... Maybe AI could help...
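For reference, a naive version of the visitor idea could look like the sketch below. It is untested against the PDFs in this thread and deliberately ignores the column problem raised above. It assumes the documented `visitor_text` callback of `page.extract_text()`, called as `(text, cm, tm, font_dict, font_size)`; the helper names, the `y_tolerance` value and the `y_increases_downward` flag are hypothetical choices, not pypdf features.

```python
def fragment_position(cm, tm):
    """Translation part of Tm x cm, used as the (x, y) origin of a fragment."""
    x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
    y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
    return x, y


def collect_fragments(page):
    """Collect (x, y, text) tuples from a pypdf page through a text visitor."""
    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        stripped = text.strip()
        if stripped:
            x, y = fragment_position(cm, tm)
            fragments.append((x, y, stripped))

    page.extract_text(visitor_text=visitor)
    return fragments


def reorder(fragments, y_tolerance=2.0, y_increases_downward=False):
    """Group fragments into lines by y, then sort each line by x."""
    # Default PDF space has y growing upward, so the top of the page has the
    # largest y; some documents flip the axis, hence the flag.
    ordered = sorted(
        fragments, key=lambda f: f[1], reverse=not y_increases_downward
    )
    lines = []
    for frag in ordered:
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return "\n".join(
        " ".join(f[2] for f in sorted(line, key=lambda f: f[0]))
        for line in lines
    )
```

Usage would be along the lines of `print(reorder(collect_fragments(reader.pages[2])))`, where `reader` is a `pypdf.PdfReader`. On multi-column pages or tables with wrapped cells, this simple line grouping will interleave unrelated text, which is exactly the difficulty described above.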