Closed by pubpub-zz 1 year ago
I've removed the whitespace tag: this issue deals with line returns.
Whitespace includes newlines. I just edited the description of the tag to make that explicit.
To me those space / newline issues look related, as I think we touch similar parts of the code and the types of issues the users have are similar. Am I wrong about that?
The issue comes from cm being modified at the "same time" as Tm:
q
1 0 0 1 2.125 0 cm
0 g
BT
/F3 8 Tf
1 0 0 -1 0 8.969 Tm
[ (Company:) ] TJ
ET
Q
q
1 0 0 1 83.125 0 cm
0 g
BT
/F1 8 Tf
1 0 0 -1 0 8.969 Tm
[ (AMERICAN EAGLE OUTFITTERS) ] TJ
ET
Q
q
1 0 0 1 2.125 13.85 cm
0 g
BT
/F3 8 Tf
1 0 0 -1 0 8.969 Tm
[ (Division / Dept:) ] TJ
ET
Q
q
1 0 0 1 83.125 13.85 cm
0 g
BT
/F1 8 Tf
1 0 0 -1 0 8.969 Tm
[ (50 / 170) ] TJ
ET
Q
In order to get the actual text position we need to compare tm.cm to tm_prev.cm_prev (cm_prev is currently not saved). The key point is the change merged from #2060: we are passing tm_prev together with the current cm_matrix, which is not consistent.
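To make the point concrete, here is a minimal sketch that combines each Tm with its cm for the four strings in the content stream above. The helper names `mult` and `text_origin` are hypothetical and written for this illustration; they are not taken from pypdf's code. The sketch only assumes the standard PDF convention that `[a b c d e f]` stands for the matrix `[[a, b, 0], [c, d, 0], [e, f, 1]]`.

```python
# PDF matrices are written [a b c d e f] and stand for the 3x3 matrix
# [[a, b, 0], [c, d, 0], [e, f, 1]].


def mult(m, n):
    """Return the product m x n of two PDF matrices given as 6-tuples."""
    return (
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    )


def text_origin(tm, cm):
    """Hypothetical helper: origin of the text after applying tm, then cm."""
    combined = mult(tm, cm)
    return combined[4], combined[5]


# (cm, tm, text) copied from the content stream quoted above.
TM = (1, 0, 0, -1, 0, 8.969)
fragments = [
    ((1, 0, 0, 1, 2.125, 0), TM, "Company:"),
    ((1, 0, 0, 1, 83.125, 0), TM, "AMERICAN EAGLE OUTFITTERS"),
    ((1, 0, 0, 1, 2.125, 13.85), TM, "Division / Dept:"),
    ((1, 0, 0, 1, 83.125, 13.85), TM, "50 / 170"),
]

# Looking at tm alone: every string appears to sit at the same place.
tm_only = [(tm[4], tm[5]) for _, tm, _ in fragments]

# Looking at tm.cm: two distinct rows with two columns each.
origins = [text_origin(tm, cm) for cm, tm, _ in fragments]

for (x, y), (_, _, text) in zip(origins, fragments):
    print(f"{text!r}: x={x:.3f}, y={y:.3f}")
```

With tm alone, all four strings share one position, so no line change can be detected. With tm.cm, the first two strings share a y value and the last two share another, which is the row structure visible in the document. Comparing a previous tm against a cm from a different fragment would mix positions from different coordinate systems.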
@yonglee7015, the PR is now OK if you want to test it.
Yes, how can I test it?
https://github.com/py-pdf/pypdf/pull/2142 is the PR
git clone https://github.com/pubpub-zz/PyPDF2.git pypdf-pubpub
cd pypdf-pubpub
git checkout iss2138
pip install -e .
I think I should document this somewhere 🤔
@MartinThoma does this trick work?
pip install git+https://github.com/pubpub-zz/PyPDF2.git@iss2138
Yes! I completely forgot about that!
By the way: Could you please rename it from PyPDF2 to pypdf? It might be confusing to others if they see PyPDF2.
Oops. I did not know how to do it before.
@yonglee7015
the instructions should now be:
pip install git+https://github.com/pubpub-zz/pypdf.git@iss2138
Thanks 😊 I will test it.
Hi @pubpub-zz, yes, it works. Thanks for your help.
Can you also test page 3 of this PDF file? You will find that the order of the extracted text is not correct:
1. The first line of text in the PDF goes to the last line of the output text.
2. The order of the text in the table is not correct.
Can you fix this?
I tried another library, tika-python. Its table text order is correct, but the first line also goes to the last line of the output text, as with yours.
You have reached the limit of pypdf's current implementation: strings are extracted in the order in which they were "inserted" into the document. When you print a document, it is printed from top to bottom, but a PDF works more like a 2D plotter, which can draw top left, then bottom right, before reaching the middle. extract_text gets the text in the order it is plotted, so the reading order is not guaranteed. It is much more difficult in your case, as you are working on documents.
Sorry, there is no solution for the moment with pypdf. 😞
Oh no, that's a pity. Thank you.
I have not tested it, but shouldn't a visitor be able to fix the order on the user side in this case? https://pypdf.readthedocs.io/en/latest/user/extract-text.html#using-a-visitor
It is more complex: you need to know whether there are columns, what the coordinates are... Maybe AI could help...
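For reference, the visitor idea could look roughly like the sketch below for a simple single-column page. It uses the `visitor_text` argument of `extract_text` described in the linked documentation; `order_fragments` and `extract_ordered` are hypothetical helpers written for this illustration. It assumes that y grows upward, that tm combined with cm gives a usable position, and that fragments within `line_tolerance` units belong to the same line. It does not detect columns or table cells, which is exactly the hard part mentioned above, and it has not been tested against the attached file.

```python
def order_fragments(fragments, line_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right.

    Assumption: y grows upward, so larger y means higher on the page.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


def extract_ordered(path, page_index=0):
    """Hypothetical helper: extract one page and reorder its fragments."""
    from pypdf import PdfReader  # imported here so order_fragments has no dependency

    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        if text.strip():
            # Assumption: the translation part of tm x cm approximates
            # where the fragment starts on the page.
            x = tm[4] * cm[0] + tm[5] * cm[2] + cm[4]
            y = tm[4] * cm[1] + tm[5] * cm[3] + cm[5]
            fragments.append((x, y, text.strip()))

    PdfReader(path).pages[page_index].extract_text(visitor_text=visitor)
    return order_fragments(fragments)


# Fragments deliberately listed out of reading order.
sample = [
    (83.125, 100.0, "AMERICAN EAGLE OUTFITTERS"),
    (2.125, 100.0, "Company:"),
    (2.125, 120.0, "Title"),
    (2.125, 86.0, "Division / Dept:"),
]
print(order_fragments(sample))
```

A page could then be processed with `extract_ordered("AEO.1172.pdf", page_index=2)`, assuming the attached file has been downloaded under that name.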
PDF file: https://github.com/py-pdf/pypdf/files/12483807/AEO.1172.pdf
Can you also test the page.extract_text() function? It always seems to join multi-line sentences without a space. See the first page of my attached file.
Originally posted by @yonglee7015 in https://github.com/py-pdf/pypdf/discussions/2135#discussioncomment-6872585