Open dl-racing opened 2 years ago
Hi @dl-racing
I just gave this a try:
import PyPDF2
print(f"PyPDF2=={PyPDF2.__version__}\n\n")
reader = PyPDF2.PdfReader("missing_newlines.pdf")
print(reader.pages[6].extract_text())
which gives me
PyPDF2==2.11.1
2022 Intelligent Money British GT Championship
TEST SESSION 1 - SECTOR ANALYSIS
SECTOR 1 = FL to I1, SECTOR 2 = I1 to I2, SECTOR 3 = I2 to FL, DIFF = Difference To Personal Best Lap, P = Crossed F inish Line in Pit Lane, D = Time Disallowed
77 Enduro Motorsport P1
LAP LAP TIME DIFF TIME OF DAY SECTOR 1 SECTOR 2 SECTOR 3GT3PA McLaren 720S GT3
MPHIDEAL LAP TIME : 1:26.866 BEST LAP TIME : 1:26.942 DIFFERENCE : 0.076
D1: Morgan TILLBROOK D2: Marcus CLUTTON
1 - D1 11:03:12.126 OUTLAP 116.7 37.750 139.2 35.381 101.0
2 - D1 1:28.713 1.771 11:04:40.839 19.626 144.6 34.707 140.9 34.380 100.4 100.93
3 - D1 1:28.523 1.581 11:06:09.362 19.636 134.4 34.731 140.9 34.156 101.8 101.15
4 - D1 1:27.561 0.619 11:07:36.923 19.362 145.8 34.269 140.9 33.930 101.2 102.26
5 - D1 1:27.515 0.573 11:09:04.438 19.200 146.5 34.323 140.9 33.992 100.7 102.31
6 - D1 1:29.908 2.966 11:10:34.346 P 19.302 146.2 34.527 141.8 IN PIT 99.59
7 - D1 5:09.987 3:43.045 11:15:44.333 OUTLAP 125.9 35.711 140.6 34.663 101.3 28.88
8 - D1 1:28.833 1.891 11:17:13.166 19.579 118.9 35.110 140.3 34.144 101.9 100.80
9 - D1 1:27.324 0.382 11:18:40.490 19.215 145.8 34.132 141.2 33.977 101.6 102.54
10 - D1 1:27.312 0.370 11:20:07.802 (3) 19.167 145.8 34.215 141.2 33.930 101.3 102.55
11 - D1 1:28.904 1.962 11:21:36.706 P 19.194 147.1 34.112 141.5 IN PIT 100.72
12 - D1 3:20.279 1:53.337 11:24:56.985 OUTLAP 142.4 34.842 141.8 41.436 102.7 44.70
13 - D1 1:27.303 0.361 11:26:24.288 (2) 19.093 146.8 33.945 142.1 34.265 101.6 102.56
14 - D1 1:26.942 11:27:51.230 (1) 19.116 145.8 33.990 142.1 33.836 101.5 102.99
15 - D1 1:29.334 2.392 11:29:20.564 P 19.085 146.2 34.645 141.5 IN PIT 100.23
16 - D1 6:47.450 5:20.508 11:36:08.014 OUTLAP 133.9 36.242 139.5 35.600 99.5 21.97
17 - D1 1:32.720 5.778 11:37:40.734 19.852 139.2 37.286 139.8 35.582 100.1 96.57
18 - D1 1:30.155 3.213 11:39:10.889 19.622 143.0 35.781 138.0 34.752 100.1 99.32
19 - D1 1:32.253 5.311 11:40:43.142 19.456 144.9 36.385 141.5 36.412 97.9 97.06
20 - D1 1:30.489 3.547 11:42:13.631 19.514 144.9 34.885 141.2 36.090 96.4 98.95
21 - D1 1:32.417 5.475 11:43:46.048 20.315 133.4 37.147 139.5 34.955 101.5 96.89
22 - D1 1:29.515 2.573 11:45:15.563 19.366 145.2 34.815 140.9 35.334 100.9 100.03
23 - D1 1:31.117 4.175 11:46:46.680 19.308 144.9 36.004 140.9 35.805 101.6 98.27
24 - D1 1:41.788 14.846 11:48:28.468 19.251 144.3 46.134 139.5 36.403 102.9 87.97
25 - D1 1:28.198 1.256 11:49:56.666 19.328 144.6 34.575 140.9 34.295 100.4 101.52
26 - D1 1:30.756 3.814 11:51:27.422 19.288 145.2 36.824 140.3 34.644 100.6 98.66
27 - D1 1:29.093 2.151 11:52:56.515 19.644 143.7 35.093 141.2 34.356 101.3 100.50
28 - D1 1:28.870 1.928 11:54:25.385 19.250 145.5 35.408 141.8 34.212 101.5 100.75
29 - D1 1:29.468 2.526 11:55:54.853 19.294 146.5 34.896 141.8 35.278 101.5 100.08
Results can be found at www.tsl-timing.com Page 1 of 16 Printed - 12:02 Thursday, 26 May 2022Date: 26/05/2022 Start: 11:00 Finish: 11:55Weather / Track : / Donington Park GP: 2.4873 miles
That actually looks fine. Could it be that you're using an older PyPDF2 version?
Please try pip install PyPDF2 --upgrade
to check :-)
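To confirm which version is installed after upgrading, a small comparison helper can be used. This is only a sketch: `version_tuple` and `is_at_least` are hypothetical helper names, and 2.11.1 is simply the version that produced the output above, not necessarily the first release with the fix.

```python
import re


def version_tuple(version):
    """Turn a string such as '2.11.1' into (2, 11, 1).

    Only the leading digits of each dot-separated part are used;
    parsing stops at the first part without leading digits.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Compare versions numerically, so that '2.9.0' sorts before '2.11.1'."""
    return version_tuple(installed) >= version_tuple(minimum)


# Usage (assumes PyPDF2 is installed):
#   import PyPDF2
#   print(is_at_least(PyPDF2.__version__, "2.11.1"))
```

Comparing tuples of integers avoids the trap of comparing the raw strings, where "2.9.0" would sort after "2.11.1".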
@dl-racing For your project, a layout-preserving text extraction might be the best fit. pdftotext
from https://poppler.freedesktop.org/ offers that:
pdftotext -layout -f 7 -l 7 missing_newlines.pdf
gives
2022 Intelligent Money British GT Championship
TEST SESSION 1 - SECTOR ANALYSIS
SECTOR 1 = FL to I1, SECTOR 2 = I1 to I2, SECTOR 3 = I2 to FL, DIFF = Difference To Personal Best Lap, P = Crossed Finish Line in Pit Lane, D = Time Disallowed
P1 77 GT3PA Enduro Motorsport McLaren 720S GT3
IDEAL LAP TIME : 1:26.866 BEST LAP TIME : 1:26.942 DIFFERENCE : 0.076
D1: Morgan TILLBROOK D2: Marcus CLUTTON
LAP SECTOR 1 SECTOR 2 SECTOR 3 LAP TIME MPH DIFF TIME OF DAY
1 - D1 OUTLAP 116.7 37.750 139.2 35.381 101.0 11:03:12.126
2 - D1 19.626 144.6 34.707 140.9 34.380 100.4 1:28.713 100.93 1.771 11:04:40.839
3 - D1 19.636 134.4 34.731 140.9 34.156 101.8 1:28.523 101.15 1.581 11:06:09.362
4 - D1 19.362 145.8 34.269 140.9 33.930 101.2 1:27.561 102.26 0.619 11:07:36.923
5 - D1 19.200 146.5 34.323 140.9 33.992 100.7 1:27.515 102.31 0.573 11:09:04.438
6 - D1 19.302 146.2 34.527 141.8 IN PIT 1:29.908 P 99.59 2.966 11:10:34.346
7 - D1 OUTLAP 125.9 35.711 140.6 34.663 101.3 5:09.987 28.88 3:43.045 11:15:44.333
8 - D1 19.579 118.9 35.110 140.3 34.144 101.9 1:28.833 100.80 1.891 11:17:13.166
9 - D1 19.215 145.8 34.132 141.2 33.977 101.6 1:27.324 102.54 0.382 11:18:40.490
10 - D1 19.167 145.8 34.215 141.2 33.930 101.3 1:27.312 (3) 102.55 0.370 11:20:07.802
11 - D1 19.194 147.1 34.112 141.5 IN PIT 1:28.904 P 100.72 1.962 11:21:36.706
12 - D1 OUTLAP 142.4 34.842 141.8 41.436 102.7 3:20.279 44.70 1:53.337 11:24:56.985
13 - D1 19.093 146.8 33.945 142.1 34.265 101.6 1:27.303 (2) 102.56 0.361 11:26:24.288
14 - D1 19.116 145.8 33.990 142.1 33.836 101.5 1:26.942 (1) 102.99 11:27:51.230
15 - D1 19.085 146.2 34.645 141.5 IN PIT 1:29.334 P 100.23 2.392 11:29:20.564
16 - D1 OUTLAP 133.9 36.242 139.5 35.600 99.5 6:47.450 21.97 5:20.508 11:36:08.014
17 - D1 19.852 139.2 37.286 139.8 35.582 100.1 1:32.720 96.57 5.778 11:37:40.734
18 - D1 19.622 143.0 35.781 138.0 34.752 100.1 1:30.155 99.32 3.213 11:39:10.889
19 - D1 19.456 144.9 36.385 141.5 36.412 97.9 1:32.253 97.06 5.311 11:40:43.142
20 - D1 19.514 144.9 34.885 141.2 36.090 96.4 1:30.489 98.95 3.547 11:42:13.631
21 - D1 20.315 133.4 37.147 139.5 34.955 101.5 1:32.417 96.89 5.475 11:43:46.048
22 - D1 19.366 145.2 34.815 140.9 35.334 100.9 1:29.515 100.03 2.573 11:45:15.563
23 - D1 19.308 144.9 36.004 140.9 35.805 101.6 1:31.117 98.27 4.175 11:46:46.680
24 - D1 19.251 144.3 46.134 139.5 36.403 102.9 1:41.788 87.97 14.846 11:48:28.468
25 - D1 19.328 144.6 34.575 140.9 34.295 100.4 1:28.198 101.52 1.256 11:49:56.666
26 - D1 19.288 145.2 36.824 140.3 34.644 100.6 1:30.756 98.66 3.814 11:51:27.422
27 - D1 19.644 143.7 35.093 141.2 34.356 101.3 1:29.093 100.50 2.151 11:52:56.515
28 - D1 19.250 145.5 35.408 141.8 34.212 101.5 1:28.870 100.75 1.928 11:54:25.385
29 - D1 19.294 146.5 34.896 141.8 35.278 101.5 1:29.468 100.08 2.526 11:55:54.853
Weather / Track : / Donington Park GP: 2.4873 miles
Date: 26/05/2022 Start: 11:00 Finish: 11:55
Results can be found at www.tsl-timing.com Page 1 of 16 Printed - 12:02 Thursday, 26 May 2022
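If the extraction is part of a Python project, the same pdftotext call can be made through `subprocess`. The following is a minimal sketch, assuming the `pdftotext` binary from Poppler is installed and on the PATH; the function names `pdftotext_command` and `extract_page_layout` are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, page):
    """Build the argument list for a single-page, layout-preserving extraction."""
    # "-" as the output file tells pdftotext to write the text to stdout.
    return ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"]


def extract_page_layout(pdf_path, page):
    """Return the text of one page (1-based) with the layout preserved."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install Poppler first")
    result = subprocess.run(
        pdftotext_command(pdf_path, page),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


# Usage, mirroring the command above:
#   print(extract_page_layout("missing_newlines.pdf", 7))
```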
@pubpub-zz / @srogmann Just out of curiosity: Do you think such a layout-preserving mode could be possible with PyPDF2 as well?
I'm uncertain what that would entail and how often users would prefer it compared to the current "reading-flow" extraction mode. This is especially important when there is a multi-column layout (not tables, but actual text columns).
For tables, I think the layout preserving mode is pretty much always desirable. However, I don't see how we could reliably detect that there is a table.
Many thanks Martin, both files now work correctly. FWIW I think layout preservation is very important, as the layout often carries the meaning/context for a piece of data. The '37.750' is part of the 'SECTOR 2' class of data by virtue of its position directly underneath the phrase 'SECTOR 2'. If 'SECTOR 2' was sat underneath another field, the 37.750 would inherit not only 'SECTOR 2' but also the field above it.
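To illustrate why position matters, here is a rough sketch of assigning values to columns by where they sit under the header. It assumes layout-preserved text in which columns are separated by at least two spaces; the sample lines are made up for illustration (not taken from the PDFs), and `spans` and `assign_to_columns` are hypothetical helper names.

```python
import re

# A token is a run of words separated by single spaces, so "SECTOR 1"
# and "LAP TIME" each count as one token.
TOKEN = re.compile(r"\S+(?: \S+)*")


def spans(line):
    """Return (text, start, end) for every token in a line."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(line)]


def assign_to_columns(header, row):
    """Map each value in row to the header label whose centre is nearest."""
    columns = spans(header)
    record = {name: None for name, _, _ in columns}
    for value, start, end in spans(row):
        centre = (start + end) / 2
        name = min(columns, key=lambda c: abs((c[1] + c[2]) / 2 - centre))[0]
        record[name] = value
    return record


# Made-up fixed-width lines: column widths 6, 11, 11 and 8 characters.
header = f"{'LAP':<6}{'SECTOR 1':<11}{'SECTOR 2':<11}{'LAP TIME':<8}"
row = f"{'1':<6}{'':<11}{'37.750':<11}{'':<8}"

print(assign_to_columns(header, row))
```

With the whitespace intact, the empty 'SECTOR 1' and 'LAP TIME' cells stay empty (`None`) and 37.750 lands under 'SECTOR 2'. Once the spaces are collapsed to "1 37.750", that information is gone.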
I'm happy to hear that it works now!
Just to make sure I've got it right: The upgrade of PyPDF2 did the trick with the newlines, right? So the newlines work, but the whitespace is still something we could improve. Right?
I believe so. Annoyingly, I didn’t document the output and I haven’t been able to replicate it since…
Whitespace and, for sure, the ordering/layout. The column order changes, and because whitespace is truncated you can’t detect missing values in place…
@pubpub-zz / @srogmann Just out of curiosity: Do you think such a layout-preserving mode could be possible with PyPDF2 as well?
I'm uncertain what that would entail and how often users would prefer it compared to the current "reading-flow" extraction mode. This is especially important when there is a multi-column layout (not tables, but actual text columns).
For tables, I think the layout preserving mode is pretty much always desirable. However, I don't see how we could reliably detect that there is a table.
I hope to be able to do so (that was part of my roadmap in https://github.com/py-pdf/PyPDF2/discussions/1181#discussioncomment-3413144). I'm just finishing my current PR and this will be my next job 😀
Very nice!
I'm closing this issue now as the original problem was solved by upgrading. I'll use the files to create a test / benchmark so that we can track our progress in the layout presentation area :-)
Thank you @dl-racing and @pubpub-zz for your input and the nice discussion ❤️
I've come across another erroneous example (even with the upgraded library).
Page 8, Free Practice 1 SECTOR ANALYSIS (I've attached the page of interest, but the full PDF is available here: https://www.tsl-timing.com/file/?f=BF3GT/2022/221805bgt.pdf)
page_8_extracted_from_full_pdf
@MartinThoma I've posted here instead of opening a new ticket as keeping the two cases together might be useful...can we reopen this ticket?
pdftotext works very well for my use case, but I'd like to help fix this case for PyPDF2 :)
The whitespace issue has been resolved for this case. Text needs to be placed by its position on the page, not by the order in which it appears in the content stream (BT/ET order). I think the current approach of simply appending output could be changed to build a list with row and column position information. Is this something you would like to see addressed?
If so, is this a new feature?
Sorry, you mentioned a bug with whitespace in layout mode. My mistake.
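The idea of placing text by position rather than by stream order could look roughly like the sketch below. It is not PyPDF2's implementation: it assumes the text fragments are already available as `(x, y, text)` tuples in PDF points (how to obtain them is not shown), and `layout_text`, `char_width` and `row_tolerance` are hypothetical names and values chosen for illustration.

```python
def layout_text(fragments, char_width=5.0, row_tolerance=2.0):
    """Render (x, y, text) fragments onto a character grid.

    Fragments are grouped into rows by their y coordinate and ordered
    by x within each row, regardless of the order they arrive in.
    """
    rows = []  # each entry: [y of first fragment in the row, [(x, text), ...]]
    # PDF y coordinates grow upwards, so the top of the page comes first.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= row_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])

    lines = []
    for _, items in rows:
        line = ""
        for x, text in sorted(items):
            col = int(round(x / char_width))
            # Never overwrite earlier text; keep at least one space between fragments.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)


# Fragments deliberately given out of reading order.
fragments = [
    (100, 700, "37.750"),
    (0, 700, "1"),
    (0, 720, "LAP"),
    (100, 720, "SECTOR 2"),
]
print(layout_text(fragments))
```

A fixed `char_width` only works well for roughly uniform fonts; a real implementation would likely need the font metrics of each fragment.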
I'm raising this issue as a result of a super useful (and helpful!) chat with @MartinThoma.
For simplicity, I am trying to extract the first page of the 'SECTOR ANALYSIS' sections from both the attached PDFs.
One file (correct_newlines.pdf) produces each row as expected as a new line of text (albeit the columns are in a different but consistent order).
The other file (missing_newlines.pdf) has very similar data but produces fewer lines of text, with multiple lines concatenated without spaces between.
correct_newlines.pdf missing_newlines.pdf
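A quick way to spot concatenated rows in the extracted text is to look for lines that contain more than one lap marker. This is a heuristic sketch only: it assumes every lap row starts with a marker such as "1 - D1" (as in these timing sheets), `fused_lines` is a hypothetical helper name, and the fused sample is made up, since the exact faulty output is not recorded here.

```python
import re

# Lap rows in these sheets start with "<lap number> - D<driver number>".
LAP_MARKER = re.compile(r"\d+ - D\d")


def fused_lines(text):
    """Return the lines that appear to contain more than one lap row."""
    return [
        line
        for line in text.splitlines()
        if len(LAP_MARKER.findall(line)) > 1
    ]


good = "1 - D1 11:03:12.126 OUTLAP\n2 - D1 1:28.713 1.771"
fused = "1 - D1 11:03:12.126 OUTLAP 101.02 - D1 1:28.713 1.771"

print(fused_lines(good))   # no suspicious lines
print(fused_lines(fused))  # the single fused line is reported
```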