py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

BUG: Added line-breaks at dashes #234

Closed rnzucker closed 3 days ago

rnzucker commented 8 years ago

I've been trying out PyPDF2 and encountered cases where it is skipping text. It has no problem with one file (https://github.com/rnzucker/MadLib/blob/master/test-1.pdf), beyond adding newlines at 80 characters. But with another one (https://github.com/rnzucker/MadLib/blob/master/test-2.pdf, the beginning of a newspaper editorial), it starts with the "-time" from "prime-time" in the first line. It also skipped other text in the file. My code is very simple:

from PyPDF2 import PdfReader

# Open the PDF and concatenate the extracted text of every page.
reader = PdfReader("test-1.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text()
print(text)

JeremyMMulcahey commented 8 years ago

I'm having the same issue with transcripts. Some sections of dialogue are missing the first 1-3 lines when the speakers alternate in a conversation.

The conversational format is: Speaker1: Speaker2: Speaker1:

Has there been any progress on this issue? I'll poke around the package and see if anything jumps out.

MartinThoma commented 2 years ago

@rnzucker Would it be OK with you if I added those files to PyPDF2 (Resources) so that we can keep testing? (Under the package's BSD license.)

rnzucker commented 2 years ago

Totally fine. They are just snippets of newspaper articles.

MartinThoma commented 2 years ago

Note to myself: test-2 produces a newline where there shouldn't be one. No text is missing (anymore).
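Until the extraction itself is fixed, a caller could work around the stray newline after a dash by post-processing the extracted string. The following is only a minimal sketch under that assumption: `rejoin_dash_breaks` is a hypothetical helper, not part of PyPDF2, and it will also join lines where a break after a dash was intended.

```python
import re


def rejoin_dash_breaks(text: str) -> str:
    """Join lines that were split right after a hyphen (hypothetical helper)."""
    # Match: word character, optional spaces, "-", optional spaces, a newline,
    # optional indentation, then another word character. Everything between
    # the two word characters is collapsed to a single "-".
    return re.sub(r"(?<=\w)[ \t]*-[ \t]*\r?\n[ \t]*(?=\w)", "-", text)


sample = "attention -\ngetting but illogical tax plans"
print(rejoin_dash_breaks(sample))  # attention-getting but illogical tax plans
```

Ordinary line breaks that do not follow a dash are left untouched, so paragraph structure survives.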

The test-2.pdf is the following article of the New York Times from 2015: https://www.nytimes.com/2015/11/12/opinion/waiting-for-the-republican-shakeout.html -- I'm uncertain if we may add it.

pubpub-zz commented 2 years ago

These are the results with PR #1084 for test-2:

Watching Tuesday’s Republican presidential debate, with the eight prime -time contenders 
talking over and past one another, the question arises: Should the party show a fe w of these 
candidates the door?  
Some fret that this mash -up lacks seriousness. The Republican National Committee says it won’t 
intervene. It is relying on voters to usher also -rans off the national stage , and that may be a good 
thing.  
Americans won’t pay full attention to the presidenti al campaign for weeks. By the time they do, 
debates and media exposure will have made for worthy vetting of these candidates’ attention -
getting but illogical tax plans, their dubious statements, and that most symbolic but ridiculous of 
qualifications, thei r early biographies. Gov. Scott Walker’s exit suggests that fears of “super 
PAC” money’s keeping flawed candida tes afloat may not materialize.  
A number of conservative thinkers believe the shedding of vestigial candidates will happen soon 
enough. In a com ing book, Henry Olsen of the Ethics and Public Policy Center in Washington 
divides the Republican electorate into “four discrete factions that are based primarily on 
ideology, with elements of class and religious background tempering that focus.”  

The extra spaces are introduced by Tm repositioning. I don't currently have an easy way to identify this as a 'simple' text repositioning without a space.
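One way to think about the decision is as a gap test: emit a space only when the horizontal jump caused by the repositioning is large compared with the width of a space glyph. The sketch below is illustrative only and is not how pypdf or PR #1084 implements it. The function `implies_space`, its parameters, and the 0.5 threshold are all assumptions. It also presumes that the end position of the previous text run is already known, which requires glyph widths and is probably the hard part.

```python
def implies_space(prev_end_x: float, next_start_x: float,
                  space_width: float, threshold: float = 0.5) -> bool:
    """Guess whether a horizontal repositioning stands for a word gap.

    All arguments are in the same text-space units. Hypothetical helper:
    the 0.5 threshold is an arbitrary illustrative value.
    """
    gap = next_start_x - prev_end_x
    # A tiny (or negative) gap is treated as kerning-like repositioning,
    # a gap of at least half a space width as a real word separator.
    return gap >= threshold * space_width


# Assumed numbers: a space glyph 2.5 units wide.
print(implies_space(100.0, 100.25, 2.5))  # False: "prime" + "-time" stay joined
print(implies_space(100.0, 102.5, 2.5))   # True: a full space width apart
```

A fixed threshold like this will misjudge tightly or loosely tracked text, so it should be read as a starting point for experiments rather than a fix.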

stefan6419846 commented 3 days ago

According to https://github.com/py-pdf/pypdf/pull/2882#issuecomment-2391291908, this has just been fixed.