rnzucker closed this issue 3 days ago
I'm having the same issue with transcripts. Some sections of dialogue are missing the first 1-3 lines when the speakers alternate in a conversation.
The conversational format is: Speaker1: Speaker2: Speaker1:
Has there been any progress on this issue? I'll poke around the package and see if anything jumps out.
@rnzucker Would it be OK with you if I added those files to PyPDF2 (Resources) so that we can keep testing? (Under the package's BSD license.)
Totally fine. They are just snippets of newspaper articles.
Note to myself: test-2 produces a newline where there shouldn't be one. No text is missing (anymore).
The test-2.pdf is the following article of the New York Times from 2015: https://www.nytimes.com/2015/11/12/opinion/waiting-for-the-republican-shakeout.html -- I'm uncertain if we may add it.
These are the results with PR #1084 for test-2:
Watching Tuesday’s Republican presidential debate, with the eight prime -time contenders
talking over and past one another, the question arises: Should the party show a fe w of these
candidates the door?
Some fret that this mash -up lacks seriousness. The Republican National Committee says it won’t
intervene. It is relying on voters to usher also -rans off the national stage , and that may be a good
thing.
Americans won’t pay full attention to the presidenti al campaign for weeks. By the time they do,
debates and media exposure will have made for worthy vetting of these candidates’ attention -
getting but illogical tax plans, their dubious statements, and that most symbolic but ridiculous of
qualifications, thei r early biographies. Gov. Scott Walker’s exit suggests that fears of “super
PAC” money’s keeping flawed candida tes afloat may not materialize.
A number of conservative thinkers believe the shedding of vestigial candidates will happen soon
enough. In a com ing book, Henry Olsen of the Ethics and Public Policy Center in Washington
divides the Republican electorate into “four discrete factions that are based primarily on
ideology, with elements of class and religious background tempering that focus.”
The extra spaces are introduced by Tm repositioning. I don't currently have an easy way to identify this as a 'simple' text repositioning without a space.
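To make the problem concrete, here is a minimal sketch of one possible heuristic. It is not what PyPDF2 or pypdf actually does. It assumes each text fragment on a line has already been reduced to a hypothetical `Run` record holding its start and end x-coordinates, and it inserts a space only when the horizontal gap between two fragments exceeds a fraction of the font size. The `Run` type, the `join_runs` helper, and the 0.2 threshold are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    """One text fragment on a single line (hypothetical structure)."""
    text: str
    x_start: float    # x-coordinate where the fragment begins
    x_end: float      # x-coordinate where the fragment ends
    font_size: float


def join_runs(runs: List[Run], gap_ratio: float = 0.2) -> str:
    """Join fragments left to right, adding a space only for a visible gap.

    gap_ratio is an assumed threshold: a gap wider than
    gap_ratio * font_size is treated as a word break.
    """
    parts: List[str] = []
    prev: Optional[Run] = None
    for run in sorted(runs, key=lambda r: r.x_start):
        if prev is not None:
            gap = run.x_start - prev.x_end
            already_spaced = prev.text.endswith(" ") or run.text.startswith(" ")
            if gap > gap_ratio * run.font_size and not already_spaced:
                parts.append(" ")
        parts.append(run.text)
        prev = run
    return "".join(parts)


# A repositioning with almost no gap ("prime" -> "-time") adds no space,
# while a wider gap before "contenders" does.
line = [
    Run("prime", 100.0, 128.0, 12.0),
    Run("-time", 128.5, 152.0, 12.0),
    Run("contenders", 156.0, 210.0, 12.0),
]
print(join_runs(line))  # prints "prime-time contenders"
```

The hard part in practice is getting reliable end coordinates for each fragment, since that requires glyph widths from the font; the sketch simply takes them as given.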
According to https://github.com/py-pdf/pypdf/pull/2882#issuecomment-2391291908, this has just been fixed.
I've been trying out PyPDF2 and encountered cases where it is skipping text. It has no problem with one file (https://github.com/rnzucker/MadLib/blob/master/test-1.pdf), beyond adding newlines at 80 characters. But with another one (https://github.com/rnzucker/MadLib/blob/master/test-2.pdf, the beginning of a newspaper editorial), it starts with the "-time" from "prime-time" in the first line. It also skipped other text in the file. My code is very simple: