Read pdf asynchronously

summarizepaper commented 1 year ago

Explanation

I tried to read a PDF file with PyPDF2 asynchronously, but it didn't work.

I opened the file with the open-source aiofiles library (available on GitHub), using async with aiofiles.open(pdf_filename, 'rb') as file:

but the PyPDF2 functions then returned an error.

Am I doing something wrong, or is async simply not implemented?

pubpub-zz commented 1 year ago

Can you please provide standalone code and the reported error?

summarizepaper commented 1 year ago

Actually, I managed to make it work like so:

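import asyncio
import io

import aiofiles
import PyPDF2


async def read_pdf(pdf_filename):
    # Read the raw bytes asynchronously, then parse them from memory.
    async with aiofiles.open(pdf_filename, "rb") as f:
        pdf_data = await f.read()
        pdf_stream = io.BytesIO(pdf_data)
        pdf_reader = PyPDF2.PdfFileReader(pdf_stream)
        text = ""
        for page in range(pdf_reader.getNumPages()):
            text += pdf_reader.getPage(page).extractText()
    return text


text = asyncio.run(read_pdf(pdf_filename))  # pdf_filename: path to the PDF
print('text', text)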

But when I print the text from this PDF (https://arxiv.org/pdf/2304.01202v1.pdf), the output begins like this:

text manuscriptNo.
(willbeinsertedbytheeditor)
WaveMechanics,Interference,andDecoherenceinStrongGravitational
Lensing
CalvinLeung

DylanJow

PrasenjitSaha

LiangDai

MasamuneOguri

Léon
V.E.Koopmans
Abstract
Wave-mechanicale

ectsingravitationallensing
havelongbeenpredicted,andwiththediscoveryofpopula-
tionsofcompacttransientssuchasgravitationalwaveevents
andfastradiobursts,maysoonbeobserved.Wepresentan
observer'sreviewoftherelevanttheoryunderlyingwave-
mechanicale

ectsingravitationallensing.Startingfromthe
curved-spacetimescalarwaveequation,wederivetheFresnel-
Kircho

di

ractionintegral,andanalyzeitintheeikonal
andwaveopticsregimes.Weanswerthequestionofwhat
makesinterferencee

ectsobservableinsomesystemsbut
notinothers,andhowinterferencee

ectsallowforcomple-
mentaryinformationtobeextractedfromlensingsystems
ascomparedtotraditionalmeasurements.Weendbydis-
cussinghowdi

Why is it so bad?

MartinThoma commented 1 year ago

You are using an extremely outdated version. Uninstall PyPDF2. Use pypdf.

MartinThoma commented 1 year ago

I'm closing this issue as there seems to be nothing to do here.

pubpub-zz commented 1 year ago

For completeness, I re-ran the test and did not get any scrambled text.

summarizepaper commented 1 year ago

Thanks. I compared it to pdfminer, which I currently use (but which doesn't work in async), and the formulas extracted by pypdf are not as good. The text from pdfminer.six is also better. Am I doing something wrong, or is pdfminer just better?

MartinThoma commented 1 year ago

You are doing something wrong:

You're using PyPDF2 and not pypdf
You're additionally using a super outdated version, probably PyPDF2<=1.26.0. That is more than 7 years old.

summarizepaper commented 1 year ago

No, I meant that I have updated the code to the most recent pypdf, and I still find it not as good as pdfminer. Here is the new code:

import io

import aiofiles
import pypdf


async def read_pdf(pdf_filename):
    # Read the raw bytes asynchronously, then parse them from memory.
    async with aiofiles.open(pdf_filename, "rb") as f:
        pdf_data = await f.read()
        pdf_stream = io.BytesIO(pdf_data)
        pdf_reader = pypdf.PdfReader(pdf_stream)
        text = ""
        for num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[num]
            text += page.extract_text(0)  # 0 = only text oriented up
    return text

MartinThoma commented 1 year ago

@summarizepaper What you write is pretty confusing. pypdf is way better than what you have shared in the excerpt. This is what I get:

manuscript No.
(will be inserted by the editor)
Wave Mechanics, Interference, and Decoherence in Strong Gravitational
Lensing
Calvin Leung Dylan Jow Prasenjit Saha Liang Dai Masamune Oguri Léon
V . E. Koopmans
Abstract Wave-mechanical e ects in gravitational lensing
have long been predicted, and with the discovery of popula-
tions of compact transients such as gravitational wave events
and fast radio bursts, may soon be observed. We present an
observer’s review of the relevant theory underlying wave-
mechanical e ects in gravitational lensing. Starting from the
curved-spacetime scalar wave equation, we derive the Fresnel-
Kircho diraction integral, and analyze it in the eikonal
and wave optics regimes. We answer the question of what
makes interference e ects observable in some systems but
not in others, and how interference e ects allow for comple-
mentary information to be extracted from lensing systems
as compared to traditional measurements. We end by dis-
cussing how di raction e ects a ect optical depth forecasts
and lensing near caustics, and how compact, low-frequency
transients like gravitational waves and fast radio bursts pro-
vide promising paths to open up the frontier of interferomet-
ric gravitational lensing.
Keywords Gravitational lensing, wave optics, gravitational
waves, transients
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 The curved-spacetime scalar wave equation . . . . . . . . . 2
3 Di erent Regimes in Wave Optical Gravitational Lensing . . 5
3.1 Beyond Scalar Wave Optics . . . . . . . . . . . . . . . 7
4 Eikonal Optics . . . . . . . . . . . . . . . . . . . . . . . . . 8
[cropped a lot more]

It's also highly unlikely that there are any issues with reading a PDF asynchronously. Also, I have no idea what error message you mean.
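For reference, a minimal sketch of one way to call synchronous parsing from async code without aiofiles. It assumes Python 3.9+ (for asyncio.to_thread) and an installed pypdf; the function names extract_text_sync and extract_text_async are made up for this example. Both the file read and the parsing run in a worker thread, so the event loop is not blocked while pypdf works.

```python
import asyncio
import io
from pathlib import Path


def extract_text_sync(pdf_bytes: bytes) -> str:
    """Blocking extraction with pypdf (imported lazily, only when called)."""
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() for page in reader.pages)


async def extract_text_async(path, extractor=extract_text_sync) -> str:
    """Read the file and run the blocking extractor in worker threads."""
    pdf_bytes = await asyncio.to_thread(Path(path).read_bytes)
    return await asyncio.to_thread(extractor, pdf_bytes)


# Usage (pdf_filename is a placeholder for a real path):
# text = asyncio.run(extract_text_async(pdf_filename))
```

The extractor is a parameter so that another synchronous library could be swapped in the same way; nothing here makes the parsing itself asynchronous.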

Comparison of pypdf and pdfminer

I looked at the difference between pdfminer and pypdf.

Left is pypdf, right is pdfminer:

What pdfminer does better:

Math mode ("formula") is extracted better
It handles ligatures better (ff)

What pypdf does better:

Extracting the "arxiv" text on the side
Extracting the "contents" section
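A comparison like this can be roughly reproduced in text form by diffing the two extractions line by line. The sketch below uses only difflib from the standard library. The function name compare_extractions is made up, and the two sample strings are stand-ins: the first is a line from the pypdf output above, the second assumes pdfminer restores the ligature. Neither is actual tool output captured for this example.

```python
import difflib


def compare_extractions(a: str, b: str, name_a: str = "pypdf", name_b: str = "pdfminer"):
    """Return a similarity ratio and a unified diff of two extracted texts."""
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    diff = list(
        difflib.unified_diff(
            a.splitlines(), b.splitlines(), fromfile=name_a, tofile=name_b, lineterm=""
        )
    )
    return ratio, diff


pypdf_text = "Abstract Wave-mechanical e ects in gravitational lensing"
pdfminer_text = "Abstract Wave-mechanical effects in gravitational lensing"  # assumed

ratio, diff = compare_extractions(pypdf_text, pdfminer_text)
print(f"similarity: {ratio:.3f}")
print("\n".join(diff))
```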

@pubpub-zz Do you want to investigate that further?

pubpub-zz commented 1 year ago

> @pubpub-zz Do you want to investigate that further?

I did a quick comparison with the text extracted by Acrobat Reader. The results are similar:

the ff digraph is not extracted as ff there either (same as pypdf) => the PDF does not include the correct translation of the digraph character
the "." between the authors is extracted as a special character (same as pypdf) => the PDF does not include the correct translation of that character
the arXiv text is extracted at the same position and as a single sentence.

My opinion is that pypdf is extracting the text as it is described within the pdf.

pubpub-zz commented 1 year ago

@MartinThoma I propose to close it

summarizepaper commented 1 year ago

Acrobat Reader is reading the ff correctly. If pdfminer is able to get ff right, there should be a way. These arXiv PDFs are compiled from LaTeX, which produces a properly typeset ff ligature that is different from two separate f characters next to each other. Since pdfminer can do it, it would be odd for pypdf not to. I'd like to use pypdf, but if it cannot read ff and equations correctly, that's a problem.
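To make the distinction concrete: Unicode has a dedicated code point for the ff ligature (U+FB00), and compatibility normalization folds it back into two plain letters. Here is a small sketch with the standard-library unicodedata module. The sample string is made up, and this only helps when an extractor actually returns the ligature character. It cannot recover a glyph that comes out as nothing at all, as in the "e ects" output above.

```python
import unicodedata

ligature_text = "di\ufb00raction e\ufb00ects"  # contains U+FB00 LATIN SMALL LIGATURE FF

# NFKC replaces compatibility characters such as ligatures with their plain equivalents.
plain_text = unicodedata.normalize("NFKC", ligature_text)

print(unicodedata.name("\ufb00"))
print(len(ligature_text), len(plain_text))
print(plain_text)
```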

For the equations, here is what I get from pypdf for the first two equations: 0=gabrarb: (1) rb=@b: (2)

and from pdfminer:

0 = gab∇a∇bφ.

(1)

∇bφ = ∂bφ.

(2)

Basically, pypdf is replacing nabla by r and the partial derivative by @, dropping phi, and turning the final period into a colon. Doesn't it handle Greek letters?
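One possible explanation, offered as an assumption and not verified against this file: if the equations are set in TeX's Computer Modern math fonts and the PDF carries no Unicode mapping for them, an extractor that falls back to the raw character codes prints whatever ASCII character shares the glyph's slot. Plain TeX places nabla at slot 0x72 of the symbol font, and the partial-derivative sign at 0x40 and the math-mode period at 0x3A of the math-italic font. The table below is hand-written for illustration (the font labels are informal, and ascii_fallback is a made-up helper); the loop prints the ASCII character at each slot.

```python
# Hand-written excerpt of plain TeX's math glyph slots (illustrative, not read from the PDF).
TEX_MATH_SLOTS = {
    ("cmsy", 0x72): "∇",  # \nabla
    ("cmmi", 0x40): "∂",  # \partial
    ("cmmi", 0x3A): ".",  # period in math mode
}


def ascii_fallback(slot: int) -> str:
    """What a reader prints if it treats the slot number as an ASCII code."""
    return chr(slot)


for (font, slot), glyph in TEX_MATH_SLOTS.items():
    print(f"{font} 0x{slot:02X}: intended {glyph!r}, ASCII fallback {ascii_fallback(slot)!r}")
```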

pubpub-zz commented 1 year ago

> Acrobat Reader is reading the ff correctly.

You have to compare the output from copy/paste, not what is displayed.

> Basically, pypdf is replacing nabla by r and the partial derivative by @, dropping phi, and turning the final period into a colon. Doesn't it handle Greek letters?

Again: compare the output from the clipboard.

summarizepaper commented 1 year ago

Indeed, a copy/paste from Acrobat Reader gives what pypdf gives, but the PDF readers on the Mac and in the browser display what pdfminer gives. I'm reading the PDFs and showing them on my website, so I really need a library that does the job as well as pdfminer. I don't think Acrobat is a reference anymore; at least it shows the right ff and equations in the reader, and in my opinion the pypdf "Python" reader should do the same and give text that humans can read and understand.

MartinThoma commented 1 year ago

I'm closing this as "not planned" as we simply don't have anybody to work on this.

However, I will add a new benchmark looking at math text extraction: https://github.com/py-pdf/benchmarks/issues/6

py-pdf / pypdf

Read pdf asynchronously #1789
