Closed summarizepaper closed 1 year ago
Can you please provide standalone code and the error that was reported?
Actually, I managed to make it work like so:
import aiofiles
import PyPDF2
import io
async with aiofiles.open(pdf_filename, "rb") as f:
    pdf_data = await f.read()
    pdf_stream = io.BytesIO(pdf_data)
    pdf_reader = PyPDF2.PdfFileReader(pdf_stream)
    text = ""
    for page in range(pdf_reader.getNumPages()):
        text += pdf_reader.getPage(page).extractText()
    print('text',text)
But when I print the text extracted from this PDF: https://arxiv.org/pdf/2304.01202v1.pdf
this is how the output begins:
text manuscriptNo.
(willbeinsertedbytheeditor)
WaveMechanics,Interference,andDecoherenceinStrongGravitational
Lensing
CalvinLeung
DylanJow
PrasenjitSaha
LiangDai
MasamuneOguri
Léon
V.E.Koopmans
Abstract
Wave-mechanicale
ectsingravitationallensing
havelongbeenpredicted,andwiththediscoveryofpopula-
tionsofcompacttransientssuchasgravitationalwaveevents
andfastradiobursts,maysoonbeobserved.Wepresentan
observer'sreviewoftherelevanttheoryunderlyingwave-
mechanicale
ectsingravitationallensing.Startingfromthe
curved-spacetimescalarwaveequation,wederivetheFresnel-
Kircho
di
ractionintegral,andanalyzeitintheeikonal
andwaveopticsregimes.Weanswerthequestionofwhat
makesinterferencee
ectsobservableinsomesystemsbut
notinothers,andhowinterferencee
ectsallowforcomple-
mentaryinformationtobeextractedfromlensingsystems
ascomparedtotraditionalmeasurements.Weendbydis-
cussinghowdi
Why is it so bad?
You are using an extremely outdated version. Uninstall PyPDF2. Use pypdf.
I'm closing this issue as there seems nothing to do here.
For completeness, I've re-run the test without text scrambling.
Thanks. I compared the result with pdfminer, which I currently use (but which doesn't work in async), and the formulas produced by pypdf are not as good. The text is also better from pdfminer.six. Am I doing something wrong, or is pdfminer just better?
You are doing something wrong: you are using PyPDF2 and not pypdf, and specifically PyPDF2<=1.26.0. That is more than 7 years old.
No, I meant that I have updated the code to the most recent pypdf and I find it not as good as pdfminer. Here is the new code:
import aiofiles
import pypdf
import io
async with aiofiles.open(pdf_filename, "rb") as f:
    pdf_data = await f.read()
    pdf_stream = io.BytesIO(pdf_data)
    pdf_reader = pypdf.PdfReader(pdf_stream)
    text = ""
    for num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[num]
        text += page.extract_text(0)
@summarizepaper What you write is pretty confusing. pypdf is way better than what you have shared in the excerpt. This is what I get:
manuscript No.
(will be inserted by the editor)
Wave Mechanics, Interference, and Decoherence in Strong Gravitational
Lensing
Calvin Leung Dylan Jow Prasenjit Saha Liang Dai Masamune Oguri Léon
V . E. Koopmans
Abstract Wave-mechanical e ects in gravitational lensing
have long been predicted, and with the discovery of popula-
tions of compact transients such as gravitational wave events
and fast radio bursts, may soon be observed. We present an
observer’s review of the relevant theory underlying wave-
mechanical e ects in gravitational lensing. Starting from the
curved-spacetime scalar wave equation, we derive the Fresnel-
Kircho diraction integral, and analyze it in the eikonal
and wave optics regimes. We answer the question of what
makes interference e ects observable in some systems but
not in others, and how interference e ects allow for comple-
mentary information to be extracted from lensing systems
as compared to traditional measurements. We end by dis-
cussing how di raction e ects a ect optical depth forecasts
and lensing near caustics, and how compact, low-frequency
transients like gravitational waves and fast radio bursts pro-
vide promising paths to open up the frontier of interferomet-
ric gravitational lensing.
Keywords Gravitational lensing, wave optics, gravitational
waves, transients
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 The curved-spacetime scalar wave equation . . . . . . . . . 2
3 Di erent Regimes in Wave Optical Gravitational Lensing . . 5
3.1 Beyond Scalar Wave Optics . . . . . . . . . . . . . . . 7
4 Eikonal Optics . . . . . . . . . . . . . . . . . . . . . . . . . 8
[cropped a lot more]
It's also highly unlikely that there are any issues with reading PDFs asynchronously. Also, I have no idea what error message you mean.
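A minimal sketch of why the asynchronous read itself is unlikely to matter: only the byte read happens off the event loop, and the parser sees nothing but an in-memory io.BytesIO stream. The helper names (read_bytes_async, extract_text, pdf_to_text) are hypothetical, asyncio.to_thread stands in for aiofiles to keep the sketch stdlib-only, and the pypdf calls are the ones already used in this thread.

```python
import asyncio
import io


async def read_bytes_async(path: str) -> bytes:
    """Read a whole file without blocking the event loop (hypothetical helper)."""
    def _read() -> bytes:
        with open(path, "rb") as f:
            return f.read()
    # asyncio.to_thread (Python 3.9+) runs the blocking read in a worker thread.
    return await asyncio.to_thread(_read)


def extract_text(pdf_data: bytes) -> str:
    """Parse PDF bytes that are already in memory; nothing here is async."""
    import pypdf  # third-party; imported lazily so the reader above works without it

    reader = pypdf.PdfReader(io.BytesIO(pdf_data))
    return "".join(page.extract_text() for page in reader.pages)


async def pdf_to_text(path: str) -> str:
    """Read the file asynchronously, then parse the bytes in a worker thread."""
    pdf_data = await read_bytes_async(path)
    # Parsing is CPU-bound, so it is pushed to a thread as well.
    return await asyncio.to_thread(extract_text, pdf_data)
```

It could be called as asyncio.run(pdf_to_text("paper.pdf")), where paper.pdf is a placeholder path. Since the parser receives the same bytes either way, the extracted text should not depend on how the file was read.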
I looked at the difference between pdfminer and pypdf.
Left is pypdf, right is pdfminer:
What pdfminer does better:
What pypdf does better:
@pubpub-zz Do you want to investigate that further?
I did a quick comparison with the extraction from Acrobat Reader. The results are similar:
In my opinion, pypdf is extracting the text as it is described within the PDF.
@MartinThoma I propose closing it.
Acrobat Reader is reading the ff correctly. If pdfminer is able to get ff right, there should be a way. These arXiv PDFs are compiled from LaTeX, which typesets ff as a single ligature glyph (with correct typography), and that is different from having two f characters next to each other. Since pdfminer can do it, it would be odd for pypdf not to. I'd like to use pypdf, but if it cannot read ff and equations correctly, that is a problem.
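To illustrate the difference between a typeset ff ligature and two separate f characters: Unicode has a dedicated code point for the ligature (U+FB00), and NFKC normalization expands it to two letters. This is only a sketch, assuming the extractor emits the ligature code point. It does not help when the glyph is missing from the output altogether, as in the pypdf excerpt earlier in this thread. The function name expand_ligatures is hypothetical.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as the ff ligature (U+FB00)."""
    # NFKC applies compatibility decomposition, then canonical composition.
    return unicodedata.normalize("NFKC", text)


ligature_word = "e\ufb00ects"   # 6 code points: the ligature is one character
plain_word = "effects"          # 7 code points: two separate f characters

print(ligature_word == plain_word)                    # False
print(expand_ligatures(ligature_word) == plain_word)  # True
```

Note that NFKC also rewrites other compatibility characters (superscript digits, for example), so it may be too aggressive for text that contains formulas.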
As for the equations, here is what I get from pypdf for the first two: 0=gabrarb: (1) rb=@b: (2)
and from pdfminer:
0 = gab∇a∇bφ.
(1)
∇bφ = ∂bφ.
(2)
Basically, pypdf is replacing nabla (∇) by r and the partial derivative sign (∂) by @, and phi (φ) either turns into : or is dropped. It doesn't handle Greek letters?
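The substitutions can be listed mechanically rather than by eye. A minimal sketch using difflib from the standard library, with the two equation strings copied from the outputs above and whitespace removed before comparing. The helper name substitutions is hypothetical.

```python
from difflib import SequenceMatcher


def substitutions(got: str, expected: str) -> list[tuple[str, str]]:
    """Return (got, expected) pairs for every span where the two strings differ."""
    got = "".join(got.split())            # ignore whitespace differences
    expected = "".join(expected.split())
    matcher = SequenceMatcher(None, got, expected, autojunk=False)
    return [
        (got[i1:i2], expected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


pypdf_eqs = ["0=gabrarb:", "rb=@b:"]
pdfminer_eqs = ["0 = gab∇a∇bφ.", "∇bφ = ∂bφ."]

for got, expected in zip(pypdf_eqs, pdfminer_eqs):
    print(substitutions(got, expected))
```

On these strings the reported pairs include r against ∇ and @ against ∂. An empty string on the pypdf side of a pair would mean a character missing from its output rather than a substituted one.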
> Acrobat reader is reading the ff correctly.
You have to compare the output from copy/paste, not the display.
> It doesn't handle Greek letters?
Again: compare the output from the clipboard.
Indeed, copy/paste from Acrobat Reader gives what pypdf gives, but the PDF reader on the Mac and the one in the browser show what pdfminer gives. I'm reading PDFs and showing their text on my website, so I really need a library that does the job as well as pdfminer. I don't think Acrobat is the reference anymore, and even Acrobat displays the right ff and equations in the reader. In my opinion, pypdf should do the same and produce text that humans can read and understand.
I'm closing this as "not planned" as we simply don't have anybody to work on this.
However, I will add a new benchmark looking at math text extraction: https://github.com/py-pdf/benchmarks/issues/6
Explanation
I tried to read a PDF file with PyPDF2 in asynchronous mode, but it didn't work.
I tried using the open-source aiofiles library (available on GitHub) with async with aiofiles.open(pdf_filename, 'rb') as file:
but then the PyPDF2 functions would return an error.
Am I doing something wrong, or is async not implemented?