py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

ENH: Extract superscripts (x² instead of x2) #2045

Open MartinThoma opened 1 year ago

MartinThoma commented 1 year ago

Explanation

Superscripts are common in math, especially squares (e.g. x²) and cubes (e.g. x³).

Code Example

from pypdf import PdfReader

# Open the PDF and print the text extracted from its first page.
reader = PdfReader("example.pdf")
print(reader.pages[0].extract_text())

Examples with the expected output:

Filename               | Currently extracted     | Expected
---------------------- | ----------------------- | -----------------------
pdflatex-x-square.pdf  | x2= 9 means x∈{3,−3}.   | x²= 9 means x∈{3,−3}.
LibreOffice-Writer.pdf | The square of x is denoted by x², the cube by x³. | Already as expected 🎉

miriam-z commented 1 year ago

@MartinThoma:

Given the two examples above, why is the extraction of LibreOffice-Writer.pdf perfect :)

LibreOffice-Writer.pdf -> The square of x is denoted by x², the cube by x³.

but the extraction of pdflatex-x-square.pdf loses the superscript?

pdflatex-x-square.pdf -> x2= 9 means x∈{3,−3}.

pubpub-zz commented 1 year ago

A wonderful tool for this kind of analysis is PDFBox in debug view. For the LibreOffice file, when you look at the font used, you will see: [image]

For the pdflatex file, however, the font size and position are changed: [image]

MartinThoma commented 1 year ago

I haven't analyzed it so far, but I guess that LibreOffice uses the Unicode superscript symbol. In contrast, LaTeX changes the font size / position of a normal "2".
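To illustrate that guess, here is a rough sketch of how a caller could approximate superscripts today. It is not how pypdf works internally. It assumes that a superscript shows up as a text fragment with a noticeably smaller font size and a higher baseline than the fragment before it. The names `merge_fragments`, `extract_with_superscripts`, and the `size_ratio` threshold are made up for this example. The `visitor_text` callback of `extract_text()` is existing pypdf API, but whether the reported font size and `tm[5]` reflect the effective size and position depends on how the PDF was produced.

```python
# Characters that have a Unicode superscript form (only a small subset does).
SUPERSCRIPTS = str.maketrans("0123456789+-=()n", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ")


def merge_fragments(fragments, size_ratio=0.85):
    """Join (text, font_size, y) fragments given in reading order.

    Heuristic (an assumption, not pypdf behavior): a fragment counts as a
    superscript if its font is smaller than size_ratio times the last
    normal fragment's font and its baseline (y) is higher.
    """
    out = []
    base_size = base_y = None
    for text, size, y in fragments:
        if not text.strip():
            # Keep whitespace as-is and do not let it change the baseline.
            out.append(text)
            continue
        is_super = (
            base_size is not None
            and size < base_size * size_ratio
            and y > base_y
        )
        if is_super:
            out.append(text.translate(SUPERSCRIPTS))
        else:
            out.append(text)
            base_size, base_y = size, y
    return "".join(out)


def extract_with_superscripts(path, page_number=0):
    """Collect fragments via pypdf's visitor_text hook and merge them."""
    from pypdf import PdfReader  # lazy import: merge_fragments works without pypdf

    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm[5] is the vertical offset of the text matrix.
        fragments.append((text, font_size, tm[5]))

    PdfReader(path).pages[page_number].extract_text(visitor_text=visitor)
    return merge_fragments(fragments)


# Hand-written fragments that mimic the pdflatex case (values are illustrative).
sample = [("x", 9.96, 700.0), ("2", 6.97, 703.6), ("= 9", 9.96, 700.0)]
print(merge_fragments(sample))
```

Characters without a Unicode superscript form are left unchanged, and the same heuristic would also convert raised footnote markers, so a real implementation would need more care.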

miriam-z commented 1 year ago

This is my example, but the extracted text is empty?

Screenshot 2023-07-30 at 20.07.16.pdf

Is the text too large, or is it a pixelation issue?

MartinThoma commented 1 year ago

It's just an image: https://pypdf.readthedocs.io/en/stable/user/extract-text.html#ocr-vs-text-extraction
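A quick way to check for this case is sketched below, assuming that a page with no extractable text but at least one embedded image is a scan or screenshot. The helper names `looks_image_only` and `ocr_first_image` are hypothetical. `page.images` and `extract_text()` are pypdf API; the OCR step relies on the third-party `pytesseract` package plus an installed Tesseract binary, neither of which is part of pypdf.

```python
def looks_image_only(page):
    """Return True if the page has no extractable text but contains images.

    `page` is expected to behave like a pypdf PageObject, i.e. provide
    extract_text() and an `images` sequence.
    """
    text = page.extract_text() or ""
    return not text.strip() and len(page.images) > 0


def ocr_first_image(page):
    """Run OCR on the first embedded image (needs pytesseract and Tesseract)."""
    import pytesseract  # lazy import: only required when OCR is requested

    # In pypdf, each entry of page.images exposes the picture as a PIL image
    # through its .image attribute.
    return pytesseract.image_to_string(page.images[0].image)
```

Usage would look like `page = PdfReader("scan.pdf").pages[0]`, then calling `ocr_first_image(page)` only if `looks_image_only(page)` is true.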

miriam-z commented 1 year ago

So for a screenshot, I would need Tesseract OCR instead? Does it have a Python interface?

MartinThoma commented 1 year ago

@miriam-z Please ask your questions in https://github.com/py-pdf/pypdf/discussions/categories/q-a