Open MartinThoma opened 1 year ago
@MartinThoma:
Given these two examples above, why did the extraction of:
LibreOffice-Writer.pdf -> The square of x is denoted by x², the cube by x³.
which is perfect :)
But:
pdflatex-x-square.pdf -> x2= 9 means x∈{3,−3}.
A wonderful tool for this kind of analysis is PDFBox in debug view. For the LibreOffice file, when you look at the font used you will see:
For the pdflatex file, however, the font size and position are changed instead.
I haven't analyzed it so far, but I guess that LibreOffice makes use of the Unicode symbol. In contrast, LaTeX changes the font size / position of a normal "2".
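To make the guess above concrete, here is a minimal sketch of how an extractor could turn a smaller, raised "2" back into "²". It is not how pypdf works today. The function name `merge_fragments`, the `(text, font_size, baseline_y)` fragment tuples, and the `size_ratio` threshold are all hypothetical, and how such fragments would be collected from a real PDF is left out.

```python
# Plain characters and their Unicode superscript counterparts.
SUPERSCRIPTS = str.maketrans("0123456789+-=()n", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ")


def merge_fragments(fragments, size_ratio=0.8):
    """Join text fragments, converting raised, smaller ones to superscripts.

    fragments: list of (text, font_size, baseline_y) in reading order.
    This input format is an assumption made for illustration only.
    PDF coordinates grow upwards, so "raised" means a larger baseline_y.
    """
    out = []
    prev_size = prev_y = None
    for text, size, y in fragments:
        is_raised = (
            prev_size is not None
            and size <= prev_size * size_ratio  # noticeably smaller font
            and y > prev_y                      # baseline shifted upwards
        )
        if is_raised:
            out.append(text.translate(SUPERSCRIPTS))
        else:
            out.append(text)
            # Only normal-sized text updates the reference size and baseline.
            prev_size, prev_y = size, y
    return "".join(out)


# Hypothetical fragments resembling the pdflatex case: a 7pt "2" raised above 10pt text.
fragments = [("x", 10.0, 100.0), ("2", 7.0, 103.6), (" = 9", 10.0, 100.0)]
print(merge_fragments(fragments))  # x² = 9
```

A heuristic like this only covers characters that have a Unicode superscript form; anything else in a raised fragment is passed through unchanged.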
This is my example, but the result is empty when extracted:
Screenshot 2023-07-30 at 20.07.16.pdf
Is the text too large, or is it a pixelation issue?
Since it is a screenshot, I thought I might need Tesseract OCR instead? Does it have Python bindings?
@miriam-z Please ask your questions in https://github.com/py-pdf/pypdf/discussions/categories/q-a
Explanation
Superscripts are common in math, especially squares (e.g. x²) and cubes (e.g. x³).
Code Example
Examples with the expected output: