maxpmaxp / pdfreader

Python API for PDF documents
MIT License
113 stars, 26 forks

Extract Math symbols from PDF #115

Closed ZiggyReQurv closed 4 months ago

ZiggyReQurv commented 5 months ago

I need to extract math symbols such as ∈ or ∀ from a PDF (attached: 24_AIE604_C1 Algebra1_extract.pdf).

According to https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode, Unicode provides a comprehensive repertoire of mathematical characters.

I wonder whether the project could map these symbols to their Unicode code points. For example, the text in the attached screenshot (Screenshot 2024-04-06 alle 10 24 06) is currently converted to: Per esempio: 2 \x01 2 \x01 2 \x01 2 \x01 2 \x01 2 = 26 (si legge “2 alla sesta”) ; 2 si chiama base della potenza e 6 esponente della potenza. Si pone: a0 = 1 a \x18 0; a1 = a \x03a \x02 N

I would like the converted math text to use the correct Unicode characters; I think this would be a great improvement to the library. I will try to work on the library myself, but any suggestions or guidance on the best way to do it would be appreciated. Thanks a lot.

maxpmaxp commented 4 months ago

Actually, PdfReader does support math symbols, as well as any other Unicode characters. I had a look at this specific file: it uses codes such as \x03, \x01 and \x02 to represent math symbols, and these codes do not correspond to the Unicode math symbols. However, the file uses a custom embedded Type 1 font with glyphs for those codes, which is why they render correctly.

I suggest post-processing the output and replacing codes like \x03 with the correct Unicode characters.
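A minimal sketch of that post-processing step, using only the Python standard library. The mapping below is an assumption inferred from the sample output in this thread (for instance, "\x03a \x02 N" reading as "∀a ∈ N"); it is not taken from the font itself, so each entry should be checked against the rendered PDF. The names `CONTROL_TO_UNICODE` and `fix_math_symbols` are hypothetical. It also assumes the extracted strings contain the actual control characters rather than a literal backslash followed by "x03".

```python
# Hypothetical mapping from the font's custom codes to Unicode.
# Every entry is a guess based on the sample text; verify each one
# against how the PDF actually renders before relying on it.
CONTROL_TO_UNICODE = {
    "\x01": "\u00b7",  # assumed: middle dot used as a multiplication sign
    "\x02": "\u2208",  # assumed: element of
    "\x03": "\u2200",  # assumed: for all
    "\x18": "\u2260",  # assumed: not equal to
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(CONTROL_TO_UNICODE)


def fix_math_symbols(text: str) -> str:
    """Replace the custom control codes in extracted text with Unicode symbols."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "a1 = a \x03a \x02 N"
    print(fix_math_symbols(sample))  # prints: a1 = a ∀a ∈ N
```

The function can be applied to each string returned by the extraction step. Because the codes are specific to the embedded font, a different PDF would likely need a different table.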