Closed ZiggyReQurv closed 7 months ago
Actually, PdfReader does support math symbols, as well as any other Unicode characters. I had a look at this specific file: it uses codes such as \x03, \x01, \x02, etc. to represent math symbols, and these codes do not correspond to the Unicode math symbols. However, the file uses a custom embedded Type 1 font with glyphs for those codes, which is why the symbols are displayed correctly.
I suggest post-processing the output and replacing codes like \x03 with the correct Unicode characters.
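A minimal sketch of that post-processing step, in case it helps. The mapping below is an assumption, guessed from the context of the sample output in this issue (for example, \x01 sits between the factors of "2 alla sesta", so it is probably a multiplication sign); it is not read from the PDF's font and needs to be checked against the actual glyphs. The names CONTROL_TO_UNICODE and fix_math_symbols are hypothetical, not part of the library.

```python
# Hypothetical mapping from the custom font's codes to Unicode.
# These pairs are guesses based on the sample text; verify each one
# against the glyphs in the PDF before relying on them.
CONTROL_TO_UNICODE = {
    "\x01": "\u00b7",  # assumed: multiplication dot
    "\x02": "\u2208",  # assumed: element of
    "\x03": "\u2200",  # assumed: for all
    "\x18": "\u2260",  # assumed: not equal to
}

# Build the translation table once; str.translate then does a
# single pass over the text.
_TABLE = str.maketrans(CONTROL_TO_UNICODE)


def fix_math_symbols(text: str) -> str:
    """Replace the font-specific codes with the assumed Unicode symbols."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    extracted = "a1 = a \x03a \x02 N"  # fragment of the sample output
    print(fix_math_symbols(extracted))
```

The function would be applied to the string returned by text extraction for each page. Because the codes are specific to this file's embedded font, the table has to be rebuilt for any other document.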
I need to extract math symbols like ∈ or ∀ from a PDF (attached): 24_AIE604_C1 Algebra1_extract.pdf
I found (https://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode) that Unicode provides a comprehensive repertoire of mathematical characters.
I wonder if it's possible to make the project output the correct Unicode characters for these symbols. For example, the text is currently converted to:
Per esempio: 2 \x01 2 \x01 2 \x01 2 \x01 2 \x01 2 = 26 (si legge “2 alla sesta”) ; 2 si chiama base della potenza e 6 esponente della potenza. Si pone: a0 = 1 a \x18 0; a1 = a \x03a \x02 N
I would like the conversion of math text to use the correct Unicode characters; I think this would be a great improvement to the library. I will try to work on the library myself, but any suggestions or guidance on the best way to do it would be appreciated. Thanks a lot.