Open jsvine opened 1 year ago
Hi @jsvine, is there a workaround for this in the meantime?
Can I manually apply a normalize function to all text in the PDF?
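One possible workaround, as a minimal sketch rather than anything built into pdfplumber: post-process the extracted text with the standard library's `unicodedata.normalize`. The helper names `normalize_text` and `normalize_chars` are hypothetical, and the default form `NFKC` is only an assumption; pick the form that suits your data.

```python
import unicodedata


def normalize_text(text, form="NFKC"):
    """Return `text` normalized to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


def normalize_chars(chars, form="NFKC"):
    """Return copies of char dicts with their "text" value normalized.

    Assumes each item is a dict with a "text" key, which is how
    pdfplumber exposes characters in `page.chars`.
    """
    return [{**c, "text": unicodedata.normalize(form, c["text"])} for c in chars]


# Hypothetical usage with pdfplumber (not executed here):
#   with pdfplumber.open("example.pdf") as pdf:
#       text = normalize_text(pdf.pages[0].extract_text())
```

Because this runs after extraction, it still has to be called explicitly on every piece of text you read.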
Hi @agusluques, and thanks for checking. There have not been any updates on this, but there may still be a solution for certain use-cases. What's your particular use-case?
@jsvine thanks for the answer. Basically, I am trying to split on ; (U+003B), but the PDF seems to contain a different character, the Greek question mark (U+037E), which looks the same. I am doing some manual replacement, but it would be great to have this done at the moment of reading the PDF, so there is no risk of my forgetting to include the cleaning logic.
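For this particular case, a short sketch of what normalization does: U+037E is canonically equivalent to U+003B, so any of the four normalization forms replaces it with an ordinary semicolon. The sample string and the variable names below are made up for illustration.

```python
import unicodedata

# Made-up sample: fields separated by U+037E (Greek question mark),
# which renders like a semicolon but is a different code point.
raw = "alpha\u037e beta\u037e gamma"

# Splitting on the ASCII semicolon finds nothing to split on.
unsplit = raw.split(";")

# After normalization, U+037E has become U+003B and the split works.
normalized = unicodedata.normalize("NFC", raw)
fields = [part.strip() for part in normalized.split(";")]

print(unsplit)
print(fields)
```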
The definitive rules are defined in the Unicode spec (https://unicode.org/reports/tr15/). It needs careful reading ("Taken step-by-step, the Unicode Normalization Algorithm is fairly complex"). It specifically discusses the Greek question mark. There are different formal approaches:
The four Unicode Normalization Forms are summarized in Table 1.
Table 1. Normalization Forms https://unicode.org/reports/tr15/#Normalization_Forms_Table

| Form | Description |
| --- | --- |
| Normalization Form D (NFD) | Canonical Decomposition |
| Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
| Normalization Form KD (NFKD) | Compatibility Decomposition |
| Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |
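To make the four forms concrete, here is a small sketch that applies each of them to a few sample characters and prints the resulting code points. The helper names `code_points` and `all_forms` and the choice of samples are illustrative assumptions, not part of the spec.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def all_forms(text):
    """Map each normalization form name to the normalized string."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


samples = [
    ("e with acute (precomposed)", "\u00e9"),
    ("fi ligature (compatibility character)", "\ufb01"),
    ("Greek question mark", "\u037e"),
]

for label, sample in samples:
    print(label)
    for form, result in all_forms(sample).items():
        print(f"  {form}: {code_points(result)}")
```

The ligature is only expanded by the compatibility forms (NFKD, NFKC), whereas the Greek question mark is replaced by all four, since its equivalence to the semicolon is canonical.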
===== 10 Respecting Canonical Equivalence https://unicode.org/reports/tr15/#Canonical_Equivalence
This section describes the relationship of normalization to respecting (or preserving) canonical equivalence. A process (or function) respects canonical equivalence when canonical-equivalent inputs always produce canonical-equivalent outputs. For a function that transforms one string into another, this may also be called preserving canonical equivalence. There are a number of important aspects to this concept:
<<< It's important we adhere precisely to Unicode terminology and philosophy
For me (a crystallographer) it's the equivalence between Aring and Angstrom (which are frequently misused). Note that Aring is further complicated and may have to be normalised: 0041 (A) + 030A (combining ring) => 00C5 (Aring).
The problems frequently arise when authors pick symbols from menus without realising what character results.
There are a lot of further illiteracies which probably can't be dealt with, e.g. em-dash for minus
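A sketch of both points, using Python's `unicodedata`: NFC maps the Angstrom sign (U+212B) and the sequence A + combining ring to the same Aring (U+00C5), whereas no normalization form turns an em-dash into a minus sign. The `dash_to_minus` helper is a hypothetical heuristic, not a Unicode rule: it assumes that a dash placed directly before a digit, and not preceded by other text, was meant as a minus, and it can still guess wrong.

```python
import re
import unicodedata

ANGSTROM_SIGN = "\u212b"  # ANGSTROM SIGN
A_RING = "\u00c5"         # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_RING = "A\u030a"   # A followed by COMBINING RING ABOVE

# All three spellings collapse to the single code point U+00C5 under NFC.
nfc_results = {
    unicodedata.normalize("NFC", s)
    for s in (ANGSTROM_SIGN, A_RING, A_PLUS_RING)
}

EM_DASH = "\u2014"
MINUS_SIGN = "\u2212"

# Normalization leaves the em-dash untouched, even in the compatibility form.
em_dash_nfkc = unicodedata.normalize("NFKC", EM_DASH)


def dash_to_minus(text):
    """Heuristic (an assumption): an en/em-dash at the start of the text,
    or after whitespace or "(", and directly before a digit becomes
    U+2212 MINUS SIGN."""
    return re.sub(r"(?<![^\s(])[\u2013\u2014](?=\d)", MINUS_SIGN, text)


print(nfc_results == {A_RING})
print(em_dash_nfkc == EM_DASH)
print(dash_to_minus("x = \u20145.2"))
```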
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Per @petermr's suggestion in https://github.com/jsvine/pdfplumber/discussions/904#discussioncomment-6149469, I think it's a good idea to add such a parameter/option, using `unicodedata.normalize(...)`, in a similar vein to the `expand_ligatures` parameter added in v0.9.0. I'll look into this.

Some useful reference links, as a note-to-self: