Add `normalize_unicode=False/True` parameter to text extraction methods

jsvine commented 1 year ago

Per @petermr's suggestion in https://github.com/jsvine/pdfplumber/discussions/904#discussioncomment-6149469, I think it's a good idea to add such a parameter/option, using unicodedata.normalize(...) — in a similar vein to the expand_ligatures parameter added in v0.9.0. I'll look into this.

Some useful reference links, as a note-to-self:

agusluques commented 2 weeks ago

Hi @jsvine, is there a workaround for this in the meantime?

Can I manually apply a normalize function to all text in the PDF?

jsvine commented 1 week ago

Hi @agusluques, and thanks for checking. There have not been any updates on this, but there may still be a solution for certain use-cases. What's your particular use-case?

agusluques commented 1 week ago

@jsvine thanks for the answer. Basically, I am trying to split on ; (U+003B), but the PDF seems to contain a different character, ; (U+037E, the Greek question mark). I am doing a manual replacement for now, but it would be great to have this applied at the moment the PDF is read, so there is no risk of forgetting to include the cleaning logic.
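
For illustration, here is a minimal sketch of that kind of manual post-processing, assuming the text has already been extracted (for example via `page.extract_text()`). The helper name `split_fields` is hypothetical and not part of pdfplumber. It relies on the fact that U+037E is canonically equivalent to U+003B, so any of the standard normalization forms maps it to the ASCII semicolon.

```python
import unicodedata

def split_fields(text, sep=";", form="NFC"):
    """Hypothetical helper: normalize extracted text, then split on an ASCII separator."""
    normalized = unicodedata.normalize(form, text)
    return [part.strip() for part in normalized.split(sep)]

# Stand-in for text returned by an extraction call; it uses U+037E, not U+003B.
raw = "alpha\u037e beta\u037e gamma"

print(raw.split(";"))     # a single item: U+037E is a different code point
print(split_fields(raw))  # three items once the text has been normalized
```

Wrapping the normalization and the split in one function is one way to avoid forgetting the cleaning step, at the cost of deciding on a normalization form up front.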

petermr commented 1 week ago

The definitive rules are defined in the Unicode spec (https://unicode.org/reports/tr15/). It needs careful reading ("Taken step-by-step, the Unicode Normalization Algorithm is fairly complex"). It specifically discusses the Greek question mark. There are different formal approaches:

The four Unicode Normalization Forms are summarized in Table 1.

Table 1. Normalization Forms (https://unicode.org/reports/tr15/#Normalization_Forms_Table)

| Form | Description |
| --- | --- |
| Normalization Form D (NFD) | Canonical Decomposition |
| Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
| Normalization Form KD (NFKD) | Compatibility Decomposition |
| Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |
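
To make the difference between the four forms concrete, here is a small sketch using Python's standard `unicodedata` module. The sample string is an arbitrary choice for illustration: a compatibility ligature (U+FB01) followed by "e" plus a combining acute accent (U+0301).

```python
import unicodedata

# U+FB01 (the "fi" ligature) followed by "e" + U+0301 (combining acute accent).
sample = "\ufb01e\u0301"

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFD", "NFC", "NFKD", "NFKC")
}

for name, value in forms.items():
    # Print code points rather than glyphs, since the glyphs look alike.
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

The canonical forms (NFD, NFC) leave the ligature alone and only decompose or compose the accented letter; the compatibility forms (NFKD, NFKC) also expand the ligature into "f" and "i".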

===== 10 Respecting Canonical Equivalence https://unicode.org/reports/tr15/#Canonical_Equivalence

This section describes the relationship of normalization to respecting (or preserving) canonical equivalence. A process (or function) respects canonical equivalence when canonical-equivalent inputs always produce canonical-equivalent outputs. For a function that transforms one string into another, this may also be called preserving canonical equivalence. There are a number of important aspects to this concept:

- The outputs are not required to be identical, only canonically equivalent.
- Not all processes are required to respect canonical equivalence. For example:
  - A function that collects a set of the General_Category values present in a string will and should produce a different value for <angstrom sign, semicolon> than for <A, combining ring above, greek question mark>, even though they are canonically equivalent.
  - A function that does a binary comparison of strings will also find these two sequences different.
- Higher-level processes that transform or compare strings, or that perform other higher-level functions, must respect canonical equivalence or problems will result.
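
The two example sequences from the quoted passage can be checked directly. The following is only a sketch with Python's `unicodedata` module; the helper `categories` is a name made up for this example.

```python
import unicodedata

a = "\u212b\u003b"   # angstrom sign, semicolon
b = "A\u030a\u037e"  # A, combining ring above, greek question mark

def categories(s):
    """Collect the set of General_Category values present in a string."""
    return {unicodedata.category(ch) for ch in s}

# Binary comparison and General_Category collection both see a difference.
print(a == b)
print(categories(a), categories(b))

# After normalization to the same form, the two strings compare equal.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))
```

This is the reason a higher-level comparison should normalize both sides to the same form first rather than compare raw code points.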

<<< It's important we adhere precisely to Unicode terminology and philosophy

For me (a crystallographer) it's the equivalence between Aring and Angstrom (which are frequently misused). Note that Aring is further complicated and may itself have to be normalised: 0041 (A) + 030A (combining ring) => 00C5 (Aring).
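
As a sketch of that equivalence, the three spellings below all normalize to U+00C5 under NFC when using Python's `unicodedata`. The dictionary labels are informal descriptions chosen for this example, not official character names.

```python
import unicodedata

candidates = {
    "angstrom sign": "\u212b",
    "A with ring above": "\u00c5",
    "A + combining ring": "A\u030a",
}

for label, text in candidates.items():
    nfc = unicodedata.normalize("NFC", text)
    before = [f"U+{ord(ch):04X}" for ch in text]
    after = [f"U+{ord(ch):04X}" for ch in nfc]
    print(f"{label}: {before} -> {after}")
```

Under NFD, by contrast, all three become the two-code-point sequence 0041 + 030A, so the choice of form decides which spelling downstream code will see.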

The problems frequently arise when authors pick symbols from menus without realising what character results.

There are a lot of further illiteracies which probably can't be dealt with by normalization, e.g. em-dash for minus.
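
A short sketch of why normalization alone does not help with dashes: U+2014 (em dash) and U+2212 (minus sign) have no decomposition mapping, so even NFKC leaves them untouched. Any folding has to be an explicit, domain-specific choice. The table `DASH_TO_HYPHEN_MINUS` and the helper `fold_dashes` are hypothetical names for illustration, and the set of characters folded is an assumption rather than a recommendation.

```python
import unicodedata

# Assumed, lossy mapping: em dash, en dash and minus sign to ASCII hyphen-minus.
DASH_TO_HYPHEN_MINUS = str.maketrans({"\u2014": "-", "\u2013": "-", "\u2212": "-"})

def fold_dashes(text, form="NFKC"):
    """Normalize first, then apply the explicit dash mapping."""
    return unicodedata.normalize(form, text).translate(DASH_TO_HYPHEN_MINUS)

value = "\u20145.2"  # an em dash typed where a minus was intended

print(unicodedata.normalize("NFKC", value) == value)  # normalization changes nothing here
print(fold_dashes(value))
```

Because this folding discards a real distinction between characters, it is probably better kept as an opt-in step separate from Unicode normalization proper.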


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
