py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Fixing Text Extraction Order For Arabic+Digits+Punctuation #1629

Open naourass opened 1 year ago

naourass commented 1 year ago

Explanation

When you have Arabic text mixed with digits, the text extraction order is messed up. Below is an example.

  1. Reading from right to left, here's the ground truth of a file with two blocks:
  2. Here's how the pdf is rendered:

image

  3. Here's the result of page.extract_text(): (2023 0 ﻳﻨﺎﻳ18) 1444 ة0 ﺟﻤﺎدى اﻵﺧ2 5161 ﺋﻴﴘ - ﻋﺪد0ﻢ اﻟ5اﻟﻘ

Attachments:

pubpub-zz commented 1 year ago

@naourass, at first sight (but maybe I'm wrong), you should have a look at the concatenation of output and text (below the if check_crlf_space check, lines 1871 and below).

Tell me if you want to try proposing a PR.

naourass commented 1 year ago

@pubpub-zz From my first analysis, I think that the concatenation flow should be changed to handle more cases. I'm also inspecting whether it would be possible to fix this using Control Characters.

pubpub-zz commented 1 year ago

> @pubpub-zz From my first analysis, I think that the concatenation flow should be changed to handle more cases.

That's clearly an option to look at.

> I'm also inspecting whether it would be possible to fix this using Control Characters.

I'm not sure all programs will handle that. I would prefer not to use this if possible.
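To make the trade-off concrete, here is a minimal sketch of what the control-character approach could look like, assuming "Control Characters" refers to Unicode directional formatting characters. The helpers wrap_rtl and strip_bidi_controls are hypothetical names for illustration and are not part of pypdf.

```python
# Assumption: the extracted right-to-left run is wrapped in an isolate pair.
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Unicode directional formatting characters (marks, embeddings, overrides, isolates).
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def wrap_rtl(text: str) -> str:
    """Hypothetical helper: mark a run as right-to-left with an isolate pair."""
    return f"{RLI}{text}{PDI}"


def strip_bidi_controls(text: str) -> str:
    """Hypothetical helper: remove directional formatting characters."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)
```

A consumer that does not implement bidirectional rendering would see the two invisible characters as extra content in the string, which is the kind of compatibility concern raised above; strip_bidi_controls shows the cleanup such a consumer would need.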

naourass commented 1 year ago

There's also a decoding issue for some characters. To focus on inspecting the concatenation order issue, I'm manually overriding them by adding a temporary cmap_override argument to extract_text():

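# _page.py (temporary debugging patch inside the text-extraction loop)
for x in t:
    hex_x = hex(ord(x))
    # Replace the decoded value for any code point listed in cmap_override.
    if hex_x in cmap_override:
        cmap[1][x] = cmap_override[hex_x]
    # Print code point, hex value, raw character, and decoded value ("-" if unmapped).
    print(ord(x), hex_x, x, cmap[1][x] if x in cmap[1] else "-", sep="\t")

# my-app.py
# Keys are hex(ord(char)) strings; commented-out entries are other candidates.
cmap_override = {
    "0x27f": "سي",
    # "0x3": " ",
    # "0x206": "ا",
    # "0x273": "ن",
}
text = page.extract_text(cmap_override=cmap_override)

naourass commented 1 year ago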

@pubpub-zz I have an update regarding this issue.

I'm not a BiDi expert (yet), but after further inspection, here's my humble conclusion so far:

There might still be heuristic indicators or other approaches for detecting the overall direction that I haven't found yet. I'll investigate this further when possible and report back if I find anything useful.
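As an illustration of what such a heuristic indicator could look like, here is a minimal sketch that guesses the direction of a run by counting characters with strong bidirectional classes. This is an assumption about one possible heuristic, not something proposed in this thread or implemented in pypdf; guess_base_direction is a hypothetical name.

```python
import unicodedata


def guess_base_direction(text: str) -> str:
    """Return "rtl", "ltr", or "neutral" from a count of strong bidi classes."""
    rtl = 0
    ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # strong right-to-left (Hebrew, Arabic letters)
            rtl += 1
        elif bidi_class == "L":        # strong left-to-right (Latin letters, etc.)
            ltr += 1
        # Digits, spaces, and punctuation are weak or neutral, so they are ignored.
    if rtl > ltr:
        return "rtl"
    if ltr > rtl:
        return "ltr"
    return "neutral"
```

Note that this majority count is a simplification: the Unicode Bidirectional Algorithm determines the paragraph direction from the first strong character instead, and a run made only of digits and punctuation stays "neutral" under either rule, so it would still need context from neighbouring runs.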

MartinThoma commented 1 year ago

Thank you for looking into this topic 💙

> or implement a machine learning model to predict it

Adding machine learning to pypdf seems out of scope to me. Adding a hook for external code / another library would be fine with me.

naourass commented 1 year ago

@pubpub-zz @MartinThoma After more experimentation, it looks much simpler to drop the RTL direction checks, process everything as LTR to produce the "logical" version of the text (except for ligatures and paired punctuation such as ()[]{}«»), and let the user call bidi.get_display() to easily get the visual order!

I've started working on an implementation example. I'll let you know when it's ready for review.
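The thread does not spell out how ligatures would be handled. One plausible reading, stated here purely as an assumption, is that Arabic presentation forms and ligatures are expanded into base letters in logical order. A minimal standard-library sketch follows; expand_presentation_forms is a hypothetical name and not part of pypdf.

```python
import unicodedata


def expand_presentation_forms(text: str) -> str:
    """Expand Arabic presentation forms and ligatures into base letters.

    Only code points in Arabic Presentation Forms-A (U+FB50..U+FDFF) and
    Forms-B (U+FE70..U+FEFC) are normalized, so unrelated compatibility
    characters such as superscripts are left untouched.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
            # NFKC applies the compatibility decomposition, which lists the
            # component letters in logical (reading) order.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

For example, the lam-alef ligature U+FEFB expands to U+0644 followed by U+0627. The resulting logical-order string could then be passed to a bidi implementation such as the get_display call mentioned above to obtain the visual order.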