pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License
5.61k stars, 906 forks

Supporting RTL Languages #515

Open mkhashoggi opened 3 years ago

mkhashoggi commented 3 years ago

Feature request

vaknin commented 3 years ago

This is important for Hebrew as well.

pietermarsman commented 3 years ago

Hi @mkhashoggi, thanks for the suggestion and the PR. I need some time to familiarize myself with the logic and rules of RTL languages to be able to review the PR.

pietermarsman commented 3 years ago

I did some reading about bi-directional text to understand the changes that we need to make. Please correct me if I'm wrong. I've no experience with right-to-left text, so I might have an overly left-to-right inclined way of thinking.

To summarize:

So what we need is an interpreter for text lines (i.e. LTTextLine) that detects which parts of the text are left-to-right and which are right-to-left, and adds the appropriate Unicode directional characters where needed. This is precisely what PR #516 is about.
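To make the idea concrete, here is a minimal sketch of direction detection for a single line of text, using only Python's standard `unicodedata` module. It is not the code from PR #516 and not a full implementation of the Unicode Bidirectional Algorithm (UAX #9). The helper names `char_direction`, `split_runs`, and `mark_rtl` are hypothetical, and attaching neutral characters (spaces, digits, punctuation) to the preceding run is a simplifying assumption.

```python
import unicodedata

# Bidi classes for strongly directional characters.
RTL_CLASSES = {"R", "AL"}  # e.g. Hebrew letters ("R"), Arabic letters ("AL")
LTR_CLASSES = {"L"}        # e.g. Latin letters

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def char_direction(ch):
    """Return "rtl" or "ltr" for strong characters, None for everything else."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in RTL_CLASSES:
        return "rtl"
    if bidi in LTR_CLASSES:
        return "ltr"
    return None


def split_runs(text):
    """Split text into (direction, substring) runs.

    Assumption: neutral characters inherit the direction of the run before
    them, and text that starts with neutrals is treated as left-to-right.
    """
    runs = []
    current = None
    for ch in text:
        direction = char_direction(ch) or current or "ltr"
        if runs and runs[-1][0] == direction:
            runs[-1][1].append(ch)
        else:
            runs.append((direction, [ch]))
        current = direction
    return [(direction, "".join(chars)) for direction, chars in runs]


def mark_rtl(text):
    """Wrap each right-to-left run in RLI ... PDI isolate characters."""
    return "".join(
        RLI + chunk + PDI if direction == "rtl" else chunk
        for direction, chunk in split_runs(text)
    )


if __name__ == "__main__":
    for direction, chunk in split_runs("abc \u05e9\u05dc\u05d5\u05dd def"):
        print(direction, repr(chunk))
```

In pdfminer.six terms, the input string would be something like the result of `LTTextLine.get_text()`. Whether the characters in that string arrive in logical or visual order depends on the PDF, which this sketch does not address.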

mojivalipour commented 3 years ago

I understand that PR #516 is in progress for this request. However, I am unsure whether the feature is currently usable. If it is, is there an example that shows how to use it?

JStyle21 commented 2 years ago

Hi,

If this is implemented, what is the holdup here? Can anyone give an update?

pietermarsman commented 2 years ago

Anyone willing to work on #516? It implements this feature but it has been inactive for a while. And it does need extra work.

barhemo commented 2 years ago

Hey, when will it be ready? Right now, when I try to read from a PDF, the Hebrew characters are missing.

pietermarsman commented 2 years ago

No one is working on this currently.

taneron commented 2 months ago

I recently needed this and hacked around until I got something that worked for me. Maybe it's useful for someone: https://pypi.org/project/pdfminer.rtl