pdfplumber characters missing (for Chinese characters)

mosescha commented 8 months ago

Describe the bug

pdfplumber fails to extract some Chinese characters from a PDF. Here is the failing case: the text 株式會社大韓航空 is extracted as 大空.
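To pin down exactly which characters were dropped, the expected and extracted strings can be compared directly. The sketch below is only illustrative: `dropped_characters` is a hypothetical helper, not part of pdfplumber, and it assumes the extracted text is a subsequence of the expected text (characters dropped, not reordered or substituted), as in the case above.

```python
def dropped_characters(expected: str, extracted: str) -> str:
    """Return the characters of `expected` that did not survive extraction.

    Hypothetical helper. Assumes `extracted` is a subsequence of `expected`.
    """
    missing = []
    pos = 0  # index of the next unmatched character in `extracted`
    for ch in expected:
        if pos < len(extracted) and ch == extracted[pos]:
            pos += 1  # this character survived extraction
        else:
            missing.append(ch)  # this character was dropped
    return "".join(missing)


print(dropped_characters("株式會社大韓航空", "大空"))  # prints 株式會社韓航
```

Listing the dropped characters this way makes it easier for non-native readers to see what to look for on the page.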

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.

Code to reproduce the problem

Paste it here, or attach a Python file.

import pdfplumber

# Raw string so the backslashes in the Windows path are not treated as escapes
pdf_path = r"c:\temp\202010200090007270.pdf"
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        page_text = page.extract_text()
        print(page_text)

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Environment

pdfplumber version: [e.g., 0.10.2]
Python version: [e.g., 3.9.13]
OS: Windows 10

mosescha commented 8 months ago

The PDF file is attached: 202010200090007270.pdf

mosescha commented 8 months ago

The fitz module (PyMuPDF) doesn't have the same missing-character problem.

cmdlineluser commented 8 months ago

Hi @mosescha

Here is the error case 株式會社大韓航空 ==> 大空

Perhaps some page_number info / screenshots could help?

It's probably a little difficult to locate for non-native speakers.

Is it this item on Page 1?

pdfplumber uses pdfminer.six to read the PDF, so it could possibly be a pdfminer issue.

jsvine commented 8 months ago

Thanks, @cmdlineluser. It does look like that's the place @mosescha is flagging, based on some exploration I just did. On my computer, if I select that text and then copy-paste it into Mac's TextEdit program, I get this:

So it seems to be a font / character-encoding issue. If fitz translates the characters correctly, per @mosescha, perhaps it's something pdfminer.six could solve. But this seems outside the realm of what can be fixed in pdfplumber itself.

mosescha commented 8 months ago

"Is it this item on Page 1?"

Right, that is the item.

cmdlineluser commented 8 months ago

@jsvine Ah, interesting.

fitz doesn't seem to handle it on my end.

pypdfium2 seems to do a bit better:

pdfminer gives me those cid: codes.

상  호  (cid:35128)(cid:33835)(cid:36204)(cid:33383)
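pdfminer.six emits a `(cid:N)` placeholder when it cannot map a glyph's character ID to Unicode, so these tokens are a quick signal that extraction has hit a font-encoding problem. A minimal sketch for spotting them in extracted text follows; `find_unmapped_cids` is a hypothetical helper name, not a pdfminer.six or pdfplumber function.

```python
import re
from typing import List

# Matches the "(cid:N)" placeholders that appear in extracted text
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def find_unmapped_cids(text: str) -> List[int]:
    """Return the numeric character IDs of every (cid:N) placeholder in `text`."""
    return [int(number) for number in CID_TOKEN.findall(text)]


sample = "상  호  (cid:35128)(cid:33835)(cid:36204)(cid:33383)"
print(find_unmapped_cids(sample))  # prints [35128, 33835, 36204, 33383]
```

Running this over each page's extracted text would show which pages are affected without having to read the output by eye.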

There seem to be quite a few issues related to that. https://github.com/pdfminer/pdfminer.six/issues?q=cid

mosescha commented 8 months ago

Here is a screenshot of the fitz output.

As you can see, it has no problem with missing Chinese characters.

jsvine commented 8 months ago

Hmm, perhaps the difference @mosescha and @cmdlineluser are seeing re. fitz is related to the operating system (which ones are you on?) or fitz version?
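One way to answer the operating-system and version question is to print the relevant details on each machine and compare. The sketch below uses only the standard library; the package list is an assumption based on the libraries mentioned in this thread (PyMuPDF is the distribution that provides `fitz`), and `environment_report` is a hypothetical helper.

```python
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("pdfplumber", "pdfminer.six", "PyMuPDF", "pypdfium2")):
    """Collect OS, Python, and package versions into a dict."""
    report = {"os": platform.platform(), "python": sys.version.split()[0]}
    for name in packages:
        report[name] = package_version(name)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Pasting this output into the thread would make it possible to tell whether the differing fitz results come from the OS or from the library version.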

In any case, this seems like an issue best resolved in the pdfminer.six codebase. For that reason, I'm closing this issue, but feel free to continue the discussion.

cmdlineluser commented 8 months ago

@mosescha You should already have pypdfium2 installed, so you could try parsing the text with it to see if it does a better job: https://github.com/jsvine/pdfplumber/discussions/962#discussioncomment-6700337
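A rough sketch of what that could look like is below. It assumes the pypdfium2 v4 helper API (`PdfDocument`, `get_textpage`, `get_text_range`) and is not necessarily the approach shown in the linked discussion; `extract_text_with_pypdfium2` is a hypothetical function name.

```python
def extract_text_with_pypdfium2(pdf_path):
    """Return a list with the text of each page, extracted via pypdfium2.

    Hypothetical helper; assumes the pypdfium2 v4 API.
    """
    import pypdfium2 as pdfium  # imported lazily so this module loads without it

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            texts.append(textpage.get_text_range())  # full text of the page
            textpage.close()
            page.close()
        return texts
    finally:
        pdf.close()


# Example usage (path taken from the report above):
# for text in extract_text_with_pypdfium2(r"c:\temp\202010200090007270.pdf"):
#     print(text)
```

Comparing this output with `page.extract_text()` from pdfplumber for the same page should show whether the missing characters are specific to pdfminer.six.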
