jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

pdfplumber characters missing (for Chinese characters) #1022

Closed mosescha closed 8 months ago

mosescha commented 8 months ago

Describe the bug

pdfplumber fails to extract some Chinese characters from my PDF. Here is the error case: 株式會社 大韓航空 is extracted as 大 空.

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.
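As a rough sketch of what this step asks for, the helper below extracts text with or without `repair=True` so the two results can be compared. The name `extract_pages` is hypothetical, and as far as I know the repair step depends on Ghostscript being installed.

```python
def extract_pages(pdf_path, repair=False):
    """Return the extracted text of every page as a list of strings."""
    import pdfplumber  # imported lazily so the helper can be defined without it

    with pdfplumber.open(pdf_path, repair=repair) as pdf:
        # Fall back to "" in case a page yields no text at all.
        return [page.extract_text() or "" for page in pdf.pages]
```

If `extract_pages(path)` and `extract_pages(path, repair=True)` return the same text, repairing the file is unlikely to be the fix.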

Code to reproduce the problem


import pdfplumber

# Raw string so that "\t" and "\202" in the Windows path are not treated as escapes
pdf_path = r"c:\temp\202010200090007270.pdf"
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        page_text = page.extract_text()
        print(page_text)
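A side note on Windows paths like the one in this report: in a regular (non-raw) Python string literal, `\t` is a tab and `\202` is an octal escape, so the path is silently altered unless it is written as a raw string or with forward slashes. A small sketch of the difference:

```python
# Regular literal: "\t" becomes a tab and "\202" an octal escape (U+0082).
plain = "c:\temp\202010200090007270.pdf"
# Raw literal: backslashes are kept exactly as typed.
raw = r"c:\temp\202010200090007270.pdf"

print(repr(plain))
print(repr(raw))
print(plain == raw)  # False
```

This is unrelated to the missing characters, but it can make the file fail to open at all.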

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.


mosescha commented 8 months ago

The PDF file is attached: 202010200090007270.pdf

mosescha commented 8 months ago

The fitz module (PyMuPDF) doesn't have this missing-character problem.

cmdlineluser commented 8 months ago

Hi @mosescha

Here is the error case 株式會社 大韓航空 ==> 大 空

Perhaps some page_number info / screenshots could help?

It's probably a little difficult to locate for non-native speakers.

Is it this item on Page 1?

Screen Shot 2023-10-25 at 16 42 13

pdfplumber uses pdfminer.six to read the PDF, so it could possibly be a pdfminer issue.

jsvine commented 8 months ago

Thanks, @cmdlineluser. Based on some exploration I just did, it does look like that's the place @mosescha is flagging. On my computer, if I select that text and copy-paste it into Mac's TextEdit program, I get this:

Screenshot 2023-10-25 at 5 24 24 PM

So it seems to be a font / character-encoding issue. If fitz translates the characters correctly, per @mosescha, perhaps it's something pdfminer.six could solve. But this seems outside the realm of what can be fixed in pdfplumber itself.

mosescha commented 8 months ago

Is it this item on Page 1?

Right.

cmdlineluser commented 8 months ago

@jsvine Ah, interesting.

fitz doesn't seem to handle it on my end.

Screen Shot 2023-10-26 at 00 26 30



pypdfium2 seems to do a bit better:

Screen Shot 2023-10-26 at 00 27 14



pdfminer gives me those cid: codes.

상  호  (cid:35128)(cid:33835)(cid:36204)(cid:33383)

There seem to be quite a few issues related to that. https://github.com/pdfminer/pdfminer.six/issues?q=cid
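To check how widespread the problem is in a document, one option is to look for these placeholders programmatically. The sketch below assumes unmapped glyphs show up as `(cid:N)` text, as in the output above, and that the character dicts look like pdfplumber's `page.chars` (each with at least `text` and `fontname` keys). The helper names are hypothetical.

```python
import re
from collections import Counter

# pdfminer.six renders a glyph it cannot map to Unicode as "(cid:<number>)".
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the numeric CIDs of all unmapped-glyph placeholders in text."""
    return [int(n) for n in CID_PATTERN.findall(text)]


def unmapped_by_font(chars):
    """Count placeholder characters per font name.

    `chars` is a list of dicts shaped like pdfplumber's page.chars.
    """
    counts = Counter()
    for char in chars:
        if CID_PATTERN.fullmatch(char["text"]):
            counts[char["fontname"]] += 1
    return dict(counts)


line = "상  호  (cid:35128)(cid:33835)(cid:36204)(cid:33383)"
print(find_cids(line))  # [35128, 33835, 36204, 33383]
```

On a real file, `unmapped_by_font(page.chars)` would show which fonts the placeholders come from, which may help when reporting the problem to pdfminer.six.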

mosescha commented 8 months ago

Here is a screenshot for fitz.

As you can see, it has no problem with missing Chinese characters.

jsvine commented 8 months ago

Hmm, perhaps the difference @mosescha and @cmdlineluser are seeing re. fitz is related to the operating system (which ones are you on?) or fitz version?

In any case, this seems like an issue best resolved in the pdfminer.six codebase. For that reason, I'm closing this issue, but feel free to continue the discussion.

cmdlineluser commented 8 months ago

@mosescha You should already have pypdfium2 installed, so you could try using it to extract the text and see if it does a better job: https://github.com/jsvine/pdfplumber/discussions/962#discussioncomment-6700337
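For reference, a minimal sketch of extracting text with pypdfium2 directly, assuming a 4.x release where `PdfPage.get_textpage()` and `PdfTextPage.get_text_range()` are available. The helper name is hypothetical, and this is not necessarily the same approach as in the linked discussion.

```python
def extract_text_pypdfium2(pdf_path):
    """Return the text of each page, extracted with pypdfium2 (PDFium)."""
    import pypdfium2 as pdfium  # imported lazily so the helper can be defined without it

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            # With no arguments, get_text_range() covers the whole page.
            texts.append(textpage.get_text_range())
        return texts
    finally:
        pdf.close()
```

Comparing its output for page 1 with `page.extract_text()` from pdfplumber should show whether PDFium maps the affected glyphs any better.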