Closed mosescha closed 8 months ago
The PDF file is attached: 202010200090007270.pdf
The fitz module doesn't have the same problem with missing characters.
Hi @mosescha
Here is the error case 株式會社 大韓航空 ==> 大 空
Perhaps some page_number info / screenshots could help?
It's probably a little difficult to locate for non-native speakers.
Is it this item on Page 1?
pdfplumber uses pdfminer.six to read the PDF, so it could possibly be a pdfminer issue.
Thanks, @cmdlineluser. It does look like that's the place @mosescha is flagging, based on some exploration I just did. On my computer, if I select that text and then copy-paste it into Mac's TextEdit program, I get this:
So it seems to be a font / character-encoding issue. If fitz translates the characters correctly, per @mosescha, perhaps it's something pdfminer.six could solve. But this seems outside the realm of what can be fixed in pdfplumber itself.
"Is it this item on Page 1?" Right.
@jsvine Ah, interesting.
fitz doesn't seem to handle it on my end.
pypdfium2 seems to do a bit better:
pdfminer gives me those cid: codes.
상 호 (cid:35128)(cid:33835)(cid:36204)(cid:33383)
There seem to be quite a few issues related to that. https://github.com/pdfminer/pdfminer.six/issues?q=cid
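For anyone wanting to check how widespread the problem is in a given extraction, here is a minimal sketch that scans extracted text for the literal "(cid:N)" markers that pdfminer.six emits when it cannot map a glyph to Unicode. The helper name find_unmapped_cids is hypothetical, not part of pdfminer.six or pdfplumber, and the sample string is the line quoted above.

```python
import re

# pdfminer.six writes "(cid:N)" into the output when a character ID has no
# Unicode mapping. This pattern captures the numeric ID of each such marker.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_unmapped_cids(text):
    """Return the numeric IDs of all "(cid:N)" markers in text, in order.

    Hypothetical helper for illustration; it only inspects a string and
    does not touch the PDF itself.
    """
    return [int(number) for number in CID_PATTERN.findall(text)]


# The line reported in this thread: four glyphs were left unmapped.
sample = "상 호 (cid:35128)(cid:33835)(cid:36204)(cid:33383)"
print(find_unmapped_cids(sample))
```

Running this over page.extract_text() output for each page would show which pages contain unmapped glyphs, which may help when reporting the problem upstream.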
Here is a screenshot for fitz.
As you can see, it has no problem with missing Chinese characters.
Hmm, perhaps the difference @mosescha and @cmdlineluser are seeing re. fitz is related to the operating system (which ones are you on?) or fitz version?
In any case, this seems like an issue better/best resolved in the pdfminer.six codebase. For that reason, I'm closing this comment, but feel free to continue the discussion.
@mosescha You should have pypdfium2 installed, so you could try using that to parse the text and see if it does a better job: https://github.com/jsvine/pdfplumber/discussions/962#discussioncomment-6700337
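As a rough illustration of that suggestion, the sketch below reads each page's text with pypdfium2 directly. It assumes the pypdfium2 4.x API (PdfDocument, get_textpage, get_text_range); the function name extract_text_with_pypdfium2 is hypothetical, and this is not necessarily the approach shown in the linked discussion.

```python
def extract_text_with_pypdfium2(pdf_path):
    """Return a list holding the extracted text of each page.

    Hypothetical helper; assumes pypdfium2 4.x is installed. The import is
    done lazily so the error, if any, surfaces only when the helper is used.
    """
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    page_texts = []
    for index in range(len(pdf)):
        page = pdf[index]                # load one page
        textpage = page.get_textpage()   # build its text layer
        page_texts.append(textpage.get_text_range())
    return page_texts


# Example usage (path taken from the bug report, adjust as needed):
# for text in extract_text_with_pypdfium2(r"c:\temp\202010200090007270.pdf"):
#     print(text)
```

Comparing this output with what page.extract_text() returns for the same page should show whether the missing characters are specific to the pdfminer.six code path.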
Describe the bug
A clear and concise description of what the bug is.
I fail to extract Chinese characters from the PDF. Here is the error case: 株式會社 大韓航空 ==> 大 空
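To make a report like this easier to quantify, here is a small sketch that lists which expected characters are absent from the extracted text. The helper name missing_characters is hypothetical, and the two strings are taken as reported in this issue.

```python
def missing_characters(expected, extracted):
    """Return the characters of expected that are not found in extracted.

    Whitespace in expected is ignored. Each extracted character can satisfy
    only one expected character, so repeated characters are counted properly.
    Hypothetical helper for illustration only.
    """
    remaining = list(extracted)
    missing = []
    for char in expected:
        if char.isspace():
            continue
        if char in remaining:
            remaining.remove(char)  # consume the matched character
        else:
            missing.append(char)
    return missing


# Strings as reported in the bug description.
print(missing_characters("株式會社 大韓航空", "大 空"))
```

The same check can be run against the output of any extractor to compare how many characters each one drops.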
Have you tried repairing the PDF?
Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.
Code to reproduce the problem
Paste it here, or attach a Python file.
import pdfplumber

# Raw string so that "\t" and "\202" in the Windows path are not read as escapes
pdf_path = r"c:\temp\202010200090007270.pdf"
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        page_text = page.extract_text()
        print(page_text)
PDF file
Please attach any PDFs necessary to reproduce the problem.
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
Expected behavior
What did you expect the result should have been?
Actual behavior
What actually happened, instead?
Screenshots
If applicable, add screenshots to help explain your problem.
Environment
Additional context
Add any other context/notes about the problem here.