Hi @Likangkang08, SeleniumBase uses an external library, PyPDF2, to read PDFs. That GitHub repo is here: https://github.com/mstamy2/PyPDF2 and it looks like there's already an open issue about reading Chinese characters from PDFs here: https://github.com/mstamy2/PyPDF2/issues/252 Unless they update their library (or you know of a better one to use), I have no way of adding this feature directly to SeleniumBase.
Dear @mdmintz, I found a library that can handle Chinese PDFs: https://pypi.org/project/pdfminer3k/
Code implementation:
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def convert_pdf(path, page=1):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, pageno=page, laparams=laparams)
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

file = r'D:\Download\Admin-Guide.pdf'
print(convert_pdf(file))
https://pypi.org/project/pdfminer.six/ looks newer. The library you sent me was last updated in 2011, and looks out-of-date. I'll use the PDF you sent me earlier to test my updates to SeleniumBase. Might take a few days.
Both PDFs are in Chinese: unittest.pdf, test.pdf
Thank you. I'll use those for testing.
You're welcome.
I'm nearly done. I ran into an issue where the same Chinese character was appearing under a different Unicode code point.
The Unicode code points of Chinese characters differ between the HTML and the PDF. The HTML code point is on the left, the PDF code point is on the right:
行 (\u884c) ---- ⾏ (\u2f8f)
方 (\u65b9) ---- ⽅ (\u2f45)
人 (\u4eba) ---- ⼈ (\u2f08)
(The HTML encoding is UTF-8.)
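One way to see what is going on is to print each character's code point and Unicode name. The sketch below uses only Python's standard `unicodedata` module; the helper name `describe_chars` is made up for illustration and is not part of SeleniumBase or pdfminer. The PDF-side characters listed above fall in the Kangxi Radicals block (U+2F00 to U+2FD5), while the HTML-side characters are ordinary CJK unified ideographs, so the two look alike but compare as different strings.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text.

    Hypothetical helper for illustration only.
    """
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# HTML-side character followed by its look-alike from the PDF.
for row in describe_chars("\u884c\u2f8f"):
    print(row)
```

Running this on a short sample of extracted PDF text should show quickly whether a failed comparison comes from look-alike code points rather than from the extraction itself.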
This meant that comparisons didn't work properly until I converted the Chinese characters to the same Unicode code points.
Replacing the Chinese PDF characters with Chinese HTML characters fixes the problem I had:
text = text.replace(u'\u2f8f', u'\u884c')
text = text.replace(u'\u2f45', u'\u65b9')
text = text.replace(u'\u2f08', u'\u4eba')
text = text.replace(u'\u2f70', u'\u793a')
This should be a good solution.
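The four replacements above could probably be generalized. Kangxi radical code points carry Unicode compatibility mappings to the matching unified ideographs, so `unicodedata.normalize` with NFKC performs the same substitution for the whole block. This is only a sketch under that assumption, not necessarily what SeleniumBase ships, and `normalize_kangxi_radicals` is a hypothetical helper name. It limits NFKC to U+2F00 through U+2FD5 so that other text (full-width punctuation, ligatures) is left untouched.

```python
import unicodedata

# Bounds of the Unicode "Kangxi Radicals" block.
KANGXI_START = 0x2F00
KANGXI_END = 0x2FD5


def normalize_kangxi_radicals(text):
    """Replace Kangxi radical code points with their unified ideographs.

    Hypothetical helper: characters outside the Kangxi Radicals block
    are returned unchanged.
    """
    out = []
    for ch in text:
        if KANGXI_START <= ord(ch) <= KANGXI_END:
            # NFKC applies the compatibility mapping, e.g. U+2F8F -> U+884C.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


pdf_text = "\u2f8f\u2f45\u2f08\u2f70"
print(normalize_kangxi_radicals(pdf_text) == "\u884c\u65b9\u4eba\u793a")
```

Applying the same function to both the extracted PDF text and the expected text before comparing would avoid maintaining a per-character table.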
@Likangkang08 It's ready! https://github.com/seleniumbase/SeleniumBase/releases/tag/v1.33.8
I tried it, and it perfectly solved the Chinese problem. Thank you for your help.
PyPDF2 should now work directly with Chinese characters.
@MartinThoma PyPDF2 is still broken on version 2.1.0:
Tested with: https://seleniumbase.io/cdn/pdf/unittest_zh.pdf
Code:
from PyPDF2 import PdfReader
file_path = "/Users/michael/github/SeleniumBase/examples/downloaded_files/unittest_zh.pdf"
reader = PdfReader(file_path)
page = reader.pages[1]
pdf_text = page.extract_text()
Debugging output:
ipdb> page
{'/Type': '/Page', '/Parent': IndirectObject(3, 0), '/Resources': IndirectObject(6, 0), '/Contents': IndirectObject(4, 0), '/MediaBox': [0, 0, 1024, 768]}
ipdb> pdf_text = page.extract_text()
*** TypeError: a bytes-like object is required, not 'dict'
The same PyPDF2 code worked on a simpler PDF that didn't contain Chinese characters: https://nostarch.com/download/Automate_the_Boring_Stuff_dTOC.pdf
Given that pdfminer.six has been working perfectly with SeleniumBase, I'll be sticking with that.
For example, this file: unittest.pdf