"get_pdf_text()" returns garbled text when the PDF is in Chinese. #431

Closed ItTestKing closed 4 years ago

ItTestKing commented 4 years ago

For example, this file: CodingProblem pdf unittest.pdf

mdmintz commented 4 years ago

Hi @Likangkang08, SeleniumBase uses an external library, PyPDF2, for reading PDFs. That GitHub repo is here: https://github.com/mstamy2/PyPDF2 , and it looks like there's already an open issue about reading Chinese characters from PDFs: https://github.com/mstamy2/PyPDF2/issues/252 . Unless they update their library (or you know of a better one to use), I have no way of adding this feature directly to SeleniumBase.

ItTestKing commented 4 years ago

Dear @mdmintz, I found a library that can handle Chinese PDFs: https://pypi.org/project/pdfminer3k/ . Code implementation:

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def convert_pdf(path, page=1):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, pageno=page, laparams=laparams)
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

file = r'D:\Download\Admin-Guide.pdf'
print(convert_pdf(file))

mdmintz commented 4 years ago

https://pypi.org/project/pdfminer.six/ looks newer. The library you sent me was last updated in 2011, and looks out-of-date. I'll use the PDF you sent me earlier to test my updates to SeleniumBase. Might take a few days.

ItTestKing commented 4 years ago

Both PDFs are in Chinese: unittest.pdf, test.pdf

mdmintz commented 4 years ago

Thank you. I'll use those for testing.

ItTestKing commented 4 years ago

You're welcome.

mdmintz commented 4 years ago

I'm nearly done. I ran into an issue where the same Chinese character was appearing as a different Unicode code point:

The Unicode code points of Chinese characters differ between HTML and PDF. The HTML character is on the left and the PDF character is on the right:

行 (\u884c) ---- ⾏ (\u2f8f)
方 (\u65b9) ---- ⽅ (\u2f45)
人 (\u4eba) ---- ⼈ (\u2f08)

(The HTML encoding is UTF-8.)

This meant that comparisons didn't work until the Chinese characters were converted to the same Unicode code points.
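
One way to see why two visually identical characters compare as unequal is to print their code points and Unicode names. The sketch below uses only the standard library's unicodedata module; the helper name describe_char is hypothetical and not part of SeleniumBase.

```python
import unicodedata


def describe_char(ch):
    """Return a string such as 'U+884C CJK UNIFIED IDEOGRAPH-884C'."""
    # "<unnamed>" is a fallback for code points that have no Unicode name.
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))


# Pairs from the comment above: (HTML character, PDF character).
pairs = [("\u884c", "\u2f8f"), ("\u65b9", "\u2f45"), ("\u4eba", "\u2f08")]
for html_char, pdf_char in pairs:
    print(describe_char(html_char), "vs", describe_char(pdf_char),
          "| equal:", html_char == pdf_char)
```

The PDF-side characters fall in the Kangxi Radicals block (U+2F00 to U+2FD5), which is separate from the CJK Unified Ideographs block, so string equality fails even though the glyphs look the same.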

mdmintz commented 4 years ago

Replacing the Chinese PDF characters with Chinese HTML characters fixes the problem I had:

text = text.replace(u'\u2f8f', u'\u884c')
text = text.replace(u'\u2f45', u'\u65b9')
text = text.replace(u'\u2f08', u'\u4eba')
text = text.replace(u'\u2f70', u'\u793a')
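
The replacements above cover only the characters that showed up during testing. As a more general alternative (a swapped-in technique, not something this thread says SeleniumBase does), Unicode NFKC normalization maps Kangxi radical code points to their unified ideograph equivalents. A minimal sketch, assuming the other changes NFKC makes, such as folding full-width Latin letters to ASCII, are acceptable for the comparison; normalize_pdf_text is a hypothetical helper name:

```python
import unicodedata


def normalize_pdf_text(text):
    """Fold compatibility characters (e.g. Kangxi radicals) to canonical forms."""
    # NFKC applies compatibility decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", text)


pdf_text = "\u2f8f\u2f45\u2f08\u2f70"   # Kangxi radical forms, as extracted from the PDF
html_text = "\u884c\u65b9\u4eba\u793a"  # Unified ideographs, as found in HTML
print(normalize_pdf_text(pdf_text) == html_text)
```

Explicit replace calls change only the listed characters, while normalization handles the whole radical block at once at the cost of touching other compatibility characters too.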

ItTestKing commented 4 years ago

This should be a good solution.

mdmintz commented 4 years ago

@Likangkang08 It's ready! https://github.com/seleniumbase/SeleniumBase/releases/tag/v1.33.8

ItTestKing commented 4 years ago

I tried it, and it perfectly solved the Chinese problem. Thank you for your help.

MartinThoma commented 2 years ago

PyPDF2 should now work directly with Chinese characters.

mdmintz commented 2 years ago

@MartinThoma PyPDF2 is still broken on version 2.1.0. Tested with: https://seleniumbase.io/cdn/pdf/unittest_zh.pdf

Code:

from PyPDF2 import PdfReader
file_path = "/Users/michael/github/SeleniumBase/examples/downloaded_files/unittest_zh.pdf"
reader = PdfReader(file_path)
page = reader.pages[1]
pdf_text = page.extract_text()

Debugging output:

ipdb> page
{'/Type': '/Page', '/Parent': IndirectObject(3, 0), '/Resources': IndirectObject(6, 0), '/Contents': IndirectObject(4, 0), '/MediaBox': [0, 0, 1024, 768]}
ipdb> pdf_text = page.extract_text()
*** TypeError: a bytes-like object is required, not 'dict'

The same PyPDF2 code worked on a simpler PDF that didn't contain Chinese characters: https://nostarch.com/download/Automate_the_Boring_Stuff_dTOC.pdf

Given that pdfminer.six has been working perfectly with SeleniumBase, I'll be sticking with that.
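
For readers who want to try pdfminer.six directly, here is a minimal sketch using its high-level extract_text function. It assumes pdfminer.six is installed (pip install pdfminer.six). The helper name, the deferred import, and the file path in the comment are illustrative choices, and this is not necessarily how SeleniumBase calls the library internally.

```python
def extract_pdf_text(pdf_path, page_index=None):
    """Return the text of a PDF, or of one zero-indexed page if page_index is given."""
    # Imported inside the function so this module still loads where
    # pdfminer.six is not installed (an assumption made for this sketch).
    from pdfminer.high_level import extract_text

    # extract_text takes a list of zero-indexed page numbers, or None for all pages.
    page_numbers = None if page_index is None else [page_index]
    return extract_text(pdf_path, page_numbers=page_numbers)


# Example call, assuming a local copy of the test PDF (hypothetical path):
# text = extract_pdf_text("unittest_zh.pdf", page_index=1)
```

Text extracted this way may still contain Kangxi radical code points, so the character replacements discussed earlier in the thread can still be needed before comparing against HTML text.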