fail to read Chinese words pdfs with decoding error

hupili / python-for-data-and-media-communication-gitbook

An open source book on Python tailored for communication students with zero background

fail to read Chinese words pdfs with decoding error #135

Open ChicoXYC opened 5 years ago

ChicoXYC commented 5 years ago

Target

We want to extract all the text from thousands of pdfs.

Problem

Decoding problem: we cannot read PDFs that contain Chinese text.

Below are the attempts we have made so far, all of which hit an encoding error. See this notebook for details: http://nbviewer.jupyter.org/github/ChicoXYC/exercise/blob/master/get-text-from-pdf/read-chinese-pdf-with-encoding-error.ipynb

The PDFs can be found here: https://github.com/ChicoXYC/exercise/tree/master/get-text-from-pdf/pdfs

lullabymia commented 5 years ago

@hupili Could you please help us?

hupili commented 5 years ago

You could try other tools, such as “pandoc”.

lullabymia commented 5 years ago

Does pandoc accept PDF as input? The input formats it lists are commonmark, creole, docbook, docx, epub, fb2, gfm, haddock, html, jats, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki (no pdf).

@hupili https://nbviewer.jupyter.org/github/lullabymia/example/blob/master/Untitled.ipynb

hupili commented 5 years ago

How about the following two resources?

ChicoXYC commented 5 years ago

I've already tried this method before and tested it again today. It didn't work.

from pdfminer.pdfpage import PDFPage

error:

 No module named 'pdfminer.pdfpage'

and

TypeError: __init__() got an unexpected keyword argument 'codec'
# didn't find the solution
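
A note on the two errors above: both would be consistent with a pdfminer distribution other than the one the example code was written for (for instance `pdfminer3k` instead of `pdfminer.six`), though nothing in this thread confirms that. A minimal sketch for checking what is actually installed, using only the standard library; the helper names `installed_pdfminer_distributions` and `has_module` are hypothetical:

```python
import importlib.util
from importlib import metadata


def installed_pdfminer_distributions():
    """Return {distribution name: version} for the pdfminer variants found."""
    found = {}
    # Candidate distribution names; this list is an assumption, not exhaustive.
    for name in ("pdfminer", "pdfminer.six", "pdfminer3k"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this variant is not installed
    return found


def has_module(dotted_name):
    """Return True if `dotted_name` (e.g. 'pdfminer.pdfpage') can be located."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when the parent package itself is missing or fails to import.
        return False


if __name__ == "__main__":
    print(installed_pdfminer_distributions())
    print("pdfminer.pdfpage available:", has_module("pdfminer.pdfpage"))
```

If `has_module("pdfminer.pdfpage")` reports `False`, the installed package does not provide the module that the `from pdfminer.pdfpage import PDFPage` line expects.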

hupili commented 5 years ago

how about importing pdfminer alone?

ChicoXYC commented 5 years ago

Yes, I tried importing the module alone. Same problem as above; it didn't work for Chinese.

hupili commented 5 years ago

do you mean pdfminer works for English?

ChicoXYC commented 5 years ago

No, English doesn't work either. It seems the module's functions have changed, and I haven't found the answer. PyPDF2 does work for English, though.
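
For the original target (extracting text from thousands of PDFs), a batch loop might look like the sketch below. It assumes a recent `pdfminer.six` release, whose `pdfminer.high_level.extract_text` helper returns a Python string, and it has not been tested against the PDFs linked in this thread. The names `pdfminer_extract` and `extract_folder` are hypothetical, and the `extractor` parameter is there so another library can be swapped in.

```python
from pathlib import Path


def pdfminer_extract(pdf_path):
    """Extract text via pdfminer.six's high-level helper (assumed installed)."""
    # Imported lazily so extract_folder can be used with a different extractor.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def extract_folder(pdf_dir, out_dir, extractor=pdfminer_extract):
    """Write one UTF-8 .txt file per PDF; return (converted, failed) lists."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        try:
            text = extractor(pdf_path)
        except Exception as exc:
            # Record the failure and continue, so one bad file
            # does not stop a run over thousands of PDFs.
            failed.append((pdf_path.name, repr(exc)))
            continue
        # An explicit encoding avoids platform-default codec errors
        # when the text contains Chinese characters.
        (out_dir / (pdf_path.stem + ".txt")).write_text(text, encoding="utf-8")
        converted.append(pdf_path.name)
    return converted, failed
```

Usage would be along the lines of `converted, failed = extract_folder("pdfs", "txt")`, after which `failed` lists the files that raised an error together with the exception text.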