jackiekazil / data-wrangling

Code repository for Data Wrangling with Python (O'Reilly)

UnicodeDecodeError: 'gbk' codec can't decode byte when running parse_pdf_text.py #9

Open zlqs1985 opened 7 years ago

zlqs1985 commented 7 years ago

Hi, thank you for your wonderful book on data wrangling. I ran into an issue when running parse_pdf_text.py from chapter 5 in Anaconda (Python 3.5). The IDE shows the following error message:

Traceback (most recent call last):
  File "<ipython-input-10-957ab6bc6f5e>", line 39, in <module>
    for line in openfile:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 46: illegal multibyte sequence

It looks like the code opened the file in text mode with the "gbk" encoding. Should it be opened in binary mode instead? I'm not sure. How can I fix this problem? Thank you.
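
For context, the error can be reproduced without the PDF text file. In Python 3, text mode without an `encoding` argument falls back to the locale's preferred encoding, which would explain the "gbk" codec here. The snippet below is a minimal sketch: the sample bytes are made up, and the guess that byte 0x93 is a Windows-1252 curly quote is an assumption, not something confirmed from the book's data.

```python
# Hypothetical sample: 0x93 is a left curly quote in Windows-1252 (cp1252).
sample = b"Total \x93 children"


def try_decode(data, encoding):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


# GBK treats 0x93 as the first byte of a two-byte character, and the
# following space is not a valid second byte, so decoding fails.
gbk_result = try_decode(sample, "gbk")

# cp1252 maps 0x93 to a single character, so decoding succeeds.
cp1252_result = try_decode(sample, "cp1252")

print(repr(gbk_result))
print(repr(cp1252_result))
```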

kjam commented 7 years ago

Hi there,

Can you change this line near the top of the file:

openfile = open(pdf_txt, 'r')

to this:

openfile = open(pdf_txt, 'rb')

Then let me know if that works better. Thanks!

-kjam
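
A note for Python 3 readers: in binary mode each line is a `bytes` object rather than `str`, so string comparisons later in the script may need a decode step. Passing an explicit `encoding` to `open` in text mode is another possible route. The sketch below tries both on a throwaway file; the file contents and the choice of `cp1252` are assumptions for illustration only, since the real encoding of the book's text file is not stated in this thread.

```python
import os
import tempfile

# Write a throwaway file containing a cp1252 curly quote (byte 0x93).
# The contents are hypothetical, not the book's data.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"First line\nTotal \x93 children\n")
    path = tmp.name

# Option 1: binary mode, as suggested above, decoding each line explicitly.
with open(path, "rb") as openfile:
    binary_lines = [line.decode("cp1252").strip() for line in openfile]

# Option 2: text mode with an explicit encoding instead of the locale default.
with open(path, "r", encoding="cp1252") as openfile:
    text_lines = [line.strip() for line in openfile]

os.remove(path)

# Both approaches should yield the same list of str lines.
print(binary_lines == text_lines)
```

If the true encoding is unknown, `open(path, "r", encoding="utf-8", errors="replace")` is one way to keep the loop running at the cost of replacing undecodable bytes.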