nltk / nltk_book

NLTK Book
http://www.nltk.org/book

Chapter 3 use utf-8-sig for encoding to exclude BOM in book file #225

Open pjhinton opened 4 years ago

In this section:

http://www.nltk.org/book/ch03.html#accessing-text-from-the-web-and-from-disk

The following code is used to decode the bytes returned by the Project Gutenberg web server:

raw = response.read().decode('utf8')

With Python 3.7.4, the value of raw begins with a byte-order mark (BOM), so raw[:75] is:

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

and len(raw) returns 1176967 rather than 1176893.

The Python Unicode HOWTO recommends decoding with the utf-8-sig codec, which skips a leading BOM if one is present. UTF-8 has no byte-order ambiguity, so the BOM isn't really needed.

>>> from urllib import request
>>> url = 'http://www.gutenberg.org/files/2554/2554-0.txt'
>>> response = request.urlopen(url)
>>> raw = response.read().decode(encoding='utf-8-sig')
>>> type(raw)
<class 'str'>
>>> len(raw)
1176966
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
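
The difference between the two codecs can also be checked without a network request. The following is only a minimal sketch: the `sample` bytes are a made-up stand-in for the downloaded file, and `strip_bom` is a hypothetical helper, not something from the book or from NLTK.

```python
import codecs

# Hypothetical stand-in for the downloaded bytes: a UTF-8 BOM followed by
# one line of text. The real file is much longer.
sample = codecs.BOM_UTF8 + "The Project Gutenberg EBook\r\n".encode("utf-8")

# Plain 'utf-8' keeps the BOM as the code point U+FEFF at index 0.
with_bom = sample.decode("utf-8")

# 'utf-8-sig' drops a leading BOM if one is present.
without_bom = sample.decode("utf-8-sig")


def strip_bom(text):
    """Hypothetical helper: remove one leading U+FEFF from a decoded string."""
    return text[1:] if text.startswith("\ufeff") else text


print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
print(len(with_bom) - len(without_bom))
```

Because utf-8-sig also decodes input that has no BOM, it should be safe to use even if the file on the server is later re-saved without one. `strip_bom` covers the case where the text has already been decoded with plain utf-8.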