Chapter 3: use utf-8-sig as the encoding to exclude the BOM from the book file

In this section:

The following code is used to decode the bytes returned by the Project Gutenberg web server:

raw = response.read().decode('utf8')

With Python 3.7.4, the value of raw then begins with a byte-order mark (BOM):

'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r'

and the return value of len(raw) is 1176967 rather than 1176893.

The Python Unicode HOWTO recommends utf-8-sig as the encoding for reading such files: it skips a leading BOM if one is present. The BOM isn't needed for UTF-8 anyway, since UTF-8 has no byte-order ambiguity.

from urllib import request

url = 'http://www.gutenberg.org/files/2554/2554-0.txt'

response = request.urlopen(url)

raw = response.read().decode(encoding='utf-8-sig')

type(raw)

str

len(raw)

raw[:75]

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
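The difference between the two codecs can also be checked offline. The following is a minimal sketch, assuming a short hand-built byte string (hypothetical sample data, not the actual download) that starts with the UTF-8 BOM:

```python
import codecs

# Hypothetical sample standing in for the start of the downloaded file:
# the UTF-8 BOM (b'\xef\xbb\xbf') followed by some text.
data = codecs.BOM_UTF8 + 'The Project Gutenberg EBook'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as the character U+FEFF at index 0.
with_bom = data.decode('utf-8')

# 'utf-8-sig' drops a leading BOM if one is present.
without_bom = data.decode('utf-8-sig')

print(repr(with_bom[0]))                  # '\ufeff'
print(len(with_bom) - len(without_bom))   # 1

# 'utf-8-sig' is also safe for input that has no BOM at all.
plain = 'no BOM here'.encode('utf-8')
print(plain.decode('utf-8-sig') == plain.decode('utf-8'))  # True
```

Because utf-8-sig leaves BOM-free input unchanged when decoding, it should be a safe choice here whether or not the server includes the BOM.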
