nltk / nltk_book

NLTK Book
http://www.nltk.org/book

Encoding of Inaugural texts #237

Open ssquires82 opened 4 years ago

ssquires82 commented 4 years ago

Three of the inaugural files have unanticipated encodings:

2005-Bush.txt - uses Macintosh Chinese Traditional encoding.
2013-Obama.txt - uses UTF-8 encoding.
2017-Trump.txt - uses UTF-8 encoding.

These will produce character and word errors if read in as ISO Latin-1.
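To illustrate the kind of error this causes, here is a minimal sketch of what happens when UTF-8 bytes are decoded as ISO Latin-1, together with a fallback decoder. The helper name `decode_with_fallback` and the sample strings are hypothetical and are not part of NLTK or of the corpus files.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Try strict UTF-8 first; fall back to ISO Latin-1 if that fails.

    Hypothetical helper for illustration only, not an NLTK function.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# A right single quotation mark (U+2019) as it would be stored in a UTF-8 file.
utf8_bytes = "America\u2019s".encode("utf-8")

# Reading those bytes as ISO Latin-1 turns one character into three.
misread = utf8_bytes.decode("latin-1")

# Trying UTF-8 first recovers the intended text.
recovered = decode_with_fallback(utf8_bytes)

# Bytes that really are ISO Latin-1 still decode through the fallback path.
latin1_text = decode_with_fallback(b"caf\xe9")
```

This fallback only separates UTF-8 from ISO Latin-1. It would not handle 2005-Bush.txt, whose bytes are neither.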

In Macintosh Chinese Traditional encoding, the byte with decimal value 161 (0xA1) is an escape character that begins a two-byte sequence. The following two-byte translations are needed for the 2005-Bush.txt file:

{161,88} to -
{161,166} to '
{161,75} to ...
{161,167} to "
{161,168} to "
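The two-byte translations listed above could be applied with a small scanner like the one below. This is only a sketch: the names `BUSH_2005_PAIRS`, `ESCAPE`, and `translate_escaped_pairs` are hypothetical, the sample bytes are made up rather than taken from the file, and it assumes every other byte can be read as ISO Latin-1.

```python
# Second byte of each {161, n} pair from the list above, mapped to its ASCII replacement.
BUSH_2005_PAIRS = {
    88: "-",
    166: "'",
    75: "...",
    167: '"',
    168: '"',
}
ESCAPE = 161  # decimal value of the escape byte


def translate_escaped_pairs(raw: bytes) -> str:
    """Replace the listed two-byte sequences; read all other bytes as ISO Latin-1."""
    out = []
    i = 0
    while i < len(raw):
        byte = raw[i]
        if byte == ESCAPE and i + 1 < len(raw) and raw[i + 1] in BUSH_2005_PAIRS:
            out.append(BUSH_2005_PAIRS[raw[i + 1]])
            i += 2  # consume the escape byte and the byte after it
        else:
            out.append(chr(byte))  # in ISO Latin-1 the byte value is the code point
            i += 1
    return "".join(out)


# Made-up sample bytes that exercise each pair once.
sample = (
    b"freedom" + bytes([161, 166]) + b"s cause "
    + bytes([161, 88]) + b" and "
    + bytes([161, 167]) + b"liberty" + bytes([161, 168])
    + bytes([161, 75])
)
cleaned = translate_escaped_pairs(sample)
```

Scanning left to right, rather than calling `bytes.replace` once per pair, keeps the pairs aligned, so a byte that happens to follow an unrelated 161 is not misread as the start of a new sequence.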