Closed by ghost 13 years ago
Did you try passing the encoding='iso-8859-1' argument to train()?
Ah sorry - missed that in the documentation. It works :-)
Auto-detecting the encoding from the HTTP response or the HTML meta tag would be nice to have.
Yes, though the scrapely library is really meant to be used with unicode.
For auto-detecting response encodings you can use other libraries that already do that job very well (like Scrapy), since reliably discovering a web page's encoding is not trivial at all.
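To give a rough idea of what such detection involves, here is a minimal standard-library sketch. It is not how Scrapy or scrapely do it: the helper names `guess_encoding` and `to_unicode` are made up for illustration, and it only checks the Content-Type header and a meta tag, ignoring BOMs and content sniffing, which is part of why a dedicated library is the better choice.

```python
import re

# Hypothetical helpers for illustration; not part of scrapely or Scrapy.
_HEADER_CHARSET_RE = re.compile(
    r'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)
_META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def guess_encoding(content_type, body, default="utf-8"):
    """Pick a charset from the Content-Type header, then a meta tag, then a default."""
    if content_type:
        match = _HEADER_CHARSET_RE.search(content_type)
        if match:
            return match.group(1).lower()
    # Only look at the start of the document, where meta tags normally live.
    match = _META_CHARSET_RE.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def to_unicode(content_type, body):
    """Decode raw response bytes to a unicode string using the guessed charset."""
    encoding = guess_encoding(content_type, body)
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # iso-8859-1 maps every byte to a code point, so it never fails;
        # the result may still be wrong if the real encoding was something else.
        return body.decode("iso-8859-1")
```

With something like this you could decode the page yourself and hand unicode to the library, or pass the guessed name as the encoding argument to train().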
Trying to scrape pages with an iso-8859-1 character encoding throws a unicode error: