rodricios / eatiht

An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.
http://rodricios.github.io/eatiht
MIT License
435 stars 43 forks

Only English language? #22

Open laupt82 opened 9 years ago

laupt82 commented 9 years ago

Hi, I found your library really interesting. I need to extract the article content from web pages that may be written in different languages, mostly English and Italian. Unfortunately, when I tried to analyze Italian pages, I ran into an encoding error: "UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 4: character maps to <undefined>"

eugene-eeo commented 9 years ago

Could you provide the site and the version of Python that you are using (a python --version would do)?

laupt82 commented 9 years ago

Hi, thanks for your fast answer. This is not the same page that I tried before, but it produces the same error with your library: http://www.ilsole24ore.com/art/mondo/2015-05-11/bombe-nave-turca-largo-libia-131554.shtml?uuid=ABtub1dD The test was run on Windows 7 with Python 2.7.3.

eugene-eeo commented 9 years ago

Could you provide a stack trace as well? For now, the best option is to switch to a libextract (same algorithm as eatiht) + requests approach. My guess is that this encoding issue is mostly due to the (hacky) handling of HTTP requests in eatiht.
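
To illustrate what more careful handling of a response body could look like, here is a minimal sketch that decodes raw bytes using the charset declared in the Content-Type header. It is not eatiht's or libextract's code; the helper name decode_body and its fallback behaviour are assumptions made for this example only.

```python
import re


def decode_body(raw, content_type, fallback="utf-8"):
    """Decode raw response bytes to text (hypothetical helper, not part of eatiht).

    Uses the charset named in the Content-Type header when one is present,
    otherwise the fallback encoding. Undecodable bytes are replaced, not raised.
    """
    match = re.search(r"charset=[\"']?([\w.-]+)", content_type or "", re.I)
    encoding = match.group(1) if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back and replace bad bytes.
        return raw.decode(fallback, "replace")


# An Italian word served as ISO-8859-1 bytes, as a server might send it.
body = u"citt\u00e0".encode("iso-8859-1")
text = decode_body(body, "text/html; charset=ISO-8859-1")
```

With the requests library, the bytes would typically come from the response's content attribute and the header from its headers mapping; that wiring is left out here so the sketch runs without network access.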

laupt82 commented 9 years ago

OK, thanks, I will try libextract. The error traceback:

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u010d' in position 22: character maps to <undefined>
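
The cp850.py frame in the traceback suggests the error is raised while encoding text for the Windows console code page, which has no mapping for characters such as u'\u010d' or u'\u2019'. The following is a minimal sketch of that failure and one possible workaround, assuming the console really is cp850; the helper name to_console_bytes and the sample string are made up for illustration.

```python
def to_console_bytes(text, encoding="cp850"):
    """Encode text for a console code page (hypothetical helper).

    Characters the code page cannot represent become '?' instead of
    raising UnicodeEncodeError.
    """
    return text.encode(encoding, "replace")


# Sample text with two characters that cp850 cannot represent.
sample = u"Ra\u010dan: it\u2019s fine"

try:
    sample.encode("cp850")  # strict encoding, as an implicit console encode would do
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

safe = to_console_bytes(sample)
```

The replacement is lossy, so it only suits output meant for display. Writing the extracted text to a file opened with an explicit UTF-8 encoding keeps every character intact.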