misja / python-boilerpipe

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 20: ordinal not in range(128) #18

Open marcoippolito opened 10 years ago

marcoippolito commented 10 years ago

Hi, when running this code on my Ubuntu 12.04 micro-instance:

#!/usr/bin/python

import boilerpipe

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url="http://europe.wsj.com/home-page")
extracted_text = extractor.getText()
print extracted_text
extracted_html = extractor.getHTML()

I get this error:

python boilerpipeTrial.py
Traceback (most recent call last):
  File "boilerpipeTrial.py", line 9, in <module>
    print extracted_text
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 20: ordinal not in range(128)

where line 9 is: print extracted_text

Would you please give me some hints on how to solve it?

Kind regards. Marco

marcoippolito commented 10 years ago

I solved this issue by adding:

extracted_text_u = extracted_text.encode('utf-8','replace')
print extracted_text_u

Are there any drawbacks to adding this?
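For anyone weighing that workaround, here is a minimal sketch of what the explicit encode does. It does not call boilerpipe at all: the sample string is a made-up stand-in for the value returned by extractor.getText(), and encode_for_output is a hypothetical helper name used only for illustration. The sketch assumes the error comes from print implicitly encoding a unicode string with the ASCII codec, which is what the traceback suggests.

```python
# Stand-in for the unicode text returned by extractor.getText().
# The content is invented; it only needs to contain U+00BB, the
# character named in the traceback.
extracted_text = u"Europe Home \xbb News"


def encode_for_output(text, encoding="utf-8", errors="replace"):
    """Hypothetical helper mirroring the workaround from the thread:
    turn unicode text into bytes before handing it to the output stream."""
    return text.encode(encoding, errors)


# 1. Reproduce the failure: encoding with the ASCII codec and the default
#    'strict' error handler raises UnicodeEncodeError on U+00BB.
try:
    extracted_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# 2. The workaround: UTF-8 can represent U+00BB, so nothing is lost and
#    the 'replace' handler is never triggered for this character.
utf8_bytes = encode_for_output(extracted_text)

# 3. For contrast: 'replace' with an ASCII target silently substitutes '?',
#    which avoids the exception but loses the original character.
ascii_bytes = encode_for_output(extracted_text, encoding="ascii")

print(repr(utf8_bytes))
print(repr(ascii_bytes))
```

Under these assumptions the main thing to watch is that the UTF-8 bytes only display correctly if the terminal or file receiving them is also interpreted as UTF-8. The 'replace' argument is harmless here, but with a narrower target encoding such as ASCII it would quietly replace characters with '?' instead of raising. Another option that leaves the script untouched is setting the PYTHONIOENCODING environment variable to utf-8 before running it.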