ptwobrussell / Mining-the-Social-Web-2nd-Edition

The official online compendium for Mining the Social Web, 2nd Edition (O'Reilly, 2013)
http://bit.ly/135dHfs

chapter 5 Example 5-1, using boilerpipe #131

Closed hvd closed 10 years ago

hvd commented 10 years ago

Invoking Extractor.getText() on Python 2.7 raises a UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 470: ordinal not in range(128)

This is easily fixed by encoding the text before printing it: print extractor.getText().encode('utf-8')
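For anyone who wants to see the failure without installing boilerpipe, here is a minimal sketch. The `sample_text` string and the `try_encode` helper are hypothetical stand-ins (not part of the book's example); the only thing taken from the report is the character U+2019 named in the error message.

```python
# Hypothetical stand-in for the text returned by extractor.getText();
# U+2019 (right single quotation mark) is the character from the traceback.
sample_text = u"It\u2019s a test"


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# The ascii codec cannot represent U+2019, so this attempt fails.
ascii_result = try_encode(sample_text, "ascii")

# UTF-8 can represent every Unicode code point, so this attempt succeeds.
utf8_result = try_encode(sample_text, "utf-8")
```

Encoding explicitly before printing, as suggested above, sidesteps whatever codec the interpreter would otherwise pick for standard out.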

ptwobrussell commented 10 years ago

@hvd Thanks for reporting this issue. Are you running this example in IPython Notebook or through a Python or IPython interpreter session?

hvd commented 10 years ago

You are welcome, @ptwobrussell. I encountered this in a Python interpreter session.

ptwobrussell commented 10 years ago

Thanks! I thought this might be the case.

I wonder if what is going on here is actually an issue with how an ordinary Python interpreter session handles writing UTF-8 to standard out. I've seen things like this before, and tweaking how Python handles sys.stdout helped:

import sys
import codecs

# Python 2: wrap stdout so unicode objects are encoded as UTF-8 when written
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print extractor.getText() # works as "expected" now?

(I may be mistaken, but I think this setting is preconfigured with IPython/IPython Notebook.)
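To show what the codecs.getwriter wrapper does without touching the real sys.stdout, here is a small sketch that wraps in-memory byte buffers instead. The `sample_text` value is a hypothetical stand-in for the extractor output, and the buffers are only there so the result can be inspected.

```python
import codecs
import io

# Hypothetical stand-in for extractor.getText(); contains U+2019.
sample_text = u"It\u2019s a test"

# A UTF-8 stream writer encodes text on the way into the underlying byte stream.
utf8_buffer = io.BytesIO()
utf8_writer = codecs.getwriter("utf-8")(utf8_buffer)
utf8_writer.write(sample_text)
utf8_bytes = utf8_buffer.getvalue()

# An ASCII stream writer fails on the same text, mirroring the reported error.
ascii_writer = codecs.getwriter("ascii")(io.BytesIO())
try:
    ascii_writer.write(sample_text)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

The wrapper only changes which codec sits between the text and the byte stream; the text itself is untouched.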

Given that the exception is a UnicodeEncodeError referencing the ascii codec, I think what could be happening is that there is an implicit coercion to ascii that happens in some circumstances when you are attempting to print Unicode to standard out.
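One way to check this hypothesis is to look at the encoding the interpreter reports for standard out. The sketch below is an assumption about where to look rather than a confirmed diagnosis of this issue; `describe_stream_encoding` is a hypothetical helper name.

```python
import sys


def describe_stream_encoding(stream):
    """Return the encoding a stream reports, or 'unknown' if it reports none."""
    return getattr(stream, "encoding", None) or "unknown"


stdout_encoding = describe_stream_encoding(sys.stdout)
print("sys.stdout.encoding: %s" % stdout_encoding)
```

If the reported value is an ASCII variant or missing, that would be consistent with the implicit coercion described above. Setting the PYTHONIOENCODING environment variable to utf-8 before starting the interpreter is another commonly used way to change it.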

I'm curious if the suggestion I've made here deals with the error you are seeing?

ptwobrussell commented 10 years ago

Wanted to check back with you one last time before closing this issue...

hvd commented 10 years ago

Hey, sorry Matthew. Let me get back to you in a day or two. Best, Hersh

Harshvardhan Kelkar

hvd commented 10 years ago

@ptwobrussell Just checked your fix, and it resolves the issue too. Thanks, Hersh

ptwobrussell commented 10 years ago

Excellent! Thanks so much for confirming.