nltk / nltk_contrib

NLTK Contrib
http://nltk.org/

readabilitytests problem with utf-8 characters #17

Closed: brendanwood closed this issue 9 years ago

brendanwood commented 9 years ago

I ran into a problem trying to apply the readability tests to a block of text with some UTF-8 characters (fancy quotes).

Sample text: http://pastebin.com/eRKGMGYn

Test script: http://pastebin.com/aE2DaRvk

I'm not very familiar with nltk_contrib, so perhaps I'm just using it wrong, but it seems to fail regardless of whether I pass a bytestring or a unicode string to ReadabilityTool. I forked nltk_contrib and changed textanalyzer.py so that it works on unicode instead of bytes, and that seems to have fixed the problem for me.

My fork: https://github.com/priceonomics/nltk_contrib

Can someone confirm the issue I'm seeing and whether my fix is appropriate? Feel free to merge it back if it's useful.

kmike commented 9 years ago

Switching ReadabilityTool to unicode is the way to go, and your changes look good. See also: #11.
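
For readers following along, here is a minimal sketch of the general idea: decode to unicode once at the boundary, then do all text processing on unicode. This is not the code from the fork or from textanalyzer.py; `ensure_text` and `count_words` are hypothetical helpers made up for illustration, and the input is assumed to be UTF-8 encoded.

```python
# -*- coding: utf-8 -*-
import re


def ensure_text(text, encoding="utf-8"):
    """Hypothetical helper: return a unicode string, decoding bytes if needed.

    Assumes byte input is encoded as `encoding` (UTF-8 by default).
    """
    if isinstance(text, bytes):
        return text.decode(encoding)
    return text


def count_words(text):
    """Hypothetical helper: count word tokens after normalizing to unicode."""
    text = ensure_text(text)
    # re.UNICODE lets \w match non-ASCII letters such as "é".
    # Curly quotes are not word characters, so they act as separators.
    return len(re.findall(r"\w+", text, re.UNICODE))


# Sample with curly quotes and an accented letter, as unicode and as UTF-8 bytes.
sample = u"\u201cFancy quotes\u201d shouldn\u2019t break the caf\u00e9 test."
raw = sample.encode("utf-8")

# Both forms give the same count because bytes are decoded up front.
print(count_words(sample), count_words(raw))
```

Doing the decode in one place means the rest of the analysis never has to care whether the caller passed bytes or unicode, which appears to be the kind of inconsistency described in this issue.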

kmike commented 9 years ago

Thanks! I'll close this ticket, but leave #11 because it seems there are other unicode-related issues which could affect ReadabilityTool.