Closed by brendanwood 9 years ago
Switching ReadabilityTool to unicode is the way to go, and your changes look good. See also: #11.
Thanks! I'll close this ticket, but leave #11 because it seems there are other unicode-related issues which could affect ReadabilityTool.
I ran into a problem trying to apply the readability tests to a block of UTF-8 text containing some non-ASCII characters (fancy quotes).
Sample text: http://pastebin.com/eRKGMGYn
Test script: http://pastebin.com/aE2DaRvk
I'm not very familiar with nltk_contrib, so perhaps I'm just using it wrong, but it seems to fail regardless of whether I pass a byte string or a unicode string to ReadabilityTool. I forked nltk_contrib and changed textanalyzer.py so that it takes unicode instead of bytes, and that seems to have fixed the problem for me.
My fork: https://github.com/priceonomics/nltk_contrib
Can someone confirm the issue I'm seeing and whether my fix is appropriate? Feel free to merge it back if it's useful.
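To illustrate the idea of "takes unicode instead of bytes", here is a minimal sketch of normalizing input to text before any analysis. It is not the code from the fork and does not call nltk_contrib. The helper name `to_text` is hypothetical, and it assumes Python 3 and UTF-8 encoded input.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) string.

    Hypothetical helper: bytes are decoded with the assumed encoding,
    text strings are returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Sample with curly ("fancy") quotes, first as UTF-8 bytes, then as text.
sample_bytes = "\u201cfancy quotes\u201d".encode("utf-8")
sample_text = to_text(sample_bytes)

# Each curly quote is 3 bytes in UTF-8 but a single character as text,
# so byte-oriented processing sees a different length than text processing.
print(len(sample_bytes), len(sample_text))
```

Decoding once at the boundary means everything downstream (tokenizing, counting characters or syllables) works on characters rather than raw bytes.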