lemmatize() botches Umlaut characters

workflowsguy commented 9 years ago

When trying the examples from the tutorial on Python 2.7.9, blob.words.lemmatize() incorrectly outputs "schön" as u'sch\xf6n'.

This is the code I used:

#!/usr/bin/python
# -*- coding: utf8 -*-
from textblob_de import TextBlobDE as TextBlob
blob = TextBlob(u"Das Auto ist sehr schön.")
print(blob.parse())
print(blob.words.lemmatize())

Output:

Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O ././O/O
[u'das', u'Auto', u'sein', u'sehr', u'sch\xf6n']

Thanks,

Guy

markuskiller commented 9 years ago

Hi Guy

Thanks for your report.

I think the second print statement in your script would have to be changed to:

>>> print(", ".join(blob.words.lemmatize()))
das, Auto, sein, sehr, schön

The output you've described is not a bug but standard behaviour on Python 2 (Python 3 would give you the expected output).

EXPLANATION: Printing a list in Python 2 shows the repr() of each element, and the repr() of a unicode string escapes non-ASCII characters. Your second output line is therefore the standard representation, not a bug or botched characters:

>>> a = [u"schön"]
>>> print(a)
[u'sch\xf6n']
>>> # The umlaut is displayed correctly when the
>>> # string within the list is printed:
>>> print(a[0])
schön
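
The same distinction can be reproduced on Python 3, where the escaped form has to be requested explicitly. This is only a minimal sketch: the variable names are made up for illustration, and the built-in ascii() is used here as a stand-in for what repr() does to unicode strings on Python 2.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: the same text, shown in its readable and its escaped form.
word = u"sch\u00f6n"         # "schön", written with an escape so the file encoding does not matter
words = [u"sehr", word]

readable = str(word)         # what print(word) shows: schön
escaped = ascii(word)        # escaped form, like repr() on Python 2: 'sch\xf6n'
escaped_list = ascii(words)  # a list is formatted element by element: ['sehr', 'sch\xf6n']

print(readable)
print(escaped)
print(escaped_list)
```

In both cases the string itself is intact; only the way it is displayed differs.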

blob.words.lemmatize() returns a WordList, which behaves like a list (consistent with the textblob main package):

>>> blob.words.lemmatize()
WordList([u'das', u'Auto', u'sein', u'sehr', u'sch\xf6n'])
>>> print(blob.words.lemmatize())
[u'das', u'Auto', u'sein', u'sehr', u'sch\xf6n']
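
The difference between the two outputs above (with and without the WordList(...) wrapper) is what one would expect from a list subclass that defines its own __repr__ and __str__. The class below is a hypothetical stand-in written for illustration, not textblob's actual implementation, and it is shown on Python 3, so the umlaut is not escaped.

```python
# -*- coding: utf-8 -*-
class WordListSketch(list):
    """Hypothetical stand-in for a list subclass such as WordList."""

    def __repr__(self):
        # Shown when the object is echoed in the interactive shell.
        return "{0}({1})".format(type(self).__name__, list.__repr__(self))

    def __str__(self):
        # Shown by print(): the plain list representation.
        return list.__repr__(self)


lemmas = WordListSketch([u"das", u"Auto", u"sein", u"sehr", u"sch\u00f6n"])

shown_in_shell = repr(lemmas)   # WordListSketch(['das', ..., 'schön'])
shown_by_print = str(lemmas)    # ['das', ..., 'schön']
joined = u", ".join(lemmas)     # das, Auto, sein, sehr, schön

print(shown_in_shell)
print(shown_by_print)
print(joined)
```

Because the object is still a list, iterating over it or passing it to join() works on the strings themselves rather than on their representations.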

You can either iterate over this list to get the string representations:

>>> for lemma in blob.words.lemmatize():
...     print(lemma)
das
Auto
sein
sehr
schön

Or you could use the following statement to print the list of lemmas as a one-line string, as suggested above:

>>> print(", ".join(blob.words.lemmatize()))
das, Auto, sein, sehr, schön

workflowsguy commented 9 years ago

Thank you for the explanation, Markus. Using textblob on Python 3 seems to be easier with regard to how "special" characters are handled (I had not used Python 2 for a while and forgot about those issues).

Thanks again,

Guy

markuskiller / textblob-de

lemmatize() botches Umlaut characters #12