The words list in the Corpus object is currently stored as a String array rather than a Unicode array, which causes characters like ñ to be entered as a+-. To fix this, we will need to modify the Corpus object and have the corpusbuilder module open files with codecs.open instead.
The np.asarray call on line 122 of base corpus should automatically infer a Unicode dtype if the strings being passed in are Python unicode objects, so the issue originates in corpusbuilders.
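A minimal sketch of the intended behavior, not the project's actual code: read_words and the sample file are hypothetical stand-ins for whatever corpusbuilders does when it loads text. It assumes UTF-8 input and shows that tokens decoded with codecs.open give np.asarray a Unicode dtype.

```python
import codecs
import os
import tempfile

import numpy as np


def read_words(path, encoding="utf-8"):
    """Hypothetical helper: read a file as decoded text and split on whitespace."""
    with codecs.open(path, "r", encoding=encoding) as f:
        return f.read().split()


# Write a small UTF-8 sample file containing non-ASCII characters.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with codecs.open(sample_path, "w", encoding="utf-8") as f:
    f.write(u"el ni\u00f1o come pi\u00f1a")

# Because the tokens are text (not bytes), NumPy infers a Unicode dtype.
words = np.asarray(read_words(sample_path))
print(words.dtype.kind)
```

If the file were read without decoding, the tokens would be byte strings and the resulting array would carry a bytes dtype, which is where characters such as ñ get mangled.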