Unicode in Corpus objects

inpho / vsm

Vector Space Model Framework developed for InPhO

http://inpho.github.io/vsm

Unicode in Corpus objects #99

Closed JaimieMurdock closed 9 years ago

JaimieMurdock commented 9 years ago

The words list in the Corpus object is currently stored as a string array rather than a Unicode array, which causes characters like ñ to be entered as Ã±. To fix this, we will need to modify the Corpus object and have the corpusbuilder module read files with codecs.open instead.
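
A minimal sketch of the kind of change this implies, not the actual corpusbuilder code: the helper name `read_words`, the whitespace tokenization, and the UTF-8 encoding of the input are all assumptions made for illustration. It reads a file through `codecs.open` so the tokens come back as Unicode text, and shows how ñ turns into Ã± when its UTF-8 bytes are interpreted as Latin-1.

```python
import codecs
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Hypothetical helper: read a file and return its tokens as Unicode text."""
    # codecs.open decodes the bytes on read, so f.read() yields Unicode text.
    with codecs.open(path, "r", encoding=encoding) as f:
        return f.read().split()


# Write a small UTF-8 sample file to read back (assumed input encoding).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"el ni\u00f1o duerme".encode("utf-8"))

words = read_words(sample_path)
os.remove(sample_path)

# The same two UTF-8 bytes misread as Latin-1 produce the garbled form.
mojibake = u"\u00f1".encode("utf-8").decode("latin-1")

print(words)
print(mojibake)
```

On Python 3 the built-in `open(path, encoding="utf-8")` decodes in the same way.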

JaimieMurdock commented 9 years ago

The np.asarray call on line 122 of the base corpus should automatically infer a Unicode dtype if the strings being passed are Python unicode objects. The issue originates in corpusbuilders.
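
A quick way to check that claim, as a sketch with made-up sample words rather than real corpus data: `np.asarray` picks its dtype from the elements it receives, so Unicode input gives a Unicode array and encoded byte strings give a byte-string array.

```python
import numpy as np

# Illustrative tokens only; not taken from any real corpus.
unicode_words = [u"ni\u00f1o", u"se\u00f1or"]
byte_words = [w.encode("utf-8") for w in unicode_words]

unicode_arr = np.asarray(unicode_words)
bytes_arr = np.asarray(byte_words)

# dtype.kind is 'U' for Unicode arrays and 'S' for byte-string arrays.
print(unicode_arr.dtype.kind)
print(bytes_arr.dtype.kind)
```

If the words array comes out with kind 'S', the strings were already byte strings before they reached np.asarray, which points back at how the files are read.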