Word2VecKeyedVectors.vocab.keys() broken with Chinese characters

piskvorky / gensim

Problem description

A gensim model was trained under Python 2.7 on a Chinese dataset.

We are now using Python 3.6, and the strings returned by .vocab.keys() come back garbled, as shown in the output below.

Are there any steps to convert a model trained under Python 2.7 so that it is compatible with Python 3.6?

Thanks in advance.

Steps/code/corpus to reproduce

import gensim

gmodel = gensim.models.Word2Vec.load('word2vec.emb')
# dict views are not sliceable on Python 3, so materialize the keys first
words = list(gmodel.wv.vocab.keys())
print(words[:10])

['',
 'æ··ç\x9d\x80',
 'è\x82\x9aå\xad\x90å¤§',
 'é\x82\x84ä¹\x8b',
 'DISSFMIINATIOII',
 'é\x83\x91ä¹\x9dç§\x91',
 'æ\x9c\x89è\x85¹é\x83¨æ\x89\x8bæ\x9c¯å\x8f²',
 'è¾\x83ä»\x85ç\x94¨',
 'ä»¥ç¬\x94',
 'ä»¥ç¬\x91']
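The garbled keys above look like UTF-8 bytes that were decoded as Latin-1, which is what typically happens when Python 2 byte strings are unpickled under Python 3 with a Latin-1 encoding. A minimal sketch of a possible workaround follows, assuming that diagnosis is right. The helper names `fix_mojibake` and `repair_vocab` are hypothetical and not part of gensim, and the `SimpleNamespace` object is only a stand-in for `gmodel.wv`, assumed to expose `vocab` (a dict keyed by word) and `index2word` (a list of words) as in gensim 3.x.

```python
from types import SimpleNamespace


def fix_mojibake(word):
    """Undo a UTF-8-read-as-Latin-1 mix-up; leave other strings untouched."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return word.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The word does not fit the mojibake pattern, so keep it as-is.
        return word


def repair_vocab(wv):
    """Re-key a KeyedVectors-like object in place (hypothetical helper).

    Assumes `wv.vocab` is a dict keyed by word and `wv.index2word`
    is a list of words, as in gensim 3.x.
    """
    wv.vocab = {fix_mojibake(w): entry for w, entry in wv.vocab.items()}
    wv.index2word = [fix_mojibake(w) for w in wv.index2word]
    return wv


# Stand-in for gmodel.wv, using two keys from the output above.
broken = "\xe6\xb7\xb7\xe7\x9d\x80"
fake_wv = SimpleNamespace(
    vocab={broken: 0, "DISSFMIINATIOII": 1},
    index2word=[broken, "DISSFMIINATIOII"],
)
repair_vocab(fake_wv)
print(fake_wv.index2word)  # ['混着', 'DISSFMIINATIOII']
```

If this matches the real model, the same call would be `repair_vocab(gmodel.wv)` followed by saving the model again under Python 3. Pure-ASCII keys survive the round trip unchanged, and keys that are already correct Unicode fall through the `except` branch.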

Versions

Linux-3.10.0_3-0-0-10-x86_64-with-centos-6.3-Final
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
NumPy 1.14.3
SciPy 1.1.0
gensim 3.7.3
FAST_VERSION 1

Word2VecKeyedVectors.vocab.keys() broken with Chinese characters #2702
