piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Word2VecKeyedVectors.vocab.keys() broken with chinese characters #2702

Open parap1uie-s opened 4 years ago

parap1uie-s commented 4 years ago

Problem description

A gensim model was trained under Python 2.7 on a Chinese dataset.

We are now using Python 3.6, and .vocab.keys() returns broken strings, as described in the title.

Are there any steps to convert a model trained under Python 2.7 so that it is compatible with Python 3.6?

Thanks in advance.

Steps/code/corpus to reproduce

import gensim

gmodel = gensim.models.Word2Vec.load('word2vec.emb')
# dict.keys() is a view in Python 3, so make a list before slicing
words = list(gmodel.wv.vocab.keys())
print(words[:10])
['',
 'æ··ç\x9d\x80',
 'è\x82\x9aå\xad\x90大',
 'é\x82\x84ä¹\x8b',
 'DISSFMIINATIOII',
 'é\x83\x91ä¹\x9dç§\x91',
 'æ\x9c\x89è\x85¹é\x83¨æ\x89\x8bæ\x9c¯å\x8f²',
 'è¾\x83ä»\x85ç\x94¨',
 '以ç¬\x94',
 '以ç¬\x91']

Versions

Linux-3.10.0_3-0-0-10-x86_64-with-centos-6.3-Final
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
NumPy 1.14.3
SciPy 1.1.0
gensim 3.7.3
FAST_VERSION 1

gojomo commented 4 years ago

If the model still loads and works as expected in Python 2.7, you might be able to modify the .vocab dict there to use true unicode strings as keys, then re-save the model so that it loads cleanly in Python 3.x.
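
As a rough illustration of the re-keying idea, here is a minimal sketch that works on a plain dict rather than on a real model. It takes a different route from the Python 2.7 one suggested above: it runs under Python 3 and assumes the broken keys are UTF-8 bytes that were read as Latin-1, which may not be what happened to this particular file. The helper names `repair_key` and `rekey` are hypothetical and not part of gensim. For a real gensim 3.x model, the same repair would presumably also be needed for the word list kept alongside `.vocab` (`index2word`); that is an assumption about internals and is not shown here.

```python
def repair_key(key):
    """Return a best-effort Unicode version of a vocabulary key (hypothetical helper)."""
    if isinstance(key, bytes):
        # Python 2 style byte string: assume it holds UTF-8.
        try:
            return key.decode("utf-8")
        except UnicodeDecodeError:
            return key
    try:
        # Undo "UTF-8 bytes read as Latin-1".
        return key.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already proper text, or damaged in some other way: leave it alone.
        return key


def rekey(mapping):
    """Build a new dict with repaired keys; refuse to silently merge collisions."""
    fixed = {}
    for key, value in mapping.items():
        new_key = repair_key(key)
        if new_key in fixed:
            raise ValueError("two keys repair to the same string: %r" % (new_key,))
        fixed[new_key] = value
    return fixed


# The first key is the second entry from the output in the report,
# written with escapes so the source file stays ASCII.
sample = {"\xe6\xb7\xb7\xe7\x9d\x80": 0, "DISSFMIINATIOII": 1}
repaired = rekey(sample)
print(repaired)
```

Keys that cannot be round-tripped this way (for example, strings that already contain real Chinese characters) are returned unchanged, so they would still need a manual look.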

Alternatively, a full recipe might help generate other ideas for patching the model: create (in Python 2.7) a new tiny model with at least one problem word, show that it works under Py2.7, save the model, load it in Py3.x, and show the problem there. (Without seeing representative code for creating your word2vec.emb, or the file itself, it's not clear what might have happened to cause the problem.)
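
Until such a recipe exists, one hypothesis can be checked without gensim at all. The sketch below is a stdlib-only stand-in, not the requested model recipe. It encodes a two-character Chinese string as UTF-8, decodes those bytes as Latin-1, and compares the result with one of the keys shown in the report. The sample word is chosen for illustration only. A match is consistent with the byte strings stored under Python 2.7 having been decoded as Latin-1 on load, but it does not prove that this is what happened to word2vec.emb.

```python
# A real two-character Chinese string, written with escapes.
original = "\u6df7\u7740"

# What a Python 2.7 byte string holding this word would contain.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Latin-1 maps every byte to one character.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# The same steps in reverse recover the original text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == original)
```

The garbled value printed here has the same characters as the second entry in the reported output, which is why a Latin-1 round trip is a reasonable first thing to try on the affected keys.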