senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

UnicodeDecodeError while using word2vec as vocabulary #14

Closed pallavi0335 closed 8 years ago

pallavi0335 commented 8 years ago

Hello, I have trained word2vec on my dataset and I want to use the model as input to TheanoLM. According to the documentation, word2vec output can be provided as a vocabulary, but I guess the issue is with the format. I am getting this error: UnicodeDecodeError: 'utf-8' codec can't decode byte in position 0: invalid start byte. Could you please guide me on that? Thanks.

senarvi commented 8 years ago

Hi, you can't use word2vec embeddings as input, but you can use word2vec's -classes switch to cluster words into classes and use the word classes as input. In my experience, classes created with mkcls work better, but word2vec is a lot faster.
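
The error reported above is what Python raises when a file that is not valid UTF-8 text is read as text. The following is a minimal sketch for checking a file before passing it as a vocabulary, assuming the failure came from a binary file such as saved embeddings. The helper name `first_undecodable_offset`, the file names, and the sample contents are hypothetical and are not part of TheanoLM or word2vec.

```python
import os
import tempfile


def first_undecodable_offset(path, encoding="utf-8"):
    """Return the offset of the first byte that fails to decode, or None."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the byte position reported in the error message.
        return err.start
    return None


# Demo with two throwaway files (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    # A plain-text file with one "word class" pair per line.
    text_path = os.path.join(tmp, "classes.txt")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("the 0\ncat 1\n")

    # A file that starts with a byte that cannot begin a UTF-8 sequence.
    binary_path = os.path.join(tmp, "vectors.bin")
    with open(binary_path, "wb") as f:
        f.write(b"\x80\x02binary payload")

    text_offset = first_undecodable_offset(text_path)
    binary_offset = first_undecodable_offset(binary_path)

print(text_offset)
print(binary_offset)
```

If the helper returns an offset, the file is not UTF-8 text and would need to be regenerated in a text format before being used as a vocabulary.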