nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

UnicodeDecodeError when loading embeddings using nlppipe Preprocessor method load_glove #64

Open BrutishGuy opened 4 years ago

BrutishGuy commented 4 years ago

Hi, I am running into an issue (on Windows, Python 3.7) where the Preprocessor method load_glove raises a UnicodeDecodeError when loading pre-trained word embeddings.

To reproduce, instantiate a Preprocessor with any Pandas DataFrame and try to load one of the GloVe embedding files. In my case, I am using glove.6B.300d (Wikipedia 2014 + Gigaword 5) from the official site, which this repository also links to: https://github.com/stanfordnlp/GloVe. I tried other embeddings as well, with the same result. I used 7-Zip to unpack the zip file and retrieve the .txt embeddings, as in the lda2vec example provided.

from lda2vec.nlppipe import Preprocessor
P = Preprocessor(YOUR_DF, "ANY_TEXT_COLUMN", max_features=30000, maxlen=10000, min_count=30)
embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")

The specific error thrown is as below:

Traceback (most recent call last):

  File "<ipython-input-4-e5cf0a369051>", line 3, in <module>
    embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")

  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in load_glove
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in <genexpr>
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>
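
For context, the traceback passes through cp1252.py because open() without an explicit encoding falls back to the locale's preferred encoding, which is cp1252 on many Windows installs. The GloVe .txt files appear to be UTF-8, and cp1252 has no mapping for byte 0x9d. A minimal sketch of the failure follows. It uses the right double quotation mark purely as an assumed example of a character whose UTF-8 encoding contains 0x9d; which character in the GloVe file actually triggers the error is not established here.

```python
# U+201D (right double quotation mark) encodes to b"\xe2\x80\x9d" in UTF-8.
# Illustrative character only, not necessarily the one in the GloVe file.
raw = "\u201d".encode("utf-8")

try:
    # Roughly what open() does implicitly when the locale encoding is cp1252.
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    # Byte 0x9d is undefined in cp1252, so decoding fails at that byte.
    cp1252_error = exc

# Decoding the same bytes as UTF-8 succeeds.
decoded_utf8 = raw.decode("utf-8")
```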

I believe this can be fixed by adding the argument encoding="utf8" to the open call in load_glove, on line 127 of nlppipe.py (line 118 in the installed copy shown in the traceback above).
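
A rough sketch of what the proposed fix could look like, written as a standalone function rather than a patch to nlppipe.py. The name load_glove_index is hypothetical, and the parsing into plain Python lists is an assumption made to keep the example dependency-free; the library's own get_coefs helper is not shown in the traceback and may build arrays differently.

```python
def load_glove_index(path):
    """Parse a GloVe-style text file into {word: [floats]} (hypothetical helper)."""
    index = {}
    # Passing encoding explicitly avoids the Windows locale default (cp1252).
    with open(path, encoding="utf8") as f:
        for line in f:
            # Each line is assumed to be: token, then space-separated floats.
            word, *values = line.rstrip("\n").split(" ")
            index[word] = [float(v) for v in values]
    return index
```

Using a with block also closes the file handle, which the one-line generator expression in the traceback leaves open until garbage collection.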

nateraw commented 4 years ago

Feel free to make a PR if you have a fix - I don't have the bandwidth to work on this repo anymore. Cheers