Hi,
I am currently experiencing an issue on Windows (Python 3.7) where the `Preprocessor` class throws a `UnicodeDecodeError` when loading pre-trained word embeddings.
To reproduce, it is sufficient to instantiate a `Preprocessor` with any Pandas dataframe and attempt to load one of the GloVe embedding files. In my case, I am using glove.6B.300d (Wikipedia 2014 + Gigaword 5), taken from the official site linked in this repository: https://github.com/stanfordnlp/GloVe. I have tried other embeddings as well, to no avail. I use 7zip to unpack the zip file and retrieve the .txt embeddings, as per the lda2vec example provided.
```python
from lda2vec.nlppipe import Preprocessor

P = Preprocessor(YOUR_DF, "ANY_TEXT_COLUMN", max_features=30000, maxlen=10000, min_count=30)
embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")
```
The specific error thrown is:
```
Traceback (most recent call last):
  File "<ipython-input-4-e5cf0a369051>", line 3, in <module>
    embedding_matrix = P.load_glove(EMBEDDING_DIR + "/" + "glove.6B.300d.txt")
  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in load_glove
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
  File "C:\ProgramData\Anaconda3\lib\site-packages\lda2vec\nlppipe.py", line 118, in <genexpr>
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>
```
I believe this can be fixed by adding `encoding="utf8"` to the `open()` call on line 127 of nlppipe.py. The GloVe files are UTF-8 encoded, but on Windows `open()` defaults to the locale codec (cp1252 here), which cannot decode some of the bytes in the embedding file.
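As a sketch of the proposed fix, here is a minimal standalone loader in the style of the traceback above (the `get_coefs` helper name is taken from the traceback; the exact surrounding code in `nlppipe.py` may differ):

```python
import numpy as np

def get_coefs(word, *arr):
    # First whitespace-separated token is the word; the rest are vector components.
    return word, np.asarray(arr, dtype="float32")

def load_glove(embedding_file):
    # encoding="utf8" is the fix: without it, Windows opens the file with the
    # locale codec (e.g. cp1252) and raises UnicodeDecodeError on some bytes.
    with open(embedding_file, encoding="utf8") as f:
        return dict(get_coefs(*line.rstrip().split(" ")) for line in f)
```

Using a `with` block also ensures the file handle is closed, which the original generator expression does not guarantee.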