piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

unicode is not defined #24

Closed avinashsai closed 6 years ago

avinashsai commented 6 years ago

When the GloVe vectors are downloaded and loaded into a model, it fails with "name 'unicode' is not defined".

menshikh-iv commented 6 years ago

Hello @avinashsai

  1. Which Python / OS / gensim / smart_open versions are you using?
  2. Which vectors exactly do you mean (name)?
avinashsai commented 6 years ago

I am using Google Colab notebooks with the latest gensim version and Python 3.

My code:

import gensim.downloader as api
info = api.info()
model = api.load("glove-twitter-25")

NameError: name 'unicode' is not defined

menshikh-iv commented 6 years ago

Not enough information. I don't see any unicode call in your code, and there is none in the loader code either. Please provide at least the full stack trace.

avinashsai commented 6 years ago
NameError                                 Traceback (most recent call last)
<ipython-input-54-cdd45c81647a> in <module>()
----> 1 model = api.load("glove-twitter-25")

/usr/local/lib/python3.6/dist-packages/gensim/downloader.py in load(name, return_path)
    416         sys.path.insert(0, base_dir)
    417         module = __import__(name)
--> 418         return module.load_data()
    419 
    420 

/content/gensim-data/glove-twitter-25/__init__.py in load_data()
      6 def load_data():
      7     path = os.path.join(base_dir, 'glove-twitter-25', 'glove-twitter-25.gz')
----> 8     model = KeyedVectors.load_word2vec_format(path)
      9     return model

/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
   1117         return _load_word2vec_format(
   1118             Word2VecKeyedVectors, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
-> 1119             limit=limit, datatype=datatype)
   1120 
   1121     def get_keras_embedding(self, train_embeddings=False):

/usr/local/lib/python3.6/dist-packages/gensim/models/utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
    172     logger.info("loading projection weights from %s", fname)
    173     with utils.smart_open(fname) as fin:
--> 174         header = utils.to_unicode(fin.readline(), encoding=encoding)
    175         vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
    176         if limit:

<ipython-input-49-595fa41a7f04> in any2unicode(text, encoding, errors)
      2     if isinstance(text, str):
      3         return text
----> 4     return unicode(text.replace('\xc2\x85', '<newline>'), encoding, errors=errors)

NameError: name 'unicode' is not defined
menshikh-iv commented 6 years ago

This looks pretty strange, because I see different code in the codebase: https://github.com/RaRe-Technologies/gensim/blob/c1e6c65d75c134e71a24fbf9fdecf448972d5316/gensim/utils.py#L339

I also re-checked just now, and this works as expected on 2.7, 3.5 and 3.6. Try re-installing gensim.

I'm closing this issue because it isn't reproducible.

piskvorky commented 6 years ago

Maybe something to do with the Google Colab environment? Does unicode exist there? I mean without gensim, just directly from the shell/notebook.

menshikh-iv commented 6 years ago

@piskvorky unicode doesn't exist in py3.6. The strangest thing here is that the code in the stack trace differs from the codebase.
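
For anyone who wants to confirm this in a notebook cell, here is a minimal sketch. It checks whether a `unicode` builtin exists and shows that `str` does the same job on Python 3. The helper name `to_text` is hypothetical and is not part of gensim.

```python
import builtins

# On Python 3 there is no `unicode` builtin; `str` is the text type.
HAS_UNICODE_BUILTIN = hasattr(builtins, "unicode")


def to_text(data, encoding="utf8", errors="strict"):
    """Hypothetical helper: decode bytes to str, pass str through unchanged."""
    if isinstance(data, str):
        return data
    # str(bytes, encoding, errors) is the Python 3 counterpart of
    # the Python 2 call unicode(bytes, encoding, errors).
    return str(data, encoding, errors=errors)


print(HAS_UNICODE_BUILTIN)
print(to_text(b"glove-twitter-25"))
```

A function that calls `unicode(...)` directly, like the one in the last frame of the stack trace, can therefore only run on Python 2.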

piskvorky commented 6 years ago

Right. Looks like any2unicode was redefined somewhere by the user.