piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Russian fastText embeddings trained on Araneum web corpus #27

Open akutuzov opened 6 years ago

akutuzov commented 6 years ago

Name: fasttext-ru_araneum-300
Link: http://rusvectores.org/static/models/rusvectores4/fasttext/araneum_none_fasttextcbow_300_5_2018.tgz
Description: fastText vectors trained on the Araneum Russicum Maximum corpus (about 10 billion words). The model contains 196K words and 403K 3-4-5-grams.
License: CC-BY (http://rusvectores.org/en/about/)
Related papers: https://arxiv.org/abs/1801.06407, https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models
Preprocessing: The corpus was lemmatized with Mystem.
Parameters: vector size 300, window size 5
Code example:

$ tar xzf araneum_none_fasttextcbow_300_5_2018.tgz
$ python3
>>> import gensim
>>> model = gensim.models.KeyedVectors.load('araneum_none_fasttextcbow_300_5_2018.model')
>>> for n in model.most_similar(positive=['уточка']):
...     print(n[0], round(n[1], 3))
...
чуточка 0.754
досочка 0.726
пинеточка 0.724
деточка 0.704
улиточка 0.693
нямочка 0.693
белочка 0.69
квочка 0.69
выточка 0.689
козочка 0.683

akutuzov commented 6 years ago

Because the model was trained on a lemmatized corpus, it helps to lemmatize Russian text before querying it, for example with pymystem3:

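from pymystem3 import Mystem

m = Mystem()  # create once and reuse; starting Mystem for every word is slow

def tag(word):
    # Return the lowercase lemma, or the word itself if Mystem has no analysis.
    processed = m.analyze(word)[0]
    analysis = processed.get("analysis")
    if not analysis:
        return word.lower().strip()
    return analysis[0]["lex"].lower().strip()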

tag('стульев')
стул
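
For whole sentences rather than single words, a minimal sketch along the same lines could rely on the `lemmatize()` method of pymystem3's `Mystem`, which returns lemmas mixed with whitespace and punctuation tokens. The helper names `lemmatize_text` and `demo` are hypothetical and only serve this illustration; the analyzer is passed in so that a single Mystem instance can be reused.

```python
def lemmatize_text(text, analyzer):
    """Return lowercase lemmas for `text`, skipping whitespace and punctuation.

    `analyzer` is any object with a Mystem-style `lemmatize(text)` method
    that returns a list of strings (lemmas mixed with separator tokens).
    """
    lemmas = []
    for token in analyzer.lemmatize(text):
        token = token.strip().lower()
        # Keep only tokens that contain at least one letter.
        if any(ch.isalpha() for ch in token):
            lemmas.append(token)
    return lemmas


def demo():
    # Requires pymystem3; imported here so the helper above has no hard dependency.
    from pymystem3 import Mystem

    print(lemmatize_text("Стульев было много.", Mystem()))
```

The resulting lemmas can then be passed to the model one by one, skipping or handling any that are missing from its vocabulary as needed.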

andrei-q commented 5 years ago

I got the following error:

>>> model = gensim.models.fasttext.FastText.load('araneum_none_fasttextcbow_300_5_2018.model')
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/fasttext.py", line 936, in load
    model = super(FastText, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/base_any2vec.py", line 1247, in load
    if not hasattr(model.vocabulary, 'ns_exponent'):
AttributeError: 'FastTextKeyedVectors' object has no attribute 'vocabulary'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/fasttext.py", line 945, in load
    return load_old_fasttext(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/fasttext.py", line 53, in load_old_fasttext
    old_model = FastText.load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/word2vec.py", line 1618, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/old_saveload.py", line 87, in load
    obj = unpickle(fname)
  File "/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/old_saveload.py", line 380, in unpickle
    return _pickle.loads(file_bytes, encoding='latin1')
AttributeError: Can't get attribute 'FastTextKeyedVectors' on <module 'gensim.models.deprecated.keyedvectors' from '/usr/local/lib/python3.5/dist-packages/gensim/models/deprecated/keyedvectors.py'>

akutuzov commented 5 years ago

@andrei-q Gensim's fastText code has been refactored since this issue was created. In recent versions of Gensim, you should use gensim.models.KeyedVectors.load() to load this model. I've changed the code snippet above accordingly.
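
As a rough sketch of that loading path wrapped into reusable helpers: `load_vectors` and `nearest` are hypothetical names introduced here, not part of Gensim. Only `KeyedVectors.load()` and `most_similar()` are actual Gensim calls, and whether the file loads still depends on the installed Gensim version.

```python
def load_vectors(path):
    """Load saved vectors with gensim.models.KeyedVectors.load()."""
    # Imported lazily so `nearest` below can be used without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load(path)


def nearest(model, word, topn=10):
    """Return (neighbour, similarity rounded to 3 places) pairs for `word`."""
    return [
        (neighbour, round(float(score), 3))
        for neighbour, score in model.most_similar(positive=[word], topn=topn)
    ]
```

For instance, nearest(load_vectors('araneum_none_fasttextcbow_300_5_2018.model'), 'уточка') would return the same kind of list as the snippet at the top of this issue.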

andrei-q commented 5 years ago

Thanks. It works.