koreyou / word_embedding_loader

Loaders and savers for different implementations of word embeddings
MIT License

Extremely slow when loading Stanford's Twitter GloVe model #7

Open JesseTG opened 6 years ago

JesseTG commented 6 years ago

I'm trying to load this model from this project. However, it's very slow, even on an HPC cluster: so slow that I'm not even sure it's actually loading. It could be in an infinite loop for all I know.

Is this library even meant for large files like this?

koreyou commented 6 years ago

Hi. I am aware of this issue too. This implementation seems to be fine on smaller word embeddings (e.g. the word2vec vectors distributed by Mikolov on Google Code) but extremely slow on larger files (e.g. GloVe). I will look into the issue and see if I can fix it.
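
For context, one common cause of this exact symptom (fine on small embeddings, pathological on large ones) is growing the embedding matrix by repeated concatenation while reading, which makes the total work quadratic in vocabulary size. Whether that is what happens in this library is not confirmed; the sketch below only contrasts the quadratic pattern with a preallocated, linear-time alternative, and all names in it are hypothetical.

import numpy as np

def load_growing(path: str, dim: int) -> np.ndarray:
    # Anti-pattern: np.vstack copies the whole matrix on every
    # append, so reading n rows costs O(n^2) in total.
    vectors = np.empty((0, dim), dtype=np.float32)
    with open(path, encoding="utf8") as f:
        for line in f:
            row = np.asarray(line.rstrip().split(" ")[1:], dtype=np.float32)
            vectors = np.vstack([vectors, row])
    return vectors

def load_preallocated(path: str, dim: int) -> np.ndarray:
    # Two passes: count the rows first, then fill a preallocated
    # buffer, keeping the total work linear in file size.
    with open(path, encoding="utf8") as f:
        n = sum(1 for _ in f)
    vectors = np.empty((n, dim), dtype=np.float32)
    with open(path, encoding="utf8") as f:
        for i, line in enumerate(f):
            vectors[i] = np.asarray(line.rstrip().split(" ")[1:], dtype=np.float32)
    return vectors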

JesseTG commented 6 years ago

Using numpy.loadtxt on the aforementioned data set is much faster, though I don't have numbers on hand right now. This is all I did:

import numpy

def load_word_embeddings(path: str, dim: int) -> numpy.ndarray:
    # Column 0 is the word, so the vector spans columns 1..dim
    # (the range must go to dim + 1); comments=None keeps '#' in
    # hashtag tokens from being read as a comment marker.
    return numpy.loadtxt(fname=path, usecols=range(1, dim + 1),
                         comments=None, encoding="utf8")

usecols starts at 1 because column 0 is the word. The file has to be read as UTF-8 because the Twitter word vector set contains a lot of non-Latin characters.
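
For what it's worth, a call against the standard Stanford Twitter release (file names like glove.twitter.27B.200d.txt, distributed in 25/50/100/200 dimensions; the local path here is an assumption) would look like:

# Hypothetical usage; adjust the path to wherever the GloVe file lives.
vectors = load_word_embeddings("glove.twitter.27B.200d.txt", dim=200)
print(vectors.shape)  # (vocabulary size, 200)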