JesseTG opened this issue 6 years ago (status: Open)
Hi. I am aware of the issue too. This implementation seems to be fine on smaller word embeddings (e.g. the word2vec vectors Mikolov distributed on Google Code) but extremely slow on larger files (e.g. GloVe). I will look into the issue and see if I can fix it.
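For anyone who wants actual numbers on "fine on small, slow on large", a minimal timing wrapper like the one below makes the comparison easy. The loader and file names are placeholders, not anything this project ships:

```python
import time

def time_load(loader, path):
    # Wrap any loader callable (this library's, numpy.loadtxt, ...) and
    # report the wall-clock time for a single load of the given file.
    start = time.perf_counter()
    result = loader(path)
    print(f"{path}: {time.perf_counter() - start:.1f} s")
    return result

# Example (placeholder loader and file name, adjust to your setup):
# time_load(my_loader, "glove.twitter.27B.200d.txt")
```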
Using `numpy.loadtxt` on the aforementioned data set is much faster, though I don't have numbers on hand right now. This is all I did:
```python
import numpy

def load_word_embeddings(path: str, dim: int) -> numpy.ndarray:
    # Column 0 is the word itself; columns 1..dim hold the vector components,
    # so read columns 1 through dim (range's stop is exclusive).
    return numpy.loadtxt(fname=path, usecols=range(1, dim + 1), encoding="utf8")
```
`usecols` starts at 1 because column 0 is the word. The data set must be read as UTF-8 because the Twitter word vector set has a lot of non-Latin characters.
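For reference, a call looks something like this, assuming the helper above and a downloaded copy of the 200-dimensional GloVe Twitter vectors; the file name is just an example, not something this repo provides:

```python
# Assumes the load_word_embeddings helper defined above; the path is an
# example and should point at wherever the vectors were unzipped.
vectors = load_word_embeddings("glove.twitter.27B.200d.txt", dim=200)
print(vectors.shape)  # expected: (vocabulary_size, 200)
```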
I'm trying to load this model from this project. However, it's very slow even on an HPC cluster, so slow that I'm not even sure it's actually loading; it could be in an infinite loop for all I know.
Is this library even meant for large files like this?
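For now, the only way I have of telling whether anything is happening is to bypass the library and stream the file myself with a progress counter. This is just a sanity-check sketch for the usual text format (one word followed by its float components per line); it does not use this project's API:

```python
import numpy

def load_with_progress(path: str, report_every: int = 100_000) -> dict:
    # Read a whitespace-separated text embedding file line by line,
    # printing a counter so it is obvious whether loading is progressing.
    vectors = {}
    with open(path, encoding="utf8") as f:
        for i, line in enumerate(f, start=1):
            word, *values = line.rstrip().split(" ")
            vectors[word] = numpy.asarray(values, dtype=numpy.float32)
            if i % report_every == 0:
                print(f"loaded {i:,} vectors...")
    return vectors
```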