lang-uk / word-vectors-uk

Scripts, stats and tasks for ubertext vectors trained using different models

Our main goal is to train good word vectors for our product, UberText. We did that for UberText v1.0, which was collected back in 2006, using LexVec, GloVe and word2vec on the full corpus (tokenized, lowercased, lemmatized, lemmatized and lowercased) and on its parts (fiction, news).

Those vectors were later evaluated using the intrinsic evaluation described here.

Now we are updating UberText to version 2.0. It will be roughly twice the size of version 1.0, cover more sources and have more perks.

We'd like to update the vectors as well:

The new set must certainly include fastText, but we are generally open to any ideas.