stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

vocab.txt and vector.txt #42

Closed tejuafonja closed 8 years ago

tejuafonja commented 8 years ago

I am having trouble understanding where to get the vocab.txt and vector.txt files. I am relatively new to this, so please help. Thanks.

ghost commented 8 years ago

Sure, let me first provide some high-level info. Users who simply want some word vectors to use in a generic application can download pretrained word vectors as per the README. Users who eventually want to train on their own corpus are encouraged to first train on the data that we provide, as per the README, so that they can go through the motions.

In the process of training on our data, you'll generate vocab.txt and vector.txt. So if you follow the README section at https://github.com/stanfordnlp/GloVe#train-word-vectors-on-a-new-corpus, you should be all set!
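For readers who want to inspect the two generated files, here is a minimal Python sketch. It assumes the plain-text layout the training run writes: one `word count` pair per line in the vocabulary file, and one `word v1 v2 ... vn` row per line in the vectors file. The function names are made up for illustration and are not part of the GloVe repository.

```python
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar("T")


def parse_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word count' lines into a word -> count mapping."""
    vocab: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, count = parts
        vocab[word] = int(count)
    return vocab


def parse_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'word v1 v2 ... vn' lines into a word -> vector mapping."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines or rows without any numbers
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def load_file(path: str, parser: Callable[[Iterable[str]], T]) -> T:
    """Hypothetical helper: open a text file and run one of the parsers on it."""
    with open(path, encoding="utf-8") as handle:
        return parser(handle)
```

After a training run, something like `load_file("vocab.txt", parse_vocab)` should return the word counts, with the file name adjusted to whatever your run actually wrote.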

ghost commented 8 years ago

Unfortunately, that's out of the scope of what we can do to help you. Best of luck!

tejuafonja commented 8 years ago

Thanks.

akshay-vaidya commented 7 years ago

@Russell91 Hi, I am just trying to understand what vectors.txt and vocab.txt contain. The first few records of the vocab.txt file are:

the 1061396
of 593677
and 416629
one 411764
in 372201
a 325873

What does the number corresponding to each word mean?

For instance, what does the 1061396 next to the word 'the' mean?

Thanks!

drawar commented 7 years ago

@akshay-vaidya: It's the number of times each unique word occurs in the training corpus.
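To make the meaning of those counts concrete, here is a simplified Python sketch of how such a listing could be produced. It is not the repository's C implementation. It assumes whitespace tokenization, a hypothetical `min_count` cutoff, and alphabetical ordering among words with equal counts.

```python
from collections import Counter
from typing import List, Tuple


def count_words(text: str, min_count: int = 1) -> List[Tuple[str, int]]:
    """Count whitespace-separated tokens, most frequent first."""
    counts = Counter(text.split())
    # Drop rare words (assumed cutoff; the default of 1 keeps everything).
    items = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by descending count; ties are ordered alphabetically here (an assumption).
    items.sort(key=lambda pair: (-pair[1], pair[0]))
    return items


def format_vocab(items: List[Tuple[str, int]]) -> str:
    """Render the counts as 'word count' lines, one entry per line."""
    return "\n".join(f"{word} {n}" for word, n in items)
```

Running `count_words` over a whole corpus and passing the result to `format_vocab` gives lines in the same `word count` shape as the vocab.txt records quoted above.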

akshay-vaidya commented 7 years ago

Thanks @drawar

anurag1234567 commented 6 years ago

@akshay-vaidya, @Russell91 Hi Akshay,

Could you please elaborate on the steps for generating the vocab.txt and vector.txt files on Windows? I am unable to follow the steps described in the README.

Thanks