stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Input file format #29

Closed armintabari closed 8 years ago

armintabari commented 8 years ago

What is the input file format? I know it should be whitespace-separated tokens, but how do you define the document boundaries?

ghost commented 8 years ago

If you open up the text8 example file, you'll get a sense of how the input file should be formatted. Documents are just concatenated together with spaces. If the documents in your data are quite short, it's possible this would become a problem, and you could likely get away with inserting dummy-token padding between documents. Does that make sense?

armintabari commented 8 years ago

Thank you for the response. It should count the co-occurrence of words in documents. So how does it figure out when one document ends and the next starts if documents and the tokens that make them up are both separated by spaces? I am talking about the input to vocab_count and how I should prepare my corpus for it. The text8 file is all one line. Shouldn't GloVe know about documents to count the co-occurrence of words?

ghost commented 8 years ago

The same issue was brought up here if you want to learn more. I'll try to add something to the readme to clear up that this is indeed the intended behavior.

armintabari commented 8 years ago

Okay, thank you. I got it. But what is the default length for a document in GloVe, and how can I change it? Is it the "MAX_STRING_LENGTH 1000"?

ghost commented 8 years ago

I've added some corpus creation instructions to the readme in https://github.com/stanfordnlp/GloVe/pull/31. Does that help make sense of things?

armintabari commented 8 years ago

Thanks, but I did not get the default document size or how I can change it. What do you mean by a short document? In short, what is the window size (the window in which the algorithm counts co-occurrences)?

ghost commented 8 years ago

Honestly, I don't know what to tell you. GloVe doesn't have a notion of documents and you just need to provide a single text file of words separated by spaces. I've provided a suggested workaround as well if you get bad results from following that recommendation. You can read the paper for more details on the window size.
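To make the window idea concrete, here is a minimal Python sketch of windowed co-occurrence counting over a single flat token stream. It is not the GloVe C code. The function name `count_cooccurrences`, the symmetric window, the 1/distance weighting, and the default of 15 are all assumptions made for illustration, so check the paper and the `cooccur` source for the actual behavior.

```python
from collections import defaultdict


def count_cooccurrences(tokens, window_size=15):
    """Count weighted co-occurrences within a sliding window.

    Hypothetical helper for illustration only. Assumes a symmetric
    window and 1/distance weighting; window_size=15 is an assumed default.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look back at most window_size tokens; anything further never pairs.
        start = max(0, i - window_size)
        for j in range(start, i):
            weight = 1.0 / (i - j)  # closer words contribute more
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight  # symmetric context
    return counts


tokens = "the cat sat on the mat".split()
counts = count_cooccurrences(tokens, window_size=2)
```

With `window_size=2`, "cat" and "mat" are four tokens apart and are never paired, even though the whole corpus is one line. Only words that fall within the window of each other are counted.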

1049451037 commented 5 years ago

@Russell91 Hi, I think the central question is how to count word-word co-occurrence if there are no document boundaries. Don't all the words co-occur if they are on one line?

armintabari commented 5 years ago

Sorry I did not respond sooner. I have figured out how it works. I had some misunderstanding due to a small mismatch between the paper and the implementation. To stop co-occurrences from being counted across document boundaries in a multi-document corpus, one just needs to insert a dummy term between documents, repeated a number of times that depends on the context window size.
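A minimal Python sketch of that workaround, assuming the padding should be as long as the context window. The helper name `build_corpus` and the `<pad>` token are hypothetical; any token that does not occur in the real data should do.

```python
def build_corpus(documents, window_size=15, pad_token="<pad>"):
    """Join documents into one space-separated string with dummy padding.

    Hypothetical helper for illustration only. Inserting window_size
    dummy tokens puts the last word of one document and the first word
    of the next window_size + 1 positions apart, i.e. outside the window.
    """
    padding = " ".join([pad_token] * window_size)
    separator = f" {padding} "
    # Normalize whitespace inside each document (including newlines).
    return separator.join(" ".join(doc.split()) for doc in documents)


docs = ["the cat sat", "dogs bark loudly"]
corpus = build_corpus(docs, window_size=2)
print(corpus)  # the cat sat <pad> <pad> dogs bark loudly
```

The dummy token still ends up in the vocabulary and co-occurs with words near document edges, so its vector can simply be ignored after training. Write the resulting string to a single text file and pass that file to vocab_count.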

