Closed armintabari closed 8 years ago
If you open up the text8 example file, you'll get a sense of how the input file should be formatted. Documents are simply concatenated together with spaces. If the documents in your data are quite short, this could become a problem, and you could likely get away with inserting dummy padding tokens between documents. Does that make sense?
Thank you for the response. It should count the co-occurrence of words within documents. So how does it figure out where one document ends and the next starts if documents and the tokens that make them up are both separated by spaces? I am talking about the input to vocab_count and how I should prepare my corpus for it. The text8 file is all one line. Shouldn't GloVe know about documents in order to count the co-occurrence of words?
The same issue was brought up here if you want to learn more. I'll try to add something to the readme to clear up that this is indeed the intended behavior.
Okay, thank you. I got it. But what is the default length for a document in GloVe, and how can I change it? Is it "MAX_STRING_LENGTH 1000"?
I've added some corpus creation instructions to the readme in https://github.com/stanfordnlp/GloVe/pull/31. Does that help make sense of things?
Thanks, but I still don't understand the default document size and how I can change it. What do you mean by a short document? In short, what is the window size (the window within which the algorithm counts co-occurrences)?
Honestly, I'm not sure what else to tell you. GloVe doesn't have a notion of documents; you just need to provide a single text file of words separated by spaces. I've also suggested a workaround in case you get bad results from following that recommendation. You can read the paper for more details on the window size.
@Russell91 Hi, I think the central question is: how do you count word-word co-occurrence if there is no document boundary? Aren't all the words co-occurring in one line?
Sorry I did not respond sooner. I have figured out how it works. I had some misunderstanding due to a small mismatch between the paper and the implementation. To stop counting co-occurrences across boundaries in a multi-document corpus, you just need to add a dummy token repeated a number of times that depends on your context window.
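For anyone finding this later, the workaround described above can be sketched in a few lines of Python. This is a minimal illustration, not part of GloVe itself: `build_corpus`, the `<doc_pad>` token name, and the window value are all my own choices, and you should match the window to whatever you pass to cooccur.

```python
PAD_TOKEN = "<doc_pad>"  # hypothetical dummy token; just pick one not in your vocabulary

def build_corpus(documents, window=15, pad=PAD_TOKEN):
    """Join documents into one whitespace-separated line, repeating a dummy
    token `window` times between them so that no co-occurrence window can
    span a document boundary.

    documents: list of token lists, one per document.
    """
    padding = [pad] * window
    tokens = []
    for i, doc in enumerate(documents):
        if i > 0:
            tokens.extend(padding)
        tokens.extend(doc)
    return " ".join(tokens)

docs = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]
print(build_corpus(docs, window=3))
# the cat sat <doc_pad> <doc_pad> <doc_pad> dogs bark loudly
```

After training you can simply discard the vector learned for the dummy token.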
What is the input file format? I know it should be whitespace-separated tokens, but how do you define the document boundaries?