stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

`cooccur` misses out the last token #28

Closed by watercrossing 8 years ago

watercrossing commented 8 years ago

Small bug in the `cooccur` tool: it misses out the last token.

$ build/vocab_count -min-count 5 -verbose 2 < text8 > vocab.txt
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 15 < text8 > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 71290 words.
Building lookup table...table contains 94990279 elements.
Processed 17005206 tokens.
Writing cooccurrences to disk.........2 files in total.
Merging cooccurrence files: processed 60666466 lines.

vocab_count counts 17005207 tokens, but cooccur only counts 17005206: the last token in text8 (`b`) is ignored.
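For illustration only (this is not the actual cooccur.c source), a minimal token scanner of the kind sketched below shows how exactly one token can go missing: if a word is only counted when a trailing delimiter is read, a corpus that does not end in whitespace never flushes its final buffered word.

```c
/* Hypothetical sketch, not GloVe's cooccur.c: a scanner that only emits a
 * token after reading a trailing delimiter. */
#include <stdio.h>
#include <ctype.h>

int main(void) {
    char word[256];
    int len = 0, c;
    long tokens = 0;

    while ((c = getchar()) != EOF) {
        if (isspace(c)) {
            /* A word is counted only when the delimiter after it is seen. */
            if (len > 0) { word[len] = '\0'; tokens++; len = 0; }
        } else if (len < 255) {
            word[len++] = c;
        }
    }
    /* Missing final flush here: if (len > 0) tokens++;
     * A corpus that ends mid-word, as text8 does, therefore loses its
     * last token and reports a count that is off by one. */
    printf("Processed %ld tokens.\n", tokens);
    return 0;
}
```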

ghost commented 8 years ago

Okay, we're looking into this.

ghost commented 8 years ago

So it looks like this is actually the correct behavior for the text8 example: there, the corpus was artificially cut off at exactly 100 million characters, so the final word was truncated in the middle and there is no desire to include it in the counts. I don't think this will be a common problem for most datasets, and even where it does happen it probably won't do much harm.
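For anyone who wants to confirm this on their own copy of the corpus, a quick standalone check of the file's tail makes the truncation visible. This is just a hypothetical helper; the filename "text8" is the corpus used in the commands above.

```c
/* Hypothetical check: print the last bytes of the corpus so you can see
 * whether it ends mid-word (as text8 does) or on a whitespace boundary. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("text8", "rb");   /* corpus file from the commands above */
    char tail[41];
    size_t n;

    if (!f) { perror("text8"); return 1; }
    if (fseek(f, -40L, SEEK_END) != 0) { fclose(f); return 1; }
    n = fread(tail, 1, 40, f);
    tail[n] = '\0';
    fclose(f);

    /* A corpus cut at an arbitrary byte count ends in a word fragment
     * rather than at a whitespace boundary. */
    printf("last %zu bytes: \"%s\"\n", n, tail);
    return 0;
}
```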