Small bug in the `cooccur` package: it misses out the last token. `vocab_count` counts 17005207 tokens, `cooccur` counts 17005206. The last `b` token in text8 is ignored.

Okay, we're looking into this.

So it looks like this is actually the correct behavior for the text8 example: that corpus was artificially cut off at exactly 100 million characters, so the final word was truncated in the middle and there is no desire to include it in the corpus. I don't think this will be a common problem for most datasets, but it probably won't do much harm either.
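For anyone who wants to reproduce the count locally, here is a minimal sketch; the `text8` path and the plain whitespace splitting are assumptions, though `vocab_count` should tokenize the same way:

```python
# Minimal check: count whitespace-separated tokens in text8 and inspect
# the last one. Assumes the unzipped text8 file (exactly 100 million
# characters) sits in the current directory; the path is an assumption.
with open("text8") as f:
    tokens = f.read().split()

print(len(tokens))  # expect 17005207, matching vocab_count
print(tokens[-1])   # the truncated final fragment (`b` per the report above)
```

If the first number matches `vocab_count`, the off-by-one comes entirely from `cooccur` skipping that final truncated fragment, which is consistent with the explanation above.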