The vector of tokens in the input internally uses indices large enough to handle more than 2 billion tokens, but the clustering code truncates them to ints.
This patch addresses that, and also bumps some other sizes that seemed appropriate without too detailed an analysis (so they could be misguided), while keeping most things as ints. The limit on the number of tokens in the corpus gets hit much sooner than the limit on word types: a sample corpus of mine has only 33 million word types but 11 billion tokens.
I'm also applying the same patch to generalized-brown.