percyliang / brown-cluster

C++ implementation of the Brown word clustering algorithm.

Enable >= 2^31 tokens in input data #15

Closed mannby closed 8 years ago

mannby commented 8 years ago

The vector of tokens in the input internally uses indices large enough to handle more than 2 billion tokens, but the clustering code truncates them to 32-bit ints.
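A minimal sketch of the failure mode (illustrative only, not the actual brown-cluster source): `std::vector::size()` returns a 64-bit `size_t` on most platforms, so assigning it to an `int` silently truncates past 2^31 - 1.

```cpp
#include <cstdio>
#include <cstddef>

int main() {
  std::size_t num_tokens = 3000000000ULL;  // e.g. a 3-billion-token corpus

  int as_int = static_cast<int>(num_tokens);              // truncated: value is lost
  long long as_ll = static_cast<long long>(num_tokens);   // preserved

  std::printf("as int:       %d\n", as_int);   // typically -1294967296
  std::printf("as long long: %lld\n", as_ll);  // 3000000000
}
```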

This patch addresses that, bumps some other sizes that seemed appropriate without a detailed analysis (so some of those choices could be misguided), and keeps most other things as ints. The limit on the number of tokens in the corpus is hit much sooner than the limit on word types: e.g., a sample corpus of mine has only 33 million word types, but 11 billion tokens. A sketch of the resulting type split follows below.
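One way such a patch can be structured (hypothetical names; the real diff touches the brown-cluster sources directly): widen only the types that index or count corpus positions, while word-type IDs stay as `int`, since 33 million types fit comfortably in 32 bits.

```cpp
#include <cstdint>
#include <vector>

using TokenIndex = std::int64_t;  // positions/counts over the corpus: can exceed 2^31
using WordId     = int;           // word-type IDs: tens of millions fit in an int

// Hypothetical accumulation loop over a corpus with billions of tokens.
TokenIndex count_occurrences(const std::vector<WordId>& text, WordId target) {
  TokenIndex n = 0;
  for (TokenIndex i = 0; i < static_cast<TokenIndex>(text.size()); ++i) {
    if (text[i] == target) ++n;  // n may legitimately exceed INT_MAX
  }
  return n;
}
```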

I'm also applying the same patch to generalized-brown.

percyliang commented 8 years ago

Cool, thanks.