mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

NegativeArraySizeException with large vocabulary #65

Open carschno opened 8 years ago

carschno commented 8 years ago

I tried to compute word embeddings with a vocabulary of 6105270 words and a dimensionality of 300, which resulted in a NegativeArraySizeException at WordEmbeddings.java:100:

weights = new double[numWords * stride];

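This appears to be an integer overflow: numWords * stride = numWords * 2 * numColumns = 6105270 * 2 * 300 = 3663162000, which exceeds the maximum int value of 2^31 - 1 = 2147483647. The product wraps around to a negative number, hence the exception.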
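To make the arithmetic concrete, here is a minimal standalone sketch (not MALLET code) that reproduces the wrap-around. The class and method names are made up for illustration; only the relation stride = 2 * numColumns is taken from the calculation above.

```java
public class OverflowDemo {
    // Mirrors the int arithmetic behind `new double[numWords * stride]`.
    // int multiplication in Java wraps silently on overflow.
    static int intSize(int numWords, int numColumns) {
        int stride = 2 * numColumns;
        return numWords * stride;
    }

    // The same product computed in 64-bit arithmetic, for comparison.
    static long longSize(int numWords, int numColumns) {
        long stride = 2L * numColumns;
        return numWords * stride;
    }

    public static void main(String[] args) {
        System.out.println(intSize(6105270, 300));   // -631805296
        System.out.println(longSize(6105270, 300));  // 3663162000
        System.out.println(Integer.MAX_VALUE);       // 2147483647
    }
}
```

The negative int is what ends up as the array size, which is why the JVM reports a NegativeArraySizeException rather than an out-of-memory error.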

The solution seems pretty easy: change the type of WordEmbeddings.numWords from int to long.

carschno commented 8 years ago

However, it turns out that this cannot be fixed so easily, because several arrays in WordEmbeddings.java are allocated with numWords as their size, e.g. in line 206:

IDSorter[] sortedWords = new IDSorter[numWords];

Java array sizes must be of type int, so numWords has to remain an int.
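For what it's worth, below is a rough sketch of two possible ways around this. Neither is taken from MALLET; the class and method names are hypothetical. The first only turns the silent overflow into a clear error via Math.multiplyExact. The second assumes the flat weights array could be replaced by one row per word, so that each dimension stays within the int range.

```java
public class WeightAllocation {
    // Fails fast with an ArithmeticException instead of letting the
    // product wrap around and surface as a NegativeArraySizeException.
    static int checkedFlatSize(int numWords, int stride) {
        return Math.multiplyExact(numWords, stride);
    }

    // Hypothetical alternative layout: one row of length `stride` per word.
    // Both numWords and stride stay ints, so no single array has to hold
    // numWords * stride elements.
    static double[][] allocateRows(int numWords, int stride) {
        return new double[numWords][stride];
    }

    public static void main(String[] args) {
        try {
            checkedFlatSize(6105270, 600);
        } catch (ArithmeticException e) {
            System.out.println("too large for a single array: " + e.getMessage());
        }

        // Small sizes only, so the demo runs with a default heap.
        double[][] rows = allocateRows(4, 6);
        System.out.println(rows.length + " x " + rows[0].length);
    }
}
```

Note that even with a row-per-word layout, 3663162000 doubles would need roughly 29 GB of heap, and any code that indexes into the flat array would have to be adapted.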