mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

ParallelTopicModel Exception java.lang.ArrayIndexOutOfBoundsException: -1 #42

Open tmylk opened 9 years ago

tmylk commented 9 years ago

Hi, I have encountered this issue in my project and am wondering whether a fix is planned. It would be great to be able to use ParallelTopicModel reliably in production.

Copying the description from https://github.com/jmcejuela/mallet/issues/3 (2014), where @ilyastam wrote:

Every once in a while I see the following exception thrown:

    java.lang.ArrayIndexOutOfBoundsException: -1
        at cc.mallet.topics.WorkerRunnable.sampleTopicsForOneDoc(WorkerRunnable.java:489)
        at cc.mallet.topics.WorkerRunnable.run(WorkerRunnable.java:275)
        at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:874)

When I went to the location where the exception is thrown, I saw the following code:

            i = -1;
            while (sample > 0) {
                i++;
                sample -= topicTermScores[i];
            }

            newTopic = currentTypeTopicCounts[i] & topicMask;

It appears that sample can sometimes be zero or negative when the loop is reached. In that case the loop body never runs, i stays at -1, and the JVM throws java.lang.ArrayIndexOutOfBoundsException when it evaluates newTopic = currentTypeTopicCounts[-1] & topicMask;

This seems like a bug to me. For my purposes I am patching it as follows:

            i = -1;
            while (sample > 0 || i < 0) {
                i++;
                sample -= topicTermScores[i];
            }

            newTopic = currentTypeTopicCounts[i] & topicMask;

I am not sure about the impact of this on the result, but it seems to fix the immediate problem with the code. Would be great to see a proper fix for this though.
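
To make the difference between the two loops concrete, here is a minimal standalone sketch. It is not MALLET code: the class and method names are hypothetical, and only the loop bodies are copied from the snippets above.

```java
// Standalone illustration; ScanLoopDemo and its methods are hypothetical names.
final class ScanLoopDemo {
    /** Loop as quoted from WorkerRunnable: returns -1 if sample is not positive. */
    static int originalScan(double sample, double[] topicTermScores) {
        int i = -1;
        while (sample > 0) {
            i++;
            sample -= topicTermScores[i];
        }
        return i;
    }

    /** Patched loop: the body always runs at least once, so i is never -1. */
    static int patchedScan(double sample, double[] topicTermScores) {
        int i = -1;
        while (sample > 0 || i < 0) {
            i++;
            sample -= topicTermScores[i];
        }
        return i;
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.5, 0.3};
        System.out.println(originalScan(0.6, scores)); // prints 1
        System.out.println(patchedScan(0.6, scores));  // prints 1
        System.out.println(originalScan(0.0, scores)); // prints -1, the bad index
        System.out.println(patchedScan(0.0, scores));  // prints 0
    }
}
```

For positive samples both loops pick the same index. They differ only when sample is zero or negative, where the patched loop always returns index 0.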

jbunzel92 commented 6 years ago

Is there any possibility that this issue will be fixed in the future? I am working with a large number of documents (40000+ docs) and this problem occurs in roughly one out of four runs. Cheers, Julian

mimno commented 6 years ago

Thanks for bumping this! Replicating the problem has been difficult. Could you describe your settings? Number of threads, tokens per document, vocab size, hyperparameter settings, etc.

I think the problem is that sample is exactly equal to 0.0. This could happen by chance but would be fantastically unlikely. It is more likely to happen because the sum of the sampling distribution is zero, which shouldn't be possible if the smoothing parameters are non-zero. The solution in the bug report above always sets the topic to 0 in this case, which doesn't solve the underlying problem and biases the sampler.
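
The zero-mass scenario can be sketched as follows. This assumes the sample is drawn as a uniform random number multiplied by the total sampling mass. The helper name and the fail-fast check are hypothetical and only illustrate how the condition could be surfaced for debugging; they are not part of MALLET.

```java
import java.util.Random;

// Standalone illustration; ZeroMassDemo and drawSample are hypothetical names.
final class ZeroMassDemo {
    /**
     * Draws a sample in [0, totalMass). Rejects a zero, negative, or NaN mass
     * instead of silently returning 0.0, which would skip the scan loop.
     */
    static double drawSample(Random random, double totalMass) {
        if (!(totalMass > 0.0)) { // the negated form also catches NaN
            throw new IllegalStateException(
                "sampling mass must be positive, got " + totalMass);
        }
        // nextDouble() is in [0, 1), so an exact 0.0 is still possible here,
        // but only with very small probability.
        return random.nextDouble() * totalMass;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        // Without the check, any uniform draw times a zero mass is exactly 0.0.
        System.out.println(random.nextDouble() * 0.0); // prints 0.0
        System.out.println(drawSample(random, 2.0) < 2.0); // prints true
    }
}
```

A check like this does not fix the sampler. It only reports the state (mass, document, token) at the moment the distribution collapses, which is the information needed to replicate the problem.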

jbunzel92 commented 6 years ago

Hey mimno, I looked deeper into my issue, and it turns out that what I encountered is not this exact problem but a different ArrayIndexOutOfBoundsException, in WorkerRunnable at line 541:

else {
    //smoothingOnlyCount++;
    sample -= topicBetaMass;
    sample /= beta;
    newTopic = 0;
    sample -= alpha[newTopic] / (tokensPerTopic[newTopic] + betaSum);
    while (sample > 0.0) {
        newTopic++;
        sample -= alpha[newTopic] / (tokensPerTopic[newTopic] + betaSum); //exceeding array here
    }
}
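
The loop above has no upper bound on newTopic, so it runs off the end of alpha whenever sample is still positive after the last topic. One plausible cause (an assumption, not confirmed in this thread) is that the mass used to draw sample no longer matches the current tokensPerTopic values, or differs from their sum by floating-point rounding. A minimal sketch of a bounded scan follows. The class and method names are hypothetical, and clamping to the last topic is a generic guard rather than a fix from the MALLET maintainers.

```java
// Standalone illustration; SmoothingScanDemo is a hypothetical name.
final class SmoothingScanDemo {
    /**
     * Scans the smoothing-only distribution alpha[t] / (tokensPerTopic[t] + betaSum).
     * Stops at the last topic even if sample is still positive.
     */
    static int sampleSmoothingTopic(double sample, double[] alpha,
                                    int[] tokensPerTopic, double betaSum) {
        int lastTopic = alpha.length - 1;
        int newTopic = 0;
        sample -= alpha[newTopic] / (tokensPerTopic[newTopic] + betaSum);
        while (sample > 0.0 && newTopic < lastTopic) {
            newTopic++;
            sample -= alpha[newTopic] / (tokensPerTopic[newTopic] + betaSum);
        }
        return newTopic;
    }

    public static void main(String[] args) {
        double[] alpha = {1.0, 1.0};
        int[] tokensPerTopic = {0, 0};
        double betaSum = 1.0; // each topic has weight 1.0 in this toy setup
        System.out.println(sampleSmoothingTopic(0.5, alpha, tokensPerTopic, betaSum)); // prints 0
        System.out.println(sampleSmoothingTopic(1.5, alpha, tokensPerTopic, betaSum)); // prints 1
        System.out.println(sampleSmoothingTopic(5.0, alpha, tokensPerTopic, betaSum)); // prints 1 (clamped)
    }
}
```

Clamping avoids the exception but slightly favors the last topic whenever it triggers, so it hides an inconsistency rather than resolving it.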

However, this does not seem to break the estimation, because of the bug (https://github.com/mimno/Mallet/issues/33) that was fixed with 2.0.8.

The problem that I am encountering most often right now is an ArrayIndexOutOfBoundsException in ParallelTopicModel at line 477:

while (targetCounts[targetIndex] > 0 && currentTopic != topic) {
    targetIndex++;
    if (targetIndex == targetCounts.length) {
        logger.info("overflow in merging on type " + type);
    }
    currentTopic = targetCounts[targetIndex] & topicMask;    //AIOOBE 
}
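
In the snippet above, the overflow is logged but the loop still reads targetCounts[targetIndex] with targetIndex == targetCounts.length. A minimal sketch of a bounded version of this scan follows. It assumes each entry packs a count in the high bits and a topic in the low bits selected by topicMask, with 0 marking an empty slot. The class and method names are hypothetical, and this is not the change made in any MALLET pull request.

```java
// Standalone illustration; MergeScanDemo and findSlot are hypothetical names.
final class MergeScanDemo {
    /**
     * Returns the index of the slot that already holds the given topic, or the
     * first empty slot, or -1 if the array is full and the topic is absent.
     */
    static int findSlot(int[] targetCounts, int topic, int topicMask) {
        for (int index = 0; index < targetCounts.length; index++) {
            if (targetCounts[index] <= 0) {
                return index; // first empty slot
            }
            if ((targetCounts[index] & topicMask) == topic) {
                return index; // slot already assigned to this topic
            }
        }
        return -1; // overflow: the caller must handle this explicitly
    }

    public static void main(String[] args) {
        int topicBits = 2;
        int topicMask = (1 << topicBits) - 1;
        int[] counts = {
            (5 << topicBits) | 1, // count 5, topic 1
            (2 << topicBits) | 3, // count 2, topic 3
            0                     // empty slot
        };
        System.out.println(findSlot(counts, 3, topicMask)); // prints 1
        System.out.println(findSlot(counts, 2, topicMask)); // prints 2
    }
}
```

Returning -1 forces the caller to decide what an overflow means (for example, growing the array or failing with a descriptive message) instead of letting the index run past the end.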

And this one bugs me as well: https://github.com/mimno/Mallet/issues/98

My pipeline is basically the standard pipeline. Documents contain between 11 and 90 tokens; alpha = 1, beta = 0.01, iterations = 1000, threads = 16:

        //...
        final InstanceList instances = new InstanceList(docPipe);
        instances.addThruPipe(docsIter);
        final ParallelTopicModel model =
            new ParallelTopicModel(10, 1, 0.01);
        model.setRandomSeed(m_seed);
        model.addInstances(instances);
        model.setNumThreads(16);
        model.setNumIterations(1000);
        //...

jbunzel92 commented 6 years ago

If there is anything I can do to speed up the process of getting this fixed, don't hesitate to contact me. Also, if any information is missing, please ask.

MansMeg commented 6 years ago

Another suggestion, if you want a parallel implementation based on Mallet, is to use the partially collapsed sampler found here: https://github.com/lejon/PartiallyCollapsedLDA

It uses somewhat more memory, but it is as fast (or faster) and does not make any AD-LDA approximations of the posterior. It may be an alternative for production use (I know it is used in many production settings).

jbunzel92 commented 6 years ago

Hey, thank you. I will have a look at it. :+1:

cbjrobertson commented 5 years ago

Has anyone had any luck fixing this? I am experiencing the same problem.

victorlaerte commented 5 years ago

@jbunzel92 This change, https://github.com/mimno/Mallet/pull/154, will certainly help you with this:

The problem that I am encountering most often right now is an ArrayIndexOutOfBoundsException in ParallelTopicModel at line 477: