Probability - Githubissues

simon21587 commented 5 years ago

The tutorial says "A state of the art algorithm for text classification is Multinomial Naive Bayes. This is a probabilistic learning method which calculates the probability of a document being in a category.", but the TNTClassifier class only returns 'label' and 'likelihood' (which is a negative number). So how do I get a list of labels and probabilities in percent for a document?

Instead of only two categories, I have, let's say, 10 categories, and a document may belong to several of them. So I'd like to get a list for each document saying:

Category 7: 95%
Category 3: 57%
Category 4: 35%
...

How do I achieve this with your classifier?

nticaric commented 5 years ago

What you could do is override the predict() method of the TNTClassifier class and save each likelihood to an array. After that, you would write a softmax function that gives you the percentage for each category.

simon21587 commented 5 years ago

Instead of overriding the predict() function, I am adding the following multi_predict() function:

public function multi_predict($statement)
{
    $words = $this->tokenizer->tokenize($statement);
    $types = [];
    $total_likelihood = 0;
    foreach ($this->types as $type) {
        $likelihood = log($this->pTotal($type)); // log prior: log P(Type)
        $p          = 0;
        foreach ($words as $word) {
            $word = $this->stemmer->stem($word);
            $p += log($this->p($word, $type)); // add log P(word | Type)
        }
        $likelihood += $p; // log P(Type) + sum of log P(word | Type)
        $types[$type] = $likelihood;
        $total_likelihood += $likelihood;
    }
    foreach ($types as &$type) {
        $type = $type / $total_likelihood;   
    }
    return $types;
}

Do you have any further suggestions?
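One thing to watch in the normalisation step of multi_predict(): the values in $types are log-likelihoods, so they are negative. Dividing each one by their sum does give positive shares that add up to 1, but the least likely type ends up with the largest share. A small standalone check, using made-up log-likelihoods and a hypothetical helper name (normalize_by_sum is not part of TNTClassifier):

```php
<?php
// Hypothetical helper that mirrors the normalisation step in multi_predict():
// every log-likelihood is divided by the sum of all log-likelihoods.
function normalize_by_sum(array $logLikelihoods): array
{
    $total  = array_sum($logLikelihoods);
    $shares = [];
    foreach ($logLikelihoods as $label => $value) {
        $shares[$label] = $value / $total;
    }
    return $shares;
}

// Made-up log-likelihoods: "sports" is the more likely type (closer to 0).
$scores = ['sports' => -2.0, 'politics' => -6.0];
$shares = normalize_by_sum($scores);

// sports gets -2 / -8 = 0.25 and politics gets -6 / -8 = 0.75,
// so the ranking is inverted relative to the likelihoods.
```

This is why the exponential in a softmax is needed before normalising: it maps the log values back to the likelihood scale, where a larger number really does mean a more likely type.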

nticaric commented 5 years ago

Here is an example of a softmax function, in case you want to get probability distributions:

https://gist.github.com/raymondjplante/d826df05349c1d4350e0aa2d7ca01da4
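For readers who cannot open the link, here is a minimal sketch of a softmax over per-label log-likelihoods. It is an independent illustration, not a copy of the linked gist. The function name, the sample labels, and the sample scores are assumptions made up for the example. The maximum is subtracted before exponentiating because log-likelihoods of long documents can be so negative that exp() would underflow to 0 for every label.

```php
<?php
/**
 * Sketch: convert log-likelihoods (one per label) into probabilities
 * that sum to 1, sorted from most to least likely.
 */
function softmax(array $logLikelihoods): array
{
    if ($logLikelihoods === []) {
        return [];
    }

    // Shift by the maximum so the largest exponent is exp(0) = 1.
    // This does not change the result but avoids numeric underflow.
    $max  = max($logLikelihoods);
    $exps = [];
    foreach ($logLikelihoods as $label => $value) {
        $exps[$label] = exp($value - $max);
    }

    $sum   = array_sum($exps);
    $probs = [];
    foreach ($exps as $label => $e) {
        $probs[$label] = $e / $sum;
    }

    arsort($probs); // highest probability first, keys preserved
    return $probs;
}

// Made-up log-likelihoods, e.g. as collected per type in multi_predict().
$probs = softmax(['sports' => -2.0, 'politics' => -6.0, 'tech' => -3.0]);

foreach ($probs as $label => $p) {
    printf("%s: %.1f%%\n", $label, $p * 100);
}
```

Note that a softmax output always sums to 100% across all labels, so it expresses "which single category is most likely" rather than an independent percentage per category. A list such as 95% / 57% / 35% for the same document would need a different setup, for example one yes/no classifier per category.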

simon21587 commented 5 years ago

Thank you!

teamtnt / tutorials

Probability #5