vinhkhuc / JFastText

Java interface for fastText

Different results from command line tool #49

Open nirzohar opened 5 years ago

nirzohar commented 5 years ago

The predict-prob method returns different results in the Java wrapper and the native command line tool. For example, see the results of test05PredictProba in the JFastTextTest class (or test with your own model). The Java wrapper returns a probability of 0.500125, while the native C++ tool returns 0.500075.

Admittedly, this looks like a minor, unimportant difference, but when I test the probabilities with large model files, I see a huge gap between the returned probabilities.

carschno commented 5 years ago

I've been able to reproduce this and have isolated the issue to the trailing newline character. Apparently, this is rooted in FastText itself; however, the problem probably does not arise there because it operates on line-by-line input, whereas the Java API allows for arbitrary (multi-line) strings.

$ echo "Weak wifi otherwise all ok" | fasttext predict-prob model.bin - 5
__label__60 0.678738 __label__80 0.315212 __label__40 0.0055875 __label__100 0.000415088  __label__20 6.88466e-05

Now without trailing newline:

$ echo -n "Weak wifi otherwise all ok" | fasttext predict-prob model.bin - 5
__label__60 0.807072 __label__80 0.126261 __label__40 0.049052 __label__4 0.00411388 __label__5 0.00340998

Running the Java package (built with mvn clean package) on the command line:

$ echo "Weak wifi otherwise all ok" | java -jar JFastText/target/jfasttext-0.4-jar-with-dependencies.jar  predict-prob model.bin - 5
__label__60 0.678737 __label__80 0.315212 __label__40 0.0055875 __label__100 0.000415088 __label__20 6.88467e-05

Again, without trailing newline:

$ echo -n "Weak wifi otherwise all ok" | java -jar JFastText/target/jfasttext-0.4-jar-with-dependencies.jar  predict-prob model.bin - 5
__label__60 0.807072 __label__80 0.126261 __label__40 0.049052 __label__4 0.00411388 __label__5 0.00340998

In the Java API, this is also reproducible. With trailing newline:

    JFastText jft = new JFastText();
    jft.loadModel("model.bin");
    List<ProbLabel> predictions = jft.predictProba("Weak wifi otherwise all ok\n", 5);

Without trailing newline:

    JFastText jft = new JFastText();
    jft.loadModel("model.bin");
    List<ProbLabel> predictions = jft.predictProba("Weak wifi otherwise all ok", 5);

The results are the same as above with echo and echo -n respectively.
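
Since the native tool reads its input line by line, one way to get comparable behaviour from the Java API is to split a multi-line string into lines and terminate each with exactly one newline before predicting. The following is only a minimal sketch of that idea: the class `CliLines` and the method `toCliLines` are hypothetical names and not part of JFastText, and skipping blank lines and trimming surrounding whitespace are design choices made here for illustration, not documented behaviour of the command line tool.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of JFastText) that shapes input like line-based CLI input. */
final class CliLines {
    private CliLines() {}

    /**
     * Splits text on any line break and returns the non-blank lines,
     * each trimmed and terminated by exactly one '\n'.
     * Blank lines are dropped (an assumption made for this sketch).
     */
    static List<String> toCliLines(String text) {
        List<String> lines = new ArrayList<>();
        if (text == null) {
            return lines;
        }
        // \R matches any line break sequence (\n, \r\n, \r, ...), available since Java 8.
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed + "\n");
            }
        }
        return lines;
    }
}
```

Each element of the returned list could then be passed to jft.predictProba(line, 5) as in the snippets above, so that every prediction sees a single line with a trailing newline, the same way echo (without -n) feeds the tool.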

carschno commented 5 years ago

This is actually a known issue in FastText, see: https://github.com/facebookresearch/fastText/issues/435 and https://github.com/facebookresearch/fastText/issues/165

kun368 commented 2 years ago

Based on what @carschno mentioned, I used this to get the right results:

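import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.collections4.CollectionUtils; // assumed: Apache Commons Collections 4
import org.apache.commons.lang3.StringUtils;            // assumed: Apache Commons Lang 3
import com.github.jfasttext.JFastText;

// `model` is assumed to be a JFastText instance with a model already loaded.
public Map<String, Double> predictTopLabel(String text, int k) {
    Map<String, Double> scoreMap = new LinkedHashMap<>();
    // Strip surrounding whitespace and end the input with exactly one newline.
    text = StringUtils.trimToEmpty(text) + "\n";
    final List<JFastText.ProbLabel> pl = model.predictProba(text, k);
    for (JFastText.ProbLabel i : CollectionUtils.emptyIfNull(pl)) {
        final double prob = Math.exp(i.logProb);
        // Round to 8 decimal places; divide by a double to avoid integer division.
        final double score = Math.round(prob * 100000000) / 100000000.0;
        scoreMap.put(i.label, score);
    }
    return scoreMap;
}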