findMostFrequentLanguages gives incorrect results

hixus commented 5 years ago

I tried your lib and "findLanguage" seems to work fine, but combining languages and then calling "findMostFrequentLanguages" finds only one language in a couple of cases.

const test = require("tape");
const { loadModule } = require("cld3-asm");

test("findMostFrequentLanguages", async t => {
  t.plan(7);
  const cldFactory = await loadModule();
  const identifier = cldFactory.create(0, 100);

  const textEN = "This piece of text is in English.";
  const textBG = "Този текст е на Български.";
  const textFI = "Tämä teksti on suomea.";
  const textSV = "Den här texten är på Svenska.";

  const testEN = identifier.findLanguage(textEN);
  t.equal(testEN.language, "en"); // ok

  const testBG = identifier.findLanguage(textBG);
  t.equal(testBG.language, "bg"); // ok

  const testFI = identifier.findLanguage(textFI);
  t.equal(testFI.language, "fi"); // ok

  const testSV = identifier.findLanguage(textSV);
  t.equal(testSV.language, "sv"); // ok

  const testEN_BG = identifier.findMostFrequentLanguages(
    `${textEN} ${textBG}`,
    3
  );
  t.deepEqual(testEN_BG.map(lang => lang.language), ["bg", "en"]); // ok

  const testEN_FI = identifier.findMostFrequentLanguages(
    `${textEN} ${textFI}`,
    3
  );
  t.deepEqual(testEN_FI.map(lang => lang.language), ["fi", "en"]); // not ok, just ["fi"]

  const testEN_SV = identifier.findMostFrequentLanguages(
    `${textEN} ${textSV}`,
    3
  );
  t.deepEqual(testEN_SV.map(lang => lang.language), ["sv", "en"]); // not ok, just ["sv"]
});

hixus commented 5 years ago

I also tried with 3-5x longer text and the results were the same.

kwonoj commented 5 years ago

The example sentences in the snippet above are too short, even at 3-5x the length. The minimum length cld3 recommends is 150-200 characters per language; shorter text greatly decreases accuracy. For accuracy and other cld3 behavior, I suggest filing an issue at https://github.com/google/cld3, since this module doesn't do any special handling.
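
As a rough illustration of the length point above, here is a minimal sketch of a guard that checks each per-language segment before the segments are combined. `MIN_CHARS` and `shortSegments` are hypothetical names, not part of cld3-asm, and the threshold of 150 is simply the lower end of the range quoted above.

```javascript
// Hypothetical helper, not part of cld3-asm.
// 150 is the lower end of the 150-200 character range mentioned above.
const MIN_CHARS = 150;

// Returns the keys of all segments shorter than minChars.
// Length is counted in Unicode code points, not UTF-16 units or bytes.
function shortSegments(segments, minChars = MIN_CHARS) {
  return Object.entries(segments)
    .filter(([, text]) => Array.from(text).length < minChars)
    .map(([name]) => name);
}

const tooShort = shortSegments({
  en: "This piece of text is in English.",
  fi: "Tämä teksti on suomea."
});
console.log(tooShort); // lists the segments that fall below the threshold
```

Running a check like this before calling "findMostFrequentLanguages" makes it easier to tell a too-short input apart from a genuine detection problem.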

hixus commented 5 years ago

Hmm, I tried combinations with 400-500 characters of text per language and the results still remained similar. I will look into the cld3 repo you linked.
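
Since "findLanguage" returned the expected language for each sentence on its own in the test above, one possible workaround is to split the text and tally per-sentence results instead of relying on "findMostFrequentLanguages". This is only a sketch under assumptions: `tallyLanguages`, the naive sentence splitter, and the `fakeDetect` stand-in are all hypothetical. In real use, `detect` would presumably be something like `text => identifier.findLanguage(text).language`.

```javascript
// Hypothetical workaround sketch; tallyLanguages is not part of cld3-asm.
// detect: a function that maps a string to a language code.
function tallyLanguages(text, detect) {
  // Naive splitter: break after ., ! or ? when followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0);

  // Count how many sentences were assigned to each language.
  const counts = new Map();
  for (const sentence of sentences) {
    const language = detect(sentence);
    counts.set(language, (counts.get(language) || 0) + 1);
  }

  // Most frequent language first.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([language, count]) => ({ language, count }));
}

// Stand-in detector so the sketch runs without loading cld3-asm.
const fakeDetect = s => (s.includes("suomea") ? "fi" : "en");

const result = tallyLanguages(
  "This piece of text is in English. Tämä teksti on suomea. More English here.",
  fakeDetect
);
console.log(result);
```

Per-sentence detection is subject to the same short-text accuracy caveat, so this trades one limitation for another rather than fixing the underlying behavior.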

kwonoj / cld3-asm #136