nitotm / efficient-language-detector

Fast and accurate natural language detection. Detector written in PHP. Nito-ELD, ELD.
Apache License 2.0

False positives on the English-only subset. #7

Open pryley opened 9 months ago

pryley commented 9 months ago

The English-only ngrams subset doesn't work too well. Is it trained on the same dataset as the others?

$content = 'Nostrum et sapiente in ipsam amet quas ut. Adipisci dolores nihil a facere est voluptas et nostrum. Nobis at laborum odit deleniti ut voluptatem. Modi recusandae ad ut incidunt minima molestiae.';

$eld = new LanguageDetector('ngramsM60-1.2rrx014rx6yos0gkkogws8ksc0okcwk.php');
$eld->cleanText(true);
$eld->detect($content);
Nitotm\Eld\LanguageResult {
  +language: "en",
  +scores: [
    "en" => 0.52298237476809,
  ],
  -numNgrams: 49,
  -avgScore: [...],
  language: "en",
  scores: [
    "en" => 0.52298237476809,
  ],
  isReliable(): true,
}
pryley commented 9 months ago

Much better results using the de/en/es/fr/it/nl subset, but I'm not excited about the additional memory usage of larger ngram subsets and I only need English language detection.

$content = 'Nostrum et sapiente in ipsam amet quas ut. Adipisci dolores nihil a facere est voluptas et nostrum. Nobis at laborum odit deleniti ut voluptatem. Modi recusandae ad ut incidunt minima molestiae.';

$eld = new LanguageDetector('ngramsM60-6.5ijqhj4oecso0kwcok4k4kgoscwg80o.php');
$eld->cleanText(true);
$eld->detect($content);
Nitotm\Eld\LanguageResult {#8965
  +language: "it",
  +scores: [
    "it" => 0.30525471552257,
    "en" => 0.29610196351268,
    "fr" => 0.28801600185529,
    "es" => 0.24844426406926,
    "nl" => 0.17167980828695,
    "de" => 0.15610892084106,
  ],
  -numNgrams: 49,
  -avgScore: [...],
  language: "it",
  scores: [
    "it" => 0.30525471552257,
    "en" => 0.29610196351268,
    "fr" => 0.28801600185529,
    "es" => 0.24844426406926,
    "nl" => 0.17167980828695,
    "de" => 0.15610892084106,
  ],
  isReliable(): false,
}
nitotm commented 9 months ago

The isReliable() function can definitely be improved. As it stands, I would say ELD is not a good tool for deciding whether a string belongs to a specific language when using a single-language subset.

The main problem is that when ELD finds only one language in a string, it scores it very high; some adjustment for very small or single-language subsets should be added.

I am currently finishing ver. 3.0.0, with a new scoring system, which will most likely not solve this issue, but I will leave this "issue" open to tackle this problem next, at ver. 3.0.

PS: Since I cannot provide you with a quick fix, in case you still want to try to use ELD for this English-only scenario, what you would need to do is create your own benchmark of English and non-English strings, then increase the 'en' => 0.0378 value in resources/avgScore.php and try to calculate or guess the optimal value. Also, before that, in this scenario it might help to decrease $relevancy = 27; at src/LanguageDetector.php, to 820, but I’m just guessing here.
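
A benchmark like the one suggested above could be sketched roughly as follows. This is only an illustration, not part of ELD: countErrors(), bestThreshold() and the $scoreEnglish callable are hypothetical names, the scores are made-up stand-ins so the snippet runs without ELD installed, and the cutoff being swept is a stand-in for whatever value you end up tuning. In practice $scoreEnglish would wrap your own call to the detector.

```php
<?php
declare(strict_types=1);

/**
 * Count errors for one candidate cutoff.
 * $samples: list of [text, isEnglish] pairs (your own labeled benchmark).
 * $scoreEnglish: callable(string): float (hypothetical wrapper around the detector).
 */
function countErrors(array $samples, callable $scoreEnglish, float $threshold): array
{
    $falsePositives = 0; // non-English text accepted as English
    $falseNegatives = 0; // English text rejected
    foreach ($samples as [$text, $isEnglish]) {
        $accepted = $scoreEnglish($text) >= $threshold;
        if ($accepted && !$isEnglish) {
            $falsePositives++;
        } elseif (!$accepted && $isEnglish) {
            $falseNegatives++;
        }
    }
    return ['fp' => $falsePositives, 'fn' => $falseNegatives];
}

/** Return the candidate cutoff with the fewest total errors (first one wins ties). */
function bestThreshold(array $samples, callable $scoreEnglish, array $candidates): float
{
    $best = null;
    $bestErrors = PHP_INT_MAX;
    foreach ($candidates as $threshold) {
        $errors = countErrors($samples, $scoreEnglish, $threshold);
        $total = $errors['fp'] + $errors['fn'];
        if ($total < $bestErrors) {
            $bestErrors = $total;
            $best = $threshold;
        }
    }
    return (float) $best;
}

// Made-up scores standing in for real detector output (assumption, not ELD data).
$fakeScores = [
    'The weather is nice today.' => 0.61,
    'Please send me the report.' => 0.58,
    'Nostrum et sapiente in ipsam amet.' => 0.52,
    'El tiempo es agradable hoy.' => 0.40,
];
$scoreEnglish = fn (string $text): float => $fakeScores[$text] ?? 0.0;

$samples = [
    ['The weather is nice today.', true],
    ['Please send me the report.', true],
    ['Nostrum et sapiente in ipsam amet.', false],
    ['El tiempo es agradable hoy.', false],
];

$threshold = bestThreshold($samples, $scoreEnglish, [0.30, 0.45, 0.55, 0.65]);
```

Minimizing the plain sum of false positives and false negatives is just one choice; weighting one kind of error more heavily would make the result more or less conservative.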

flexchar commented 7 months ago

Keep us posted on your journey to version 3.0 @nitotm!

nitotm commented 7 months ago

It is taking me a bit longer, since I decided to integrate all the changes that were long planned, plus new suggestions, and I keep finding things to improve.

Improvements in accuracy are great, also in efficiency, although since I have added more steps and a bigger database, I’m not sure if it’s finally going to be faster, but the additions are worth it.

nitotm commented 9 hours ago

Well, ELD v3-beta is finally available

Regarding the issue, I would say it is fixed: now, no matter the subset, even for a single language, the scores will be the same. isReliable() is also quite stable. It will vary a little bit, since one of its metrics is for the top score to be +5% over the next one, so the fewer languages, the more positives it will give; but the text also needs to be >75% of the average score for that language.
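
To make the two conditions above concrete, here is a rough sketch. It is not ELD's actual code: isReliableSketch() is a hypothetical name, reading "+5% over the next one" as a relative margin (top >= 1.05 × runner-up) is an assumption, and so is comparing the top score directly against 75% of a per-language average.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical reliability check, not ELD's implementation.
 * $scores: language => score; $avgScores: language => typical score for that language.
 */
function isReliableSketch(array $scores, array $avgScores): bool
{
    if ($scores === []) {
        return false;
    }
    arsort($scores); // highest score first, keys preserved
    $languages = array_keys($scores);
    $values = array_values($scores);
    $top = $languages[0];

    // Rule 1 (assumed reading): top score at least 5% above the runner-up.
    // With a single-language subset there is no runner-up, so this rule always passes.
    if (isset($values[1]) && $values[0] < $values[1] * 1.05) {
        return false;
    }

    // Rule 2 (assumed reading): top score above 75% of that language's average score.
    return isset($avgScores[$top]) && $values[0] > 0.75 * $avgScores[$top];
}

// Averages below are invented for illustration only.
$avgScores = ['en' => 0.60, 'it' => 0.35];

// A near-tie between the top two languages fails rule 1.
$closeCall = isReliableSketch(['it' => 0.305, 'en' => 0.296, 'fr' => 0.288], $avgScores);

// A single-language subset skips rule 1, so only the average-score check applies.
$singleLanguage = isReliableSketch(['en' => 0.523], $avgScores);
```

The single-language case shows why fewer languages give more positives: with no runner-up, only the average-score condition can reject the text.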

Also, isReliable() cannot possibly be 100% reliable itself. I tried a couple of things to get the fewest false negatives and false positives, but then it is just a matter of trading them at a desirable ratio, making it more or less conservative.
I have it quite conservative; the benchmark was something like 20% false negatives and 10% false positives (it also depends on the database size model, the bigger the better).

I'm open to any suggestions for isReliable(). Discussion for v3-beta at: https://github.com/nitotm/efficient-language-detector/discussions/10