languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Wrong language detection #1210

Open danielnaber opened 6 years ago

danielnaber commented 6 years ago

Issue to collect cases of incorrect language detection, even with fastText. This is nothing we can easily fix, but we should at least be aware of the issues:

f-knorr commented 6 years ago

Wouldn't it make sense to require a minimum text length before another language is suggested? I have just entered a name (Stephanie) and LT suggests switching to English. With proper names in particular, a name may be correct in several languages (e.g., Maria is a valid and common name in English, German, Italian and maybe others).
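A minimal sketch of such a length gate, to make the idea concrete. The function name and both thresholds are hypothetical and not taken from LanguageTool; where exactly to draw the line is an open question.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_CHARS = 30
MIN_WORDS = 4


def should_suggest_language(text, detected, current,
                            min_chars=MIN_CHARS, min_words=MIN_WORDS):
    """Return True only if the text is long enough to trust the detection
    and the detected language differs from the current one."""
    stripped = text.strip()
    # Too short: a single name like "Stephanie" never triggers a suggestion.
    if len(stripped) < min_chars or len(stripped.split()) < min_words:
        return False
    return detected != current
```

With these values, `should_suggest_language("Stephanie", "en", "de")` returns `False`, while a full sentence detected as a different language returns `True`.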

danielnaber commented 6 years ago

Maybe, but I don't know where to draw the line. A text might even start with several names... When you say "LT suggests", what client are you referring to?

f-knorr commented 6 years ago

image

ghost commented 6 years ago

It might be wise to exclude any capitalized word (proper names mostly) as a quick hack.
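Roughly, the hack could look like the sketch below. The function name is hypothetical, and it only filters the text that would be handed to the detector; it does not touch any LanguageTool code.

```python
def strip_capitalized_words(text):
    """Drop every word that starts with an uppercase letter."""
    kept = []
    for word in text.split():
        # Ignore leading quotes and brackets when checking the first letter.
        core = word.lstrip("\"'([")
        if core[:1].isupper():
            continue
        kept.append(word)
    return " ".join(kept)
```

For example, `strip_capitalized_words("my friend Maria lives in Rome")` returns `"my friend lives in"`. The cost is that sentence-initial words are dropped too, and in German all nouns would be removed, so the detector would see much less text.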

SkyCharger001 commented 6 years ago

And perhaps also exclude the prepositions that are part of family names (e.g., "Van den Wateren never got the hang of his paternal grandparents' tongue.").
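A sketch of how that could work: drop capitalized words, and also drop a lowercase name particle when it directly precedes a dropped word. The particle list and the function name are hypothetical and far from complete.

```python
# Hypothetical, incomplete list of surname particles.
PARTICLES = {"van", "den", "der", "de", "von", "ten", "ter", "di", "da"}


def strip_names(text):
    """Remove capitalized words and the name particles attached to them."""
    words = text.split()
    drop = [w[:1].isupper() for w in words]
    # Walk right to left so a chain like "van den Wateren" is removed in full.
    for i in range(len(words) - 2, -1, -1):
        if words[i].lower() in PARTICLES and drop[i + 1]:
            drop[i] = True
    return " ".join(w for w, d in zip(words, drop) if not d)
```

A particle that is not followed by a capitalized word is kept, so `strip_names("pain de campagne")` returns the text unchanged.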

MikeUnwalla commented 3 years ago

English is detected as French on this LinkedIn post:

image

Aside: shouldn't the messages after the suggestions be in French?

danielnaber commented 3 years ago

Could you send the full text as text (not just as a screenshot)?

MikeUnwalla commented 3 years ago

At Congrès Inforsid 2021 (https://inforsid2021.sciencesconf.org/resource/page/id/14) on June 1, Tuesday 2:00 - 5:30 pm. I will present an online workshop (in English) about ASD-STE100.

Controlled language for text simplification: Concepts and implementation

ABSTRACT. In commerce and industry, many organizations use plain language, for example, ‘plain English’ and ‘lenguaje claro’ [Spanish]. For safety-critical documentation, plain language is not always sufficient, and some organizations use controlled language. ASD-STE100 Simplified Technical English is a specification for a controlled language. In this paper, we present the TechScribe term checker for ASD-STE100, which checks a document for conformity to ASD-STE100. Many of the ASD-STE100 rules are applicable to the simplification of scientific texts. To show that, this paper conforms to ASD-STE100 as much as possible.

MikeUnwalla commented 3 years ago

Sorry Daniel, I should have thought to send the full text.

MikeUnwalla commented 3 years ago

If you add the English text directly to a post, there is no problem.

To reproduce the problem, add English text and French text in the same post: image

The French text made the post too long. After I deleted the French text, LT continued to give the warnings.

French text: RÉSUMÉ. Dans le commerce et l'industrie, de nombreuses organisations utilisent des langues simplifiées, par exemple "plain English" et "lenguaje claro" [éspagnol]. Pour la documentation technique, critique pour la sécurité, "plain language" n'est pas toujours suffisant et certaines organisations utilisent une langue contrôlée. L'ASD-STE100 Simplified Technical English est la spécification d’une langue contrôlée. Dans cet article, nous présentons TechScribe, un logiciel conçu pour vérifier automatiquement la conformité d'un document aux règles ASD-STE100. De nombreuses règles de l'ASD-STE100 sont applicables à la simplification des textes scientifiques. Pour le démontrer, dans la mesure du possible, cet article est conforme à la norme ASD-STE100.

danielnaber commented 3 years ago

Thanks, I can reproduce it now. It seems to be add-on-related, I have opened an issue there (https://github.com/languagetooler-gmbh/browser-add-on-rewrite/issues/1263).
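To check whether a wrong detection comes from the server or from the add-on, the same text can be sent to the HTTP API directly. The sketch below assumes the public `/v2/check` endpoint with `language=auto` and a response that carries `language.detectedLanguage.code`; the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed public endpoint; a self-hosted server would use its own URL.
API_URL = "https://api.languagetool.org/v2/check"


def build_check_request(text):
    """Build (but do not send) a POST request that asks for auto-detection."""
    body = urlencode({"text": text, "language": "auto"}).encode("utf-8")
    return Request(API_URL, data=body, method="POST")


def detected_language(response):
    """Extract the detected language code from a decoded JSON response."""
    return response["language"]["detectedLanguage"]["code"]
```

Sending the request with `urllib.request.urlopen` and decoding the body with `json.load` gives the dictionary that `detected_language` expects. If the API detects English for the text above, the problem is on the client side.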

MikeUnwalla commented 2 years ago

  1. Start with English (American) and 'Automatically detect language' not selected.
  2. Delete all the text.
  3. Select 'Automatically detect language'.
  4. End with Australian English: image

MikeUnwalla commented 2 years ago

And another: image

Click 'Automatically detect language' and LT detects Romanian: image