google / cld3

Apache License 2.0
795 stars 111 forks

Language Detection incorrect #36

Open apshar opened 4 years ago

apshar commented 4 years ago

The language detected by CLD is incorrect for the pages below:

  1. http://remembertheaba.com/ABAYearlyGameLogs/7273Part3.html
  2. http://lila.science/wp-content/uploads/2020/03/lila_sas_urls.txt

Both pages are in English, but CLD detects Danish and Vietnamese as their languages. (Attached images: LanguageDetection2, LanguageDetection1)

ugeshgupta000 commented 3 years ago

Here is another, similar issue, which occurs on an internal page (sorry, I cannot provide the URL).

However, here are the page contents given to CLD3: "Azure DevOps outlookweb / Platform / Boards / Work items Platform Overview Boards Work items Boards Backlogs Sprints Queries Plans Retrospectives Portfolio plans (Beta) Tags Repos Pipelines Test Plans Artifacts Compliance Coral Project settings"

Based on this text, CLD determines the page language to be "fr", while it is in fact an "en" page.
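
For anyone who wants to reproduce this, here is a minimal sketch using the `gcld3` Python bindings from this repository (assuming they are installed, e.g. via `pip install gcld3`). The `Detection` tuple and the `check_expected` helper are hypothetical names introduced only for this example, and the `min_num_bytes`/`max_num_bytes` values are simply the ones shown in the project README, not necessarily what the original reporters used.

```python
from typing import Callable, NamedTuple


class Detection(NamedTuple):
    """Hypothetical container for the fields of interest in a CLD3 result."""
    language: str
    probability: float
    is_reliable: bool


# Page text quoted verbatim from the comment above.
PAGE_TEXT = (
    "Azure DevOps outlookweb / Platform / Boards / Work items Platform "
    "Overview Boards Work items Boards Backlogs Sprints Queries Plans "
    "Retrospectives Portfolio plans (Beta) Tags Repos Pipelines Test Plans "
    "Artifacts Compliance Coral Project settings"
)


def detect_with_gcld3(text: str) -> Detection:
    """Run CLD3 on `text` through the gcld3 bindings (assumed installed)."""
    import gcld3  # imported lazily so check_expected works without gcld3

    identifier = gcld3.NNetLanguageIdentifier(min_num_bytes=0,
                                              max_num_bytes=1000)
    result = identifier.FindLanguage(text=text)
    return Detection(result.language, result.probability, result.is_reliable)


def check_expected(text: str, expected: str,
                   detect: Callable[[str], Detection]) -> str:
    """Compare the detected language with the expected one and report it."""
    got = detect(text)
    if got.language == expected:
        return f"OK: {expected}"
    return (f"MISMATCH: expected {expected}, got {got.language} "
            f"(p={got.probability:.2f}, reliable={got.is_reliable})")
```

Calling `print(check_expected(PAGE_TEXT, "en", detect_with_gcld3))` prints the detected language together with its probability and the `is_reliable` flag, which helps show whether CLD3 itself considers the "fr" result trustworthy for such short, navigation-heavy text.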