varnamproject / varnamd

Varnam daemon which also acts as an HTTP server. Deprecated. See https://github.com/varnamproject/varnamd-govarnam/
MIT License

Feature: "Language Packs" #23

Open subins2000 opened 3 years ago

subins2000 commented 3 years ago

varnamd currently has a sync feature. I tried it and it works, but:

  1. The downloaded word files are sorted by ID rather than by confidence, so a lot of unwanted words come through, most of them with 0 confidence
  2. They're created on the fly from the learnings on the server

An alternative that lets users get words easily would be for varnamd to provide "language packs".

Language Pack

A statistically curated learning file is made for each language. This file is made once and placed on the server. This file is called the "Language Pack". There can be multiple language packs for the same language, differentiated by where their words are sourced from.

These packs will be mutually exclusive, that is, words in one pack won't be in the others. Tools to do this are here: https://gitlab.com/smc/corpus/-/tree/master/tools
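To illustrate the mutual-exclusion rule, here is a minimal Python sketch. It is not the tooling from the linked repository; the function name `make_exclusive` and the idea that packs are processed in a priority order (a word stays in the first pack that contains it) are assumptions made for the example.

```python
def make_exclusive(packs):
    """Keep each word only in the first pack that contains it.

    packs: list of (pack_name, {word: confidence}) in priority order
           (the ordering is an assumption of this sketch).
    Returns a new list with the same shape; the input is not modified.
    """
    seen = set()
    result = []
    for name, words in packs:
        # Drop every word that an earlier pack has already claimed.
        kept = {w: c for w, c in words.items() if w not in seen}
        seen.update(kept)
        result.append((name, kept))
    return result


packs = [
    ("ml-basic", {"ഒരു": 1623, "ഈ": 1186}),
    ("ml-twitter", {"ഈ": 40, "കോടി": 483}),
]
exclusive = make_exclusive(packs)
```

Here "ഈ" stays in `ml-basic` and is dropped from `ml-twitter`, so no word appears in two packs.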

The words in the files will be sorted by confidence. Sample:

ഒരു 1623
മുഖ്യമന്ത്രി 1448
ഈ 1186
സർക്കാർ 769
പറഞ്ഞു 564
എന്ന 530
കോടി 483
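A pack file in this shape could be read as follows. This is only a sketch: it assumes one `word confidence` pair per line separated by whitespace, as in the sample above, and the helper names `parse_pack` and `is_sorted_by_confidence` are hypothetical.

```python
def parse_pack(text):
    """Parse 'word confidence' lines into a list of (word, int) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so the last field is always the confidence.
        word, confidence = line.rsplit(None, 1)
        entries.append((word, int(confidence)))
    return entries


def is_sorted_by_confidence(entries):
    """True if confidences never increase from one line to the next."""
    return all(a[1] >= b[1] for a, b in zip(entries, entries[1:]))


sample = "ഒരു 1623\nമുഖ്യമന്ത്രി 1448\nഈ 1186\nസർക്കാർ 769\n"
entries = parse_pack(sample)
```

A check like `is_sorted_by_confidence` could run when a pack is built, so that importers can rely on the most useful words coming first.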

Language packs are versioned: each pack will have versions. Subsequent versions will also be mutually exclusive, containing only the latest words. A new user will have to download every version to be up to date (better if there's a special URL that combines them and serves the result). This will be somewhat like Windows updates.
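The "combine all versions" idea could look roughly like this on the server side. It is a sketch under assumptions: versions are held in memory as lists of `(word, confidence)` keyed by an integer version number, and `combine_versions` is a hypothetical name, not an existing varnamd function.

```python
def combine_versions(versions):
    """Merge every version of a pack into one list sorted by confidence.

    versions: {version_number: [(word, confidence), ...]}
    Versions are expected to be mutually exclusive; if a word does
    repeat, the entry from the highest version wins (an assumption).
    """
    combined = {}
    for number in sorted(versions):
        for word, confidence in versions[number]:
            combined[word] = confidence
    # Highest confidence first, matching the order used inside each pack.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


versions = {
    2: [("x", 5)],
    1: [("a", 10), ("b", 3)],
}
merged = combine_versions(versions)
```

A new user would then need a single download instead of one per version.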

Deletions of words from packs shouldn't be versioned; instead, the words will be removed from all versions of the pack.
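As a sketch of that deletion rule, assuming the same in-memory layout of `{version_number: [(word, confidence), ...]}` and a hypothetical helper name `delete_word`:

```python
def delete_word(versions, word):
    """Return a copy of the pack with `word` removed from every version."""
    return {
        number: [(w, c) for w, c in entries if w != word]
        for number, entries in versions.items()
    }


pack = {1: [("a", 10), ("bad", 3)], 2: [("x", 5)]}
cleaned = delete_word(pack, "bad")
```

No new version is created; the existing version files are simply rewritten without the word.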

varnamd on the server will provide language packs for users to download. varnamd should also have a function to import them, just like how sync works currently. See #22

With this feature, users can easily download and import packs and stay up to date. With Varnam Desktop coming, that will be the easiest way to do it. Plus, when Varnam comes to Indic Keyboard, it'll also be an easy way to import words. Mockup screenshots are in #22

cc @athul @joicemjoseph

subins2000 commented 3 years ago

Here's a model of how it looks:

(image: mockup)

Array of JSON objects:

[
  {
    "identifier": "ml-basic",
    "name": "Malayalam Basic",
    "description": "Collection of basic Malayalam words",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-basic-1",
        "version": "1",
        "description": "Most common words found across many sources",
        "size": 10
      },
      {
        "identifier": "ml-basic-2",
        "version": "2",
        "description": "Some new-gen words from 2020",
        "size": 1
      }
    ]
  },
  {
    "identifier": "ml-twitter",
    "name": "Malayalam Twitter",
    "description": "Collection of words sourced from Twitter",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-twitter-1",
        "version": "1",
        "description": "Most common words found across many sources",
        "size": 10
      }
    ]
  },
  {
    "identifier": "ml-english",
    "name": "English Words in Malayalam",
    "description": "Collection of English words written in Malayalam. E.g.: KSEB, Facebook",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-english-1",
        "version": "1",
        "description": "Basic words like \"try\", \"last\", \"first\" and many more sourced from social media.",
        "size": 10
      }
    ]
  }
]
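On the client side, an index in this shape could be used to work out which pack versions still need downloading. The sketch below assumes the index is served as valid JSON and that the client keeps a set of already-imported version identifiers; `pending_versions` and `INDEX_JSON` are hypothetical names, and the index is trimmed to the fields the function uses.

```python
import json

# Trimmed copy of the index above, for illustration only.
INDEX_JSON = """
[
  {"identifier": "ml-basic", "lang": "ml", "versions": [
    {"identifier": "ml-basic-1", "version": "1"},
    {"identifier": "ml-basic-2", "version": "2"}
  ]},
  {"identifier": "ml-twitter", "lang": "ml", "versions": [
    {"identifier": "ml-twitter-1", "version": "1"}
  ]}
]
"""


def pending_versions(index, installed):
    """List version identifiers not yet imported, oldest version first per pack."""
    pending = []
    for pack in index:
        # "version" is a string in the index, so compare numerically.
        for version in sorted(pack["versions"], key=lambda v: int(v["version"])):
            if version["identifier"] not in installed:
                pending.append(version["identifier"])
    return pending


index = json.loads(INDEX_JSON)
todo = pending_versions(index, installed={"ml-basic-1"})
```

With `ml-basic-1` already imported, the remaining downloads are `ml-basic-2` and `ml-twitter-1`.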