Open subins2000 opened 4 years ago
Here's a model of how it looks :
Array of JSON objects :
[
  {
    identifier: 'ml-basic',
    name: 'Malayalam Basic',
    description: 'Collection of basic Malayalam words',
    lang: 'ml',
    versions: [
      {
        identifier: 'ml-basic-1',
        version: '1',
        description: 'Most common words found across many sources',
        size: 10
      },
      {
        identifier: 'ml-basic-2',
        version: '2',
        description: 'Some new-gen words from 2020',
        size: 1
      }
    ]
  },
  {
    identifier: 'ml-twitter',
    name: 'Malayalam Twitter',
    description: 'Collection of words sourced from Twitter',
    lang: 'ml',
    versions: [
      {
        identifier: 'ml-twitter-1',
        version: '1',
        description: 'Most common words found across many sources',
        size: 10
      }
    ]
  },
  {
    identifier: 'ml-english',
    name: 'English Words in Malayalam',
    description: 'Collection of English words written in Malayalam. Eg: KSEB, Facebook',
    lang: 'ml',
    versions: [
      {
        identifier: 'ml-english-1',
        version: '1',
        description: 'Basic words like "try", "last", "first" and many more sourced from social media.',
        size: 10
      }
    ]
  }
]
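To make the listing model concrete, here is a minimal sketch of how a client could read such a listing. It assumes the server returns the array as strict JSON (quoted keys), and the helper names `packs_for_language` and `total_size` are hypothetical, not part of varnamd. The unit of `size` isn't specified in the model, so the sum is just whatever unit the server reports.

```python
import json

# Hypothetical listing in the shape of the model, trimmed to one pack
# and written as strict JSON (quoted keys and double-quoted strings).
LISTING = json.loads("""
[
  {
    "identifier": "ml-basic",
    "name": "Malayalam Basic",
    "description": "Collection of basic Malayalam words",
    "lang": "ml",
    "versions": [
      {"identifier": "ml-basic-1", "version": "1",
       "description": "Most common words found across many sources",
       "size": 10},
      {"identifier": "ml-basic-2", "version": "2",
       "description": "Some new-gen words from 2020",
       "size": 1}
    ]
  }
]
""")


def packs_for_language(listing, lang):
    """Return the packs whose `lang` field matches `lang`."""
    return [pack for pack in listing if pack["lang"] == lang]


def total_size(pack):
    """Sum the `size` field over every version of a pack."""
    return sum(version["size"] for version in pack["versions"])


if __name__ == "__main__":
    for pack in packs_for_language(LISTING, "ml"):
        print(pack["identifier"], total_size(pack))
```

A UI could use something like `total_size` to show a new user how much they'd download to get a whole pack.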
Currently there's a sync feature in varnamd. I tried it successfully. It works, but
An alternative solution for users to easily get words would be for varnamd to provide "language packs".
Language Pack
A statistically curated learning file is made for each language. This file is made once and placed on the server. It is called the "Language Pack". There can be multiple language packs for the same language, distinguished by where the words are sourced from.
These packs will be mutually exclusive, that is, words in one pack won't appear in the others. Tools to do this are here: https://gitlab.com/smc/corpus/-/tree/master/tools
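A rough sketch of what "mutually exclusive" means in practice. This is not the code from the linked tools; the function name and the "first pack that has the word keeps it" rule are assumptions made for illustration.

```python
def make_mutually_exclusive(packs):
    """Keep each word only in the first pack that contains it.

    `packs` maps a pack identifier to its list of words. Packs are
    processed in insertion order, so earlier packs take priority
    (an assumed rule, not something the proposal specifies).
    """
    seen = set()
    result = {}
    for identifier, words in packs.items():
        unique = []
        for word in words:
            if word not in seen:
                seen.add(word)
                unique.append(word)
        result[identifier] = unique
    return result


if __name__ == "__main__":
    packs = {
        "ml-basic": ["a", "b"],
        "ml-twitter": ["b", "c"],
    }
    print(make_mutually_exclusive(packs))
```

With this rule, a word shared by `ml-basic` and `ml-twitter` stays only in `ml-basic`, so importing both packs never imports the same word twice.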
The words in the files will be sorted by confidence. Sample :
Language packs are versioned: each pack will have versions. Subsequent versions will also be mutually exclusive, containing only the latest words. A new user will have to download every version to be up-to-date (better if there's a special URL that combines them and serves the result). This will be kind of like Windows updates.
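To illustrate the update flow, here is a small sketch of how a client might work out which versions it still needs. It assumes the pack entry has the `versions` / `identifier` / `version` fields from the listing model, that `version` is always a numeric string, and that the client remembers the identifiers it has already imported. `versions_to_download` is a hypothetical helper.

```python
def versions_to_download(pack, installed):
    """Return the version entries of `pack` not yet imported, oldest first.

    `installed` is a set of version identifiers the user already has.
    Sorting assumes `version` is a numeric string such as '1' or '2'.
    """
    missing = [
        version for version in pack["versions"]
        if version["identifier"] not in installed
    ]
    return sorted(missing, key=lambda version: int(version["version"]))


if __name__ == "__main__":
    pack = {
        "identifier": "ml-basic",
        "versions": [
            {"identifier": "ml-basic-2", "version": "2", "size": 1},
            {"identifier": "ml-basic-1", "version": "1", "size": 10},
        ],
    }
    # A new user has nothing installed and needs every version.
    print([v["identifier"] for v in versions_to_download(pack, set())])
    # An existing user only needs what came after their last import.
    print([v["identifier"] for v in versions_to_download(pack, {"ml-basic-1"})])
```

Because versions don't overlap, importing the missing ones in order is enough; nothing already imported has to be fetched again.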
Deletions of words in packs shouldn't be versioned; instead, the words will be removed from all the pack versions.
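A sketch of the deletion rule on the server side, assuming each version's words are available as a plain list keyed by version identifier (the real pack file format isn't described here, and `remove_words` is a hypothetical name):

```python
def remove_words(version_words, deleted):
    """Drop every word in `deleted` from every version's word list.

    `version_words` maps a version identifier to its list of words.
    A new mapping is returned; the input is left untouched.
    """
    deleted = set(deleted)
    return {
        identifier: [word for word in words if word not in deleted]
        for identifier, words in version_words.items()
    }


if __name__ == "__main__":
    versions = {
        "ml-basic-1": ["a", "b", "c"],
        "ml-basic-2": ["d", "b"],
    }
    print(remove_words(versions, ["b"]))
```

No new version is created by a deletion: the existing version files are simply rewritten without the removed words.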
varnamd on the server will provide language packs for users to download. varnamd should also have a function to import them, just like how sync works currently. See #22
With this feature, users can easily download, import and be up-to-date. With Varnam Desktop coming, it'll be easiest. Plus, when Varnam comes to Indic Keyboard, it'll also be an easy way to import words. Mockup screenshots in #22
cc @athul @joicemjoseph