landrok / language-detector

A fast and reliable PHP library for detecting languages
MIT License

Add more languages #3

Open rashmiranjanrrs opened 4 years ago

rashmiranjanrrs commented 4 years ago

Hey, can you add more Indic languages, or can you share the pattern or structure of a subset so that I can add new languages as per my requirements? How do I add a new subset?

landrok commented 4 years ago

Hey,

How to add a new subset?

Subset structure

A subset file is a JSON encoded file with the following structure:

{
  "freq":{"D":662077, [...], "tha":240340},
  "n_words":[260942223,308553243,224934017],
  "name":"en"
}
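As a rough illustration of the structure above, here is a minimal Python sketch that checks whether a decoded subset has the three expected keys with plausible types. The function name `validate_subset` is hypothetical and not part of the library. The sample leaves out the entries elided by `[...]` in the example. The thread does not say what the fields mean, so the sketch only checks their shape: `freq` as a map from strings to integer counts, `n_words` as a list of integers, and `name` as a non-empty string.

```python
import json


def validate_subset(subset: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right.

    Hypothetical helper for illustration only, not part of the library.
    """
    problems = []
    freq = subset.get("freq")
    n_words = subset.get("n_words")
    name = subset.get("name")

    # "freq" should map strings (e.g. "D", "tha") to integer counts.
    if not isinstance(freq, dict) or not all(
        isinstance(k, str) and isinstance(v, int) for k, v in freq.items()
    ):
        problems.append("freq must map strings to integer counts")

    # "n_words" should be a list of integers.
    if not isinstance(n_words, list) or not all(
        isinstance(n, int) for n in n_words
    ):
        problems.append("n_words must be a list of integers")

    # "name" should be a non-empty string such as "en".
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")

    return problems


# Sample built from the visible entries of the example above.
sample = json.loads(
    '{"freq": {"D": 662077, "tha": 240340},'
    ' "n_words": [260942223, 308553243, 224934017],'
    ' "name": "en"}'
)
print(validate_subset(sample))
```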

More

As you may guess, a "learning" tool has to be written to generate a subset. It is not yet packaged with the library, but it might be in the future. A piece of advice: to generate a reliable subset file, you have to collect a large number of files in the desired language and, if possible, from different variants of that language.
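To make the idea of a "learning" tool more concrete, here is a minimal Python sketch of one possible approach. It is not the tool used to build the library's own subsets, and the thread does not confirm the exact counting rules. The sketch assumes that `freq` counts character n-grams of length 1 to 3 and that `n_words[i]` is the total number of n-grams of length `i + 1`. It also assumes that n-grams are taken inside whitespace-separated words. The names `build_subset` and `build_subset_from_dir` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def build_subset(texts, name, max_n=3):
    """Count character n-grams (length 1..max_n) over an iterable of strings.

    Assumption: n-grams are taken inside whitespace-separated words; the
    library's real counting rules may differ.
    """
    freq = Counter()
    n_words = [0] * max_n  # n_words[n - 1] = total number of n-grams of length n
    for text in texts:
        for word in text.split():
            for n in range(1, max_n + 1):
                # Slide a window of width n across the word.
                for i in range(len(word) - n + 1):
                    freq[word[i:i + n]] += 1
                    n_words[n - 1] += 1
    return {"freq": dict(freq), "n_words": n_words, "name": name}


def build_subset_from_dir(directory, name):
    """Build a subset from every *.txt file in a directory (UTF-8 assumed)."""
    paths = sorted(Path(directory).glob("*.txt"))
    texts = (p.read_text(encoding="utf-8") for p in paths)
    return build_subset(texts, name)


# Tiny demonstration corpus; a real subset needs a large, varied corpus.
subset = build_subset(["the thaw"], "en")
# ensure_ascii=False keeps non-Latin scripts readable in the JSON output.
print(json.dumps(subset, ensure_ascii=False))
```

With a real corpus, `build_subset_from_dir` would be pointed at a folder of text files in the target language, and the resulting dictionary written out with `json.dump`. Whether the library accepts such a file without further filtering or normalization is not established in this thread.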

Hope this helps

devope commented 3 weeks ago

@landrok Hey! Can you give any advice on how to extend your library to support the Georgian (ka) language (https://en.wikipedia.org/wiki/Georgian_language)?