greyblake / whatlang-rs

Natural language detection library for Rust. Try demo online: https://whatlang.org/
MIT License


# Whatlang

Natural language detection for Rust with focus on simplicity and performance.

Try the [online demo](https://whatlang.org/).


[![Stand With Ukraine](https://raw.githubusercontent.com/vshymanskyy/StandWithUkraine/main/banner2-direct.svg)](https://stand-with-ukraine.pp.ua/)

## Content

* [Features](#features)
* [Get started](#get-started)
* [Who uses Whatlang?](#who-uses-whatlang)
* [Documentation](https://docs.rs/whatlang)
* [Supported languages](https://github.com/greyblake/whatlang-rs/blob/master/SUPPORTED_LANGUAGES.md)
* [Feature toggles](#feature-toggles)
* [How does it work?](#how-does-it-work)
  * [How does the language recognition work?](#how-does-the-language-recognition-work)
  * [How is `is_reliable` calculated?](#how-is-is_reliable-calculated)
* [Make tasks](#make-tasks)
* [Comparison with alternatives](#comparison-with-alternatives)
* [Ports and clones](#ports-and-clones)
* [Donations](#donations)
* [Derivation](#derivation)
* [License](#license)
* [Contributors](#contributors)

## Features

* Supports [69 languages](https://github.com/greyblake/whatlang-rs/blob/master/SUPPORTED_LANGUAGES.md)
* 100% written in Rust
* Lightweight, fast and simple
* Recognizes not only a language, but also a script (Latin, Cyrillic, etc.)
* Provides reliability information

## Get started

Example:

```rust
use whatlang::{detect, Lang, Script};

fn main() {
    let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";

    // `detect` returns an `Option`; unwrapping is fine for this known-good sample.
    let info = detect(text).unwrap();

    // Detected language and script.
    assert_eq!(info.lang(), Lang::Epo);
    assert_eq!(info.script(), Script::Latin);

    // Reliability information for the detection.
    assert_eq!(info.confidence(), 1.0);
    assert!(info.is_reliable());
}
```

For more details (e.g. how to blacklist some languages), please check the [documentation](https://docs.rs/whatlang).

## Who uses Whatlang?

Whatlang is used as a direct or indirect dependency for language recognition in the following big projects, so you will be in great company using it:

* [Sonic](https://github.com/valeriansaliou/sonic) - fast, lightweight and schema-less search backend in Rust.
* [Meilisearch](https://github.com/meilisearch) - an open-source, easy-to-use, blazingly fast, and hyper-relevant search engine built in Rust.

## Feature toggles

| Feature     | Description                                                                           |
|-------------|---------------------------------------------------------------------------------------|
| `enum-map`  | `Lang` and `Script` implement the `Enum` trait from [enum-map](https://docs.rs/enum-map/) |
| `arbitrary` | Support [Arbitrary](https://crates.io/crates/arbitrary)                               |
| `serde`     | Implements `Serialize` and `Deserialize` for `Lang` and `Script`                      |
| `dev`       | Enables the `whatlang::dev` module, which provides some internal API. It exists for profiling purposes, and normal users are discouraged from relying on this API. |

## How does it work?

### How does the language recognition work?

The algorithm is based on trigram language models, which are a particular case of n-grams.
To understand the idea, please check the original paper, [Cavnar and Trenkle '94: N-Gram-Based Text Categorization](https://www.researchgate.net/publication/2375544_N-Gram-Based_Text_Categorization).

### How is `is_reliable` calculated?

It is based on the following factors:

* How many unique trigrams are in the given text.
* How big the difference is between the first and the second (not returned) detected languages. This metric is called `rate` in the code base.

Therefore, it can be presented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas.
This function is a hyperbola and it looks like the following one:

*Figure: hyperbolic threshold curve separating the "Reliable" and "Not reliable" areas.*

For more details, please check the blog article [Introduction to Rust Whatlang Library and Natural Language Identification Algorithms](https://www.greyblake.com/blog/introduction-to-rust-whatlang-library-and-natural-language-identification-algorithms/).

## Make tasks

* `make bench` - run performance benchmarks
* `make doc` - generate and open doc
* `make test` - run tests
* `make watch` - watch changes and run tests

## Comparison with alternatives

|                           | Whatlang   | CLD2        | CLD3           |
| ------------------------- | ---------- | ----------- | -------------- |
| Implementation language   | Rust       | C++         | C++            |
| Languages                 | 69         | 83          | 107            |
| Algorithm                 | trigrams   | quadgrams   | neural network |
| Supported Encoding        | UTF-8      | UTF-8       | ?              |
| HTML support              | no         | yes         | ?              |
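
To make the "trigrams" entry more concrete, here is a minimal sketch of the rank-based comparison described in the Cavnar and Trenkle paper referenced in this README. It is not Whatlang's actual implementation: the function names (`ranked_trigrams`, `out_of_place_distance`, `guess_language`) are hypothetical, the normalization is simplified, and the two toy profiles are built from a single sentence each, whereas real profiles are derived from large corpora. It uses only the Rust standard library.

```rust
use std::collections::HashMap;

/// Counts character trigrams in `text` and returns them ordered from most
/// to least frequent (ties broken alphabetically so the result is stable).
fn ranked_trigrams(text: &str) -> Vec<String> {
    // Lowercase, pad with spaces so word boundaries become part of the
    // trigrams, and drop everything that is not a letter or whitespace.
    let normalized: Vec<char> = format!(" {} ", text.to_lowercase())
        .chars()
        .filter(|c| c.is_alphabetic() || c.is_whitespace())
        .collect();

    let mut counts: HashMap<String, usize> = HashMap::new();
    for window in normalized.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }

    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.into_iter().map(|(trigram, _)| trigram).collect()
}

/// "Out-of-place" distance: for each trigram of the text, add how far its
/// rank is from its rank in the language profile. Trigrams missing from the
/// profile get a fixed maximum penalty (here: the profile length).
fn out_of_place_distance(text_profile: &[String], lang_profile: &[String]) -> usize {
    let max_penalty = lang_profile.len();
    text_profile
        .iter()
        .enumerate()
        .map(|(text_rank, trigram)| {
            match lang_profile.iter().position(|t| t == trigram) {
                Some(lang_rank) => text_rank.abs_diff(lang_rank),
                None => max_penalty,
            }
        })
        .sum()
}

/// Returns the label of the profile with the smallest distance to `text`.
fn guess_language<'a>(text: &str, profiles: &'a [(&'a str, Vec<String>)]) -> Option<&'a str> {
    let text_profile = ranked_trigrams(text);
    profiles
        .iter()
        .min_by_key(|(_, profile)| out_of_place_distance(&text_profile, profile))
        .map(|(label, _)| *label)
}

fn main() {
    // Toy profiles for illustration only.
    let profiles = vec![
        ("eng", ranked_trigrams("the quick brown fox jumps over the lazy dog and the cat")),
        ("deu", ranked_trigrams("der schnelle braune fuchs springt über den faulen hund und die katze")),
    ];

    let guess = guess_language("the dog and the fox", &profiles);
    println!("{:?}", guess); // prints Some("eng")
}
```

The gap between the best and the second-best distance in a scheme like this is the kind of signal that the `rate` metric mentioned above captures; how Whatlang actually computes it is documented in its source and the linked blog article.
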

## Ports and clones

* [whatlang-ffi](https://github.com/greyblake/whatlang-ffi) - C bindings
* [whatlanggo](https://github.com/abadojack/whatlanggo) - whatlang clone for Go language
* [whatlang-py](https://github.com/cathalgarvey/whatlang-py) - bindings for Python
* [whatlang-rb](https://gitlab.com/KitaitiMakoto/whatlang-rb) - bindings for Ruby
* [whatlangex](https://github.com/pierrelegall/whatlangex) - bindings for Elixir

## Donations

You can support the project by donating [NEAR tokens](https://near.org).

Our NEAR wallet address is `whatlang.near`.

## Derivation

**Whatlang** is a derivative work from [Franc](https://github.com/wooorm/franc) (JavaScript, MIT) by [Titus Wormer](https://github.com/wooorm).

## License

[MIT](https://github.com/greyblake/whatlang-rs/blob/master/LICENSE) © [Sergey Potapov](http://greyblake.com/)

## Contributors

- [greyblake](https://github.com/greyblake) Potapov Sergey - creator, maintainer.
- [Dr-Emann](https://github.com/Dr-Emann) Zachary Dremann - optimization and improvements
- [BaptisteGelez](https://github.com/BaptisteGelez) Baptiste Gelez - improvements
- [Vishesh Chopra](https://github.com/KarmicKonquest) - designed the logo
- [Joel Natividad](https://github.com/jqnatividad) - support of Tagalog
- [ManyTheFish](https://github.com/ManyTheFish) - crazy optimization
- [Kerollmops](https://github.com/Kerollmops) Clément Renault - crazy optimization