rth / vtext

Simple NLP in Rust with Python bindings
Apache License 2.0

Standardize language option #79

Open rth opened 4 years ago

rth commented 4 years ago

From https://github.com/rth/vtext/pull/78#issuecomment-644009378 by @joshlk

I think it would be sensible to identify languages consistently throughout the package using ISO 639-1 two-letter codes (e.g. en, fr, de ...).

In particular, we should implement this for the Snowball stemmer in Python, which currently uses full language names.
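
As a rough illustration of what the migration could involve, here is a minimal sketch that maps full Snowball language names to two-letter codes (ISO 639-1). The function name `iso_code_for` is hypothetical and not part of vtext, and the list covers only a handful of Snowball languages.

```rust
/// Hypothetical helper, not part of vtext: map a full Snowball language
/// name (case-insensitive) to its ISO 639-1 two-letter code.
/// Only a subset of the Snowball languages is listed here.
pub fn iso_code_for(name: &str) -> Option<&'static str> {
    match name.to_ascii_lowercase().as_str() {
        "danish" => Some("da"),
        "dutch" => Some("nl"),
        "english" => Some("en"),
        "french" => Some("fr"),
        "german" => Some("de"),
        "italian" => Some("it"),
        "spanish" => Some("es"),
        "swedish" => Some("sv"),
        // Unknown names are reported as None rather than guessed.
        _ => None,
    }
}
```

A table like this would also let the old full names keep working for a deprecation period while the two-letter codes become the documented option.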

I am also wondering whether, in Rust, we should use a String for the language parameter or define an enum, e.g.

use vtext::lang;

let stemmer = SnowballStemmerParams::default().lang(lang::en).build();

The latter is probably simpler, but it makes the package a bit harder to extend: if someone designs a custom estimator for a language not in the list (e.g. some ancient, rarely used language), they would have to create a new enum.
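
One possible middle ground, sketched here only as an assumption and not as the actual vtext API, is an enum with a catch-all variant that carries an arbitrary code. The names `Lang`, `Other` and `code` are hypothetical.

```rust
/// Hypothetical language identifier; not the actual vtext API.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Lang {
    En,
    Fr,
    De,
    /// Escape hatch for languages that are not in the built-in list.
    Other(String),
}

impl Lang {
    /// Return the code associated with this language.
    pub fn code(&self) -> &str {
        match self {
            Lang::En => "en",
            Lang::Fr => "fr",
            Lang::De => "de",
            Lang::Other(code) => code.as_str(),
        }
    }
}

impl From<&str> for Lang {
    /// Known codes map to dedicated variants; anything else is kept as-is.
    fn from(code: &str) -> Self {
        match code {
            "en" => Lang::En,
            "fr" => Lang::Fr,
            "de" => Lang::De,
            other => Lang::Other(other.to_string()),
        }
    }
}
```

With this shape, `Lang::from("en")` gives the built-in variant, while a custom estimator could accept something like `Lang::from("la")` without a new enum. The trade-off is that the compiler can no longer catch a mistyped code.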

Also, just to be consistent, the parameter name would be "lang", not "language", right?