I think it would be sensible to identify languages throughout the package using ISO 639-1 two-letter codes (e.g. en, fr, de ...).
In particular, we should implement this for the Snowball stemmer in Python, which currently uses the full language names.
I am also wondering whether, in Rust, we should use `String` for the language parameter or define an enum, e.g.
```rust
use vtext::lang;
let stemmer = SnowballStemmerParams::default().lang(lang::en).build();
```
The latter is probably simpler, but it makes things a bit harder to extend: if someone designs a custom estimator for a language not in the list (e.g. some ancient, rarely used language), they would have to create a new enum.
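One possible middle ground, sketched below, is an enum with a catch-all variant that carries an arbitrary code. This is only an illustration of the trade-off: the `Lang` type, its variants, and the `code` method are hypothetical names and are not part of the current vtext API.

```rust
use std::str::FromStr;

/// Hypothetical language identifier keyed on ISO 639-1 codes.
/// Not part of vtext; shown only to illustrate the design option.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Lang {
    En,
    Fr,
    De,
    /// Escape hatch for languages that are not in the built-in list.
    Other(String),
}

impl Lang {
    /// Return the language code as a string slice.
    pub fn code(&self) -> &str {
        match self {
            Lang::En => "en",
            Lang::Fr => "fr",
            Lang::De => "de",
            Lang::Other(code) => code.as_str(),
        }
    }
}

impl FromStr for Lang {
    type Err = String;

    /// Parse a code, ignoring surrounding whitespace and letter case.
    /// Unknown codes are kept in `Lang::Other` instead of being rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let code = s.trim().to_lowercase();
        match code.as_str() {
            "" => Err("empty language code".to_string()),
            "en" => Ok(Lang::En),
            "fr" => Ok(Lang::Fr),
            "de" => Ok(Lang::De),
            other => Ok(Lang::Other(other.to_string())),
        }
    }
}
```

With something like this, built-in languages still get compile-time checking (`Lang::En`), while a custom estimator could accept `"la".parse::<Lang>()` without a new enum. The cost is that estimators have to handle the `Other` case at runtime.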
Also, just to be consistent, the parameter name would be "lang", not "language", right?
From https://github.com/rth/vtext/pull/78#issuecomment-644009378 by @joshlk